Sorry for the cross-post; I have also asked this in the kdnuggets.com forum.
I am currently planning a data mining project for a university course. My idea is to predict the success of music bands from their history of album releases. In this case I define "success" not as Billboard chart success but as how long the band stays in business, i.e. one-hit wonders vs. bands that release several albums over at least about 5 years.
Some more detailed questions could be "How important is the interval between a band's first and second album?" or "Does a band need to stay with one music label over several albums?" etc.
My problem so far is how to model this data for further processing. My first approach to a feature vector for each band is:
Name, Begin (year of first album), End (year of last album), Country, Number of albums, Average interval between two albums
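To make the band-level vector concrete, here is a minimal sketch of how it could be derived from a list of albums. The names `Album` and `band_features`, the dictionary keys, and the example band are all made up for illustration; only the fields themselves come from the list above.

```python
from dataclasses import dataclass


@dataclass
class Album:
    year: int  # year of release


def band_features(name, country, albums):
    """Build the band-level feature vector from a non-empty list of albums."""
    years = sorted(album.year for album in albums)
    # Gaps between consecutive releases, e.g. [1999, 2001, 2006] -> [2, 5]
    gaps = [later - earlier for earlier, later in zip(years, years[1:])]
    return {
        "name": name,
        "begin": years[0],
        "end": years[-1],
        "country": country,
        "n_albums": len(years),
        # A band with a single album has no interval at all
        "avg_interval": sum(gaps) / len(gaps) if gaps else None,
    }


# Hypothetical band, not real data
example = band_features("Hypothetical Band", "DE",
                        [Album(1999), Album(2001), Album(2006)])
```

Note that "End" and "Number of albums" are very close to the thing being predicted, so they probably belong on the target side rather than among the input features.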
But now I have no clue how to add the information about the band's individual albums. For each album I would like to provide features like
- year of release
- season of this year
- country of the first release
- music label
- and maybe some more information
If I simply append these columns for each album in sequence, the feature vector will have a different length for each band. Maybe it would be sufficient to fill the columns of the "short vectors" (i.e. bands that haven't released many albums) with a defined "blank symbol"?
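The padding idea could look roughly like this. It is only a sketch: the blank symbol `"?"`, the function name `flatten_albums`, the column naming scheme and the example album are assumptions, not a fixed convention.

```python
BLANK = "?"  # assumed blank symbol for albums that do not exist
ALBUM_FIELDS = ("year", "season", "country", "label")


def flatten_albums(albums, max_albums):
    """Turn a variable-length album list into a fixed-length row.

    Albums beyond max_albums are dropped; missing albums are filled
    with BLANK so every band ends up with the same columns.
    """
    row = {}
    for i in range(max_albums):
        album = albums[i] if i < len(albums) else {}
        for field in ALBUM_FIELDS:
            row[f"album{i + 1}_{field}"] = album.get(field, BLANK)
    return row


# Hypothetical one-hit wonder, padded to three album slots
one_hit = [{"year": 2003, "season": "summer", "country": "US", "label": "Label A"}]
row = flatten_albums(one_hit, max_albums=3)
```

One thing to keep in mind with this layout: the number of non-blank album slots directly tells the learner how many albums a band released, which is close to the target itself. Restricting the columns to the first two or three albums might be a way around that.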
My idea was to build a decision tree that indicates which features predict that a band will stay in the business over a longer period.
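For context, a decision tree learner chooses at each node the feature whose split best separates the classes; information gain is one common criterion for that (used by ID3, for example). The sketch below only computes that criterion for categorical features and does not build a full tree. The feature names `same_label` and `fast_second_album`, the target `long_lived`, and the four rows are invented purely for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, feature, target):
    """Entropy reduction from splitting the rows on one categorical feature."""
    labels = [r[target] for r in rows]
    groups = {}
    for r in rows:
        groups.setdefault(r[feature], []).append(r[target])
    # Weighted average entropy of the subsets after the split
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Made-up toy rows, not real band data
rows = [
    {"same_label": "yes", "fast_second_album": "yes", "long_lived": "yes"},
    {"same_label": "yes", "fast_second_album": "no", "long_lived": "yes"},
    {"same_label": "no", "fast_second_album": "yes", "long_lived": "no"},
    {"same_label": "no", "fast_second_album": "no", "long_lived": "no"},
]
gains = {f: information_gain(rows, f, "long_lived")
         for f in ("same_label", "fast_second_album")}
```

In this toy data `same_label` separates the two classes perfectly and `fast_second_album` not at all, so a tree would split on `same_label` first. Ranking the real features this way would be a first hint at which ones matter.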
Maybe somebody has an idea or some advice on how I can represent the data for my project. I hope my problem with the data has become clear.
Big thanks in advance!