Hi there,
I’m writing this first post because I’m going crazy trying to analyze a rather complicated data set.
I’ve been dealing with data mining problems for several years, but mostly with classification and association problems. In this case I have to develop a multivariate regression model to explain how investments in marketing, communication and HR training affect the commercial output (e.g. sales) of different kinds of products. The purpose of the model is of course explanatory, but also predictive, as the final result must be a sort of simulation dashboard (useful for planning investments in future campaigns).
Structure of the data: longitudinal, time-series cross-sectional weekly observations (3 years long) for 5 different product categories, so I have about 750 observations. Dependent variable: product category sales; independent variables: about 200 potential predictors.
OK, nothing particularly strange so far, but:
- Sales differ a LOT between product categories: for example, one category may sell thousands of units per week for 40 weeks (and zero for the rest of the period), while another sells hundreds of units per week, but over 100 weeks. So the dependent variable contains many zeros over time, and how many depends on the product category. It’s very inhomogeneous!
- The predictors are not homogeneous either: some quantities differ from category to category (e.g. investments in Paid Search on the internet), while others vary over time but are the same across product categories (e.g. customer satisfaction index). There are also a lot of null values in the predictors.
- Most of the predictors are strongly correlated with each other.
- Within each category, the dependent variable is autocorrelated.
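To make the points above a bit more concrete, here is a minimal sketch in plain Python of how they could be quantified per category: the share of zero weeks, the lag-1 autocorrelation of sales, and the correlation between two predictors. Everything in it is an assumption for illustration: the data is synthetic, and the names (`sales`, `zero_share`, `lag1_autocorr`, `pearson`) are hypothetical, not taken from my real data set. The lag-1 autocorrelation is approximated as the Pearson correlation between the series and its one-week shift.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (nan if one is constant)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / (sxx * syy) ** 0.5


def lag1_autocorr(series):
    """Approximate lag-1 autocorrelation: correlate the series with its 1-step shift."""
    return pearson(series[:-1], series[1:])


def zero_share(series):
    """Fraction of observations that are exactly zero."""
    return sum(1 for v in series if v == 0) / len(series)


# Synthetic stand-in for the real panel (assumption): 156 weeks, two categories.
random.seed(0)
n_weeks = 156
sales = {
    # thousands of units for 40 weeks, zero otherwise
    "A": [random.randint(1000, 5000) for _ in range(40)] + [0] * (n_weeks - 40),
    # hundreds of units for 100 weeks, zero otherwise
    "B": [random.randint(100, 900) for _ in range(100)] + [0] * (n_weeks - 100),
}

# Two made-up predictors that move together, to mimic strong collinearity.
tv_spend = [random.uniform(0, 100) for _ in range(n_weeks)]
radio_spend = [0.8 * v + random.uniform(-5, 5) for v in tv_spend]

for category, series in sales.items():
    print(category,
          "zero share:", round(zero_share(series), 3),
          "lag-1 autocorr:", round(lag1_autocorr(series), 3))

print("corr(tv_spend, radio_spend):", round(pearson(tv_spend, radio_spend), 3))
```

Running the same three checks on the real series for each category (and on every pair of predictors) would at least show how severe each problem is before choosing a model.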
OK, this is the setting. I tried a lot of models (ARIMAX, GLM, pooled regression, ...), but none of them worked well: poor fit and results that are hard to interpret. I need some good tips. Any ideas?
Thanks for your help!
Milo