Distinguishing two datasets
Posted by: Anton
Date: September 27, 2013 11:57AM

I have two datasets from a web store (like Amazon). Both datasets have the same structure. Each record has the following attributes:

ID - user ID
ProductCnt - count of products the user bought
DepartmentCnt - count of departments the user shopped in
PayTotal - sum of payments made by the user
PayCnt - count of payments
MaxPay - maximum payment

The first dataset is a collection of records for randomly selected users. The second dataset is a collection of records only for users who also clicked on a particular advertisement located on the same page as the product they were shopping for.

Problem: Find dependencies that distinguish users in the second dataset from users in the first dataset.

To solve this I would calculate summary statistics such as the mean (expected value) and standard deviation for every attribute in both datasets and compare them.
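
A minimal sketch of that comparison, assuming each record is a Python dict keyed by the attribute names above. The sample records and the helper names (summarize, standardized_difference) are made up for illustration. The standardized difference used here is the difference of means divided by the pooled standard deviation (often called Cohen's d), which is one possible way to put attributes with different scales on a common footing.

```python
from math import sqrt
from statistics import mean, stdev

ATTRIBUTES = ["ProductCnt", "DepartmentCnt", "PayTotal", "PayCnt", "MaxPay"]

def summarize(records, attribute):
    """Return (mean, sample standard deviation) of one attribute."""
    values = [r[attribute] for r in records]
    return mean(values), stdev(values)

def standardized_difference(first, second, attribute):
    """Difference of means divided by the pooled standard deviation."""
    m1, s1 = summarize(first, attribute)
    m2, s2 = summarize(second, attribute)
    n1, n2 = len(first), len(second)
    pooled = sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))
    return (m2 - m1) / pooled if pooled > 0 else 0.0

# Hypothetical records, for illustration only.
random_users = [
    {"ID": 1, "ProductCnt": 2, "DepartmentCnt": 1, "PayTotal": 40.0, "PayCnt": 2, "MaxPay": 25.0},
    {"ID": 2, "ProductCnt": 3, "DepartmentCnt": 2, "PayTotal": 60.0, "PayCnt": 3, "MaxPay": 30.0},
    {"ID": 3, "ProductCnt": 1, "DepartmentCnt": 1, "PayTotal": 20.0, "PayCnt": 1, "MaxPay": 20.0},
    {"ID": 4, "ProductCnt": 2, "DepartmentCnt": 2, "PayTotal": 40.0, "PayCnt": 2, "MaxPay": 25.0},
]
ad_clickers = [
    {"ID": 5, "ProductCnt": 6, "DepartmentCnt": 3, "PayTotal": 200.0, "PayCnt": 4, "MaxPay": 90.0},
    {"ID": 6, "ProductCnt": 8, "DepartmentCnt": 4, "PayTotal": 260.0, "PayCnt": 5, "MaxPay": 110.0},
    {"ID": 7, "ProductCnt": 5, "DepartmentCnt": 3, "PayTotal": 180.0, "PayCnt": 3, "MaxPay": 80.0},
    {"ID": 8, "ProductCnt": 7, "DepartmentCnt": 4, "PayTotal": 240.0, "PayCnt": 4, "MaxPay": 100.0},
]

# Rank attributes by how strongly the two datasets differ on them.
differences = {a: standardized_difference(random_users, ad_clickers, a) for a in ATTRIBUTES}
for attribute, d in sorted(differences.items(), key=lambda kv: -abs(kv[1])):
    print(f"{attribute}: standardized difference = {d:.2f}")
```

Note that this only compares one attribute at a time, so it cannot reveal dependencies between attributes.
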
Any other ideas on how to find characteristic features distinguishing these datasets?
Thanks!

Re: Distinguishing two datasets
Posted by: Anton
Date: September 28, 2013 08:48AM

Any other ideas on how to find characteristic features distinguishing these datasets?

I am new to this kind of problem, so please bear with me, and also let me know if my question makes no sense at all! Thanks!

Re: Distinguishing two datasets
Date: September 28, 2013 10:57AM

Hi,

I looked at the attributes, and I'm afraid that you may not have enough information to distinguish between the two types of users.

If you had additional information about the users, such as their age, the kind of products that they bought, the type of advertisement, etc., you could probably make a better distinction.

That is my opinion.

Philippe

Re: Distinguishing two datasets
Posted by: anton
Date: September 28, 2013 12:21PM

Let's say I have these additional attributes about the users (age, the kind of products that they bought, the type of advertisement, etc.).
What method should I use to answer the question "What feature inter-dependencies distinguish users in the second table from users in the first one?"

Re: Distinguishing two datasets
Date: September 28, 2013 12:25PM

You can treat this as a classification problem: you have training data, and you need to classify new customers according to whether they will click or not.

There exist several classification algorithms, such as:
- decision trees,
- neural networks,
- naive bayes classifier,
- SVM,
- ... etc.
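
As an illustration of this framing, here is a rough sketch of one algorithm from the list, a Gaussian naive Bayes classifier, written in plain Python. It assumes each customer is a list of numeric values (here ProductCnt and PayTotal), that records are labeled by the dataset they come from, and that each attribute is roughly normally distributed within a class. The sample values and the function names (fit_gaussian_nb, predict) are invented for the example.

```python
from math import log, pi
from statistics import mean, pvariance

def fit_gaussian_nb(rows, labels):
    """Estimate a prior and per-attribute mean/variance for every class."""
    model = {}
    for cls in set(labels):
        cls_rows = [r for r, y in zip(rows, labels) if y == cls]
        columns = list(zip(*cls_rows))
        model[cls] = {
            "prior": len(cls_rows) / len(rows),
            "means": [mean(c) for c in columns],
            # The small constant avoids division by zero for constant attributes.
            "variances": [pvariance(c) + 1e-9 for c in columns],
        }
    return model

def predict(model, row):
    """Return the class with the highest log posterior for one new record."""
    def log_posterior(params):
        total = log(params["prior"])
        for x, mu, var in zip(row, params["means"], params["variances"]):
            total += -0.5 * log(2 * pi * var) - (x - mu) ** 2 / (2 * var)
        return total
    return max(model, key=lambda cls: log_posterior(model[cls]))

# Hypothetical training data: [ProductCnt, PayTotal] per user.
rows = [[2, 40], [3, 60], [1, 20], [2, 40],        # from the first dataset
        [6, 200], [8, 260], [5, 180], [7, 240]]    # from the second dataset
labels = ["dataset1"] * 4 + ["dataset2"] * 4

model = fit_gaussian_nb(rows, labels)
print(predict(model, [7, 230]))  # a new customer who resembles the second dataset
print(predict(model, [2, 35]))   # a new customer who resembles the first dataset
```

Because the first dataset is a random sample, it may itself contain some users who clicked, so the "dataset1" label is only an approximation of "did not click".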

One way to do it would be, for example, to generate a decision tree from your data.

This is just a general idea. You could perhaps find some better ideas that would be more suitable for your data.

Philippe

Re: Distinguishing two datasets
Posted by: anton
Date: September 28, 2013 01:11PM

Unfortunately, a classifier (no matter what classification algorithm one uses) does not answer the question "What feature inter-dependencies distinguish records in the second dataset from records in the first one?"

Re: Distinguishing two datasets
Date: September 28, 2013 04:56PM

I don't agree.

For example, a decision tree algorithm will build a tree where each node that is not a leaf corresponds to an attribute and each sub-branch corresponds to a value of that attribute. Each leaf node will correspond either to the first dataset or to the second dataset (to do that, you just need to use the value "dataset1" as the class label for instances from dataset1 and "dataset2" as the class label for instances from dataset2 when building the tree).
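
To make this concrete, here is a small sketch of such a tree in plain Python. It is a simplified illustration, not any particular published algorithm: it assumes numeric attributes, so each branch tests a threshold rather than a single attribute value, and it picks splits by Gini impurity. The sample records and the function names (gini, best_split, build_tree, rules) are made up for the example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (impurity, attribute, threshold) with the lowest weighted impurity."""
    best = None
    for attr in rows[0]:
        values = sorted({r[attr] for r in rows})
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[attr] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[attr] > threshold]
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or impurity < best[0]:
                best = (impurity, attr, threshold)
    return best

def build_tree(rows, labels, max_depth=3):
    """Internal nodes are dicts; leaves are class labels ("dataset1"/"dataset2")."""
    majority = Counter(labels).most_common(1)[0][0]
    if max_depth == 0 or len(set(labels)) == 1:
        return majority
    split = best_split(rows, labels)
    if split is None:  # no attribute has two distinct values left
        return majority
    _, attr, threshold = split
    left = [(r, y) for r, y in zip(rows, labels) if r[attr] <= threshold]
    right = [(r, y) for r, y in zip(rows, labels) if r[attr] > threshold]
    return {
        "attr": attr,
        "threshold": threshold,
        "left": build_tree([r for r, _ in left], [y for _, y in left], max_depth - 1),
        "right": build_tree([r for r, _ in right], [y for _, y in right], max_depth - 1),
    }

def rules(tree, conditions=()):
    """List every root-to-leaf path as a human-readable rule."""
    if not isinstance(tree, dict):
        return [(" and ".join(conditions) or "always") + " -> " + tree]
    a, t = tree["attr"], tree["threshold"]
    return (rules(tree["left"], conditions + (f"{a} <= {t}",))
            + rules(tree["right"], conditions + (f"{a} > {t}",)))

# Hypothetical records (ID left out, since it carries no information).
dataset1 = [{"ProductCnt": 2, "PayTotal": 40}, {"ProductCnt": 3, "PayTotal": 60},
            {"ProductCnt": 1, "PayTotal": 20}, {"ProductCnt": 2, "PayTotal": 40}]
dataset2 = [{"ProductCnt": 6, "PayTotal": 200}, {"ProductCnt": 8, "PayTotal": 260},
            {"ProductCnt": 5, "PayTotal": 180}, {"ProductCnt": 7, "PayTotal": 240}]

rows = dataset1 + dataset2
labels = ["dataset1"] * len(dataset1) + ["dataset2"] * len(dataset2)
tree = build_tree(rows, labels)
for rule in rules(tree):
    print(rule)
```

Each printed rule is a conjunction of conditions along one path, so on real data a path that tests several attributes describes a combination of attribute values that separates the two datasets.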

If you mean that a decision tree does not consider several attributes at the same time, then there are some other classifiers that will find dependencies between values of various attributes. For example, CBA (Classification Based on Associations) finds association rules between various attributes to perform classification. The association rules can be interpreted by humans. For CBA, the class labels would be Dataset1 and Dataset2.
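
The rule-mining idea behind this can be sketched as follows. This is a heavily simplified illustration and not the full CBA algorithm (it does no rule ranking for classification or pruning): it only enumerates rules of the form {attribute values} -> class label and keeps those above a minimum support and confidence. It assumes the numeric attributes have already been discretized into items such as "PayTotal=high". The sample data, thresholds and function name are invented for the example.

```python
from collections import Counter
from itertools import combinations

def class_association_rules(transactions, labels,
                            min_support=0.25, min_confidence=0.8, max_size=2):
    """Mine rules of the form (items) -> class label.

    support    = fraction of all records that contain the items and have the class
    confidence = that count divided by the number of records containing the items
    """
    n = len(transactions)
    itemset_counts = Counter()
    rule_counts = Counter()
    for items, label in zip(transactions, labels):
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(items), size):
                itemset_counts[itemset] += 1
                rule_counts[(itemset, label)] += 1
    found = []
    for (itemset, label), count in rule_counts.items():
        support = count / n
        confidence = count / itemset_counts[itemset]
        if support >= min_support and confidence >= min_confidence:
            found.append((itemset, label, support, confidence))
    # Highest confidence first, then highest support.
    return sorted(found, key=lambda r: (-r[3], -r[2], r[0]))

# Hypothetical discretized records.
transactions = [
    {"PayTotal=low", "DepartmentCnt=few"},     # dataset1
    {"PayTotal=low", "DepartmentCnt=many"},
    {"PayTotal=high", "DepartmentCnt=few"},
    {"PayTotal=low", "DepartmentCnt=few"},
    {"PayTotal=high", "DepartmentCnt=many"},   # dataset2
    {"PayTotal=high", "DepartmentCnt=many"},
    {"PayTotal=high", "DepartmentCnt=many"},
    {"PayTotal=low", "DepartmentCnt=many"},
]
labels = ["dataset1"] * 4 + ["dataset2"] * 4

found_rules = class_association_rules(transactions, labels)
for itemset, label, support, confidence in found_rules:
    print(f"{' and '.join(itemset)} -> {label} "
          f"(support={support:.2f}, confidence={confidence:.2f})")
```

In this made-up data, "PayTotal=high" alone does not reach the confidence threshold, but the combination of "DepartmentCnt=many" and "PayTotal=high" does, which is the kind of dependency between attributes that the question asks about.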

A neural network will also capture dependencies between various attributes (in the weights of its nodes) and could predict "dataset1" or "dataset2", but it cannot be easily interpreted by humans.



Edited 3 time(s). Last edit at 09/28/2013 04:59PM by webmasterphilfv.
