The Data Mining Forum
IMPORTANT: This is the old Data Mining forum.
I keep it online so that you can read the old messages.

Please post your new messages in the new forum: https://forum2.philippe-fournier-viger.com/index.php
 
Reg Association Rules
Posted by: anu
Date: September 20, 2013 05:48AM

I am getting a Java heap size error when running "Example 23 : Mining All Association Rules with the Lift Measure" with a large dataset (around a GB).

I have tried increasing the heap size in Eclipse, but I still get this error.
Also, I am getting wrong lift values when running this code. Please recheck it.

Could anyone please help me resolve these errors?

Thanks in advance

Re: Reg Association Rules
Date: September 20, 2013 06:10AM

For the memory usage, it depends on:
- the parameters that you use (minsup, minconf, minlift). The lower you set these parameters, the more association rules are found (the number can grow exponentially), and the more likely you are to run out of memory. Usually, it is recommended to start with high values and then lower them gradually.
- the characteristics of your dataset. For example, if you have a very dense dataset where all transactions (lines) are very long and almost exactly the same, there may be a huge number of association rules, and the algorithm may not terminate or may run out of memory.
- the size of the dataset. A gigabyte is pretty huge for this kind of algorithm. Perhaps you could use some sampling or filtering to make the dataset smaller or to remove useless information. This would make the algorithm faster and use less memory.
- Another possibility, if you are a programmer, is to add some constraints to the algorithm, such as not mining rules with more than X items. This would reduce the search space greatly and make the algorithm faster and use less memory.

For the lift computation,
- Can you provide a simple example that shows that the lift is incorrect?
- One reason why the lift may be incorrect is that the input format is not correct. For example, in the input format, an item is only allowed to appear once per line, and on each line the items should always be sorted according to the same order. If this is not the case, the algorithm may not calculate the measures correctly or may not generate all the rules.
- Another point is that I have fixed an error in the lift calculation before. If you are using a version of SPMF older than v0.92, there was a problem with the lift, and you should update to the latest version.

Best,

Philippe



Edited 1 time(s). Last edit at 09/20/2013 06:11AM by webmasterphilfv.

Re: Reg Association Rules
Posted by: anu
Date: September 21, 2013 04:03AM

Thank you for the reply.

I have set minsup and minconf to 0.9, which is a high value. But yes, one of my datasets is dense and its transactions are long, so maybe that is why the algorithm runs out of memory. In that case, could you please tell me what I need to do?

Also, for the lift computation, I have read in a research paper that the range of the lift is between 0 and 1. But in my dataset, I am getting lift values greater than 1 for some rules.

My dataset is in this format (these are just some of the transactions):

fqt ilt cpr cqt iku ilt ilt ilt ilt ilt xxyyzz FPY FPU FPW COW DQW BOW FPU GGHHII GGHHII GGHHII GGHHII
alt alt ikt ikt ilt alt ilt ikt ilt ilt alt DMY DOX COU BOW COX GGHHII GGHHII GGHHII GGHHII GGHHII GGHHII
cpr alt ilt dpt ikt imu ikr xxyyzz xxyyzz xxyyzz xxyyzz FPU FRX FQX DPW DOW FRW GGHHII GGHHII GGHHII GGHHII GGHHII
imt cou bnt clr imt imt ikt xxyyzz xxyyzz xxyyzz xxyyzz FQU FSW BNW FRX FRW EOW FRW GGHHII GGHHII GGHHII GGHHII
cou bmt bnr bnt ikt bkr imt ilt ilt ilt xxyyzz FPX FRU FPU COU BMW FRU COW GGHHII GGHHII GGHHII GGHHII

And here is some of the output for this dataset (I am getting many rules like these with a lift value greater than 1):

b1k2q3a4 ==> B1K2Q3A4 sup= 0.5338028169014084 % conf= 0.5338028169014084 lift= 7.518349533822653E-4
==> B1K2Q3A4,b1k2q3a4 sup= 0.5338028169014084 % conf= Infinity lift= Infinity
A1K2Q3A4 ==> B1K2Q3A4 sup= 0.5338028169014084 % conf= 0.5338028169014084 lift= 7.518349533822653E-4
B1K2Q3A4,B1K2Q3A4 ==> A1K2Q3A4 sup= 0.5338028169014084 % conf= 1.0 lift= 0.0014084507042253522
==> A1K2Q3A4,B1K2Q3A4 sup= 0.5338028169014084 % conf= Infinity lift= Infinity
a1k2q3a4 ==> B1K2Q3A4 sup= 0.5338028169014084 % conf= 0.5338028169014084 lift= 7.518349533822653E-4


Please correct me if I am doing anything wrong.
Thanks

Re: Reg Association Rules
Date: September 21, 2013 06:15AM

For the lift

In association rule mining, the range of values for the lift is 0 to +infinity.

A value less than 1 means that there is a negative relationship.

A value higher than 1 means that there is a positive relationship.

A value of 1 means that there is no relationship.

If you are curious, you can see the definition of the lift and other measures in the book by Kumar (chapter 6 in the linked PDF, especially p.47): http://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf
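
To make the three cases above concrete, here is a small sketch of the usual lift formula, lift(X ==> Y) = sup(X u Y) / (sup(X) * sup(Y)), computed from transaction counts. This is only an illustration and not the SPMF code; the class name, the method name and the numbers are made up for the example.

```java
// Hypothetical helper for illustration only; this is NOT the SPMF implementation.
class LiftExample {

    /**
     * Computes lift(X ==> Y) = sup(X u Y) / (sup(X) * sup(Y)),
     * where each support is a count divided by the number of transactions.
     */
    static double lift(int countXY, int countX, int countY, int transactionCount) {
        if (countX <= 0 || countY <= 0 || transactionCount <= 0) {
            throw new IllegalArgumentException("counts must be positive");
        }
        double supXY = (double) countXY / transactionCount;
        double supX = (double) countX / transactionCount;
        double supY = (double) countY / transactionCount;
        return supXY / (supX * supY);
    }

    public static void main(String[] args) {
        // Made-up data: 100 transactions, X appears in 40 of them, Y in 50.
        System.out.println(lift(20, 40, 50, 100)); // X and Y together as often as expected by chance
        System.out.println(lift(30, 40, 50, 100)); // together more often than expected: lift above 1
        System.out.println(lift(10, 40, 50, 100)); // together less often than expected: lift below 1
    }
}
```

Since the supports are never negative, the result is never negative either, but nothing prevents it from being greater than 1.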

About the input file

I see two problems. First, on each line there should not be duplicate items.
Second, each line should be sorted. I see that this is not the case in your file. For example, in these two lines:

imt cou bnt clr imt imt ikt xxyyzz xxyyzz xxyyzz xxyyzz FQU FSW BNW FRX FRW EOW FRW GGHHII GGHHII GGHHII GGHHII
cou bmt bnr bnt ikt bkr imt ilt ilt ilt xxyyzz FPX FRU FPU COU BMW FRU COW GGHHII GGHHII GGHHII GGHHII

imt appears before cou in the first line, but cou appears before imt in the second line. That means that your lines are not sorted. This can cause problems because association rule and pattern mining algorithms generally assume that the lines are sorted, as an optimization. If they are not sorted, the algorithms may not generate the correct result (they may miss some patterns, for example).

This problem could be fixed by modifying the method for reading the file so that it sorts your lines and removes duplicate items. I don't have the code in front of me now, but there should be a method called loadFile() that is responsible for reading the input file.
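
As a rough idea of what such a fix could look like, here is a minimal sketch that sorts the items of one line and removes the duplicates. It is a standalone preprocessing helper, not the actual loadFile() method of SPMF; the class and method names are hypothetical, and it assumes that items are separated by whitespace and that any fixed total order (here the natural String order, which is case-sensitive) is acceptable.

```java
import java.util.TreeSet;

// Hypothetical preprocessing helper; this is NOT the loadFile() method of SPMF.
class LineNormalizer {

    /** Returns the items of a line sorted in natural String order, without duplicates. */
    static String normalize(String line) {
        // A TreeSet keeps its elements sorted and silently ignores duplicates.
        TreeSet<String> items = new TreeSet<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                items.add(token);
            }
        }
        return String.join(" ", items);
    }

    public static void main(String[] args) {
        // Beginning of one of the lines quoted above.
        System.out.println(normalize("imt cou bnt clr imt imt ikt"));
        // prints "bnt clr cou ikt imt"
    }
}
```

The same method would have to be applied to every line, so that all lines follow the same order.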

But fixing this may not have a great impact on memory usage. A solution may be to sample your dataset (just use the first 10 000 lines, for example) or to remove some items that you know are not useful in your data and for which you are not interested in finding patterns. Those are some ideas that could be tried.
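
Both ideas can be sketched in a few lines. The helper below keeps at most a given number of transactions and drops the items listed as unwanted. It is only an illustration under assumptions: the names are hypothetical, items are assumed to be separated by whitespace, and a line that becomes empty after filtering is skipped and does not count toward the limit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Hypothetical helper for shrinking a dataset before mining; not part of SPMF.
class DatasetReducer {

    /**
     * Reads lines until maxLines non-empty transactions have been kept,
     * removing every item contained in the unwanted set.
     */
    static List<String> reduce(Iterator<String> lines, int maxLines, Set<String> unwanted) {
        List<String> kept = new ArrayList<>();
        while (kept.size() < maxLines && lines.hasNext()) {
            StringBuilder filtered = new StringBuilder();
            for (String item : lines.next().trim().split("\\s+")) {
                if (item.isEmpty() || unwanted.contains(item)) {
                    continue; // drop blanks and items we do not want patterns for
                }
                if (filtered.length() > 0) {
                    filtered.append(' ');
                }
                filtered.append(item);
            }
            if (filtered.length() > 0) {
                kept.add(filtered.toString());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Small made-up input; the item to drop is chosen only for the example.
        List<String> input = Arrays.asList("fqt ilt GGHHII", "GGHHII", "alt ikt", "cpr dpt");
        Set<String> unwanted = new HashSet<>(Arrays.asList("GGHHII"));
        System.out.println(reduce(input.iterator(), 2, unwanted));
        // prints [fqt ilt, alt ikt]
    }
}
```

For a real file, the iterator could come from a BufferedReader, for example reader.lines().iterator(), so that the whole file is never loaded in memory at once.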

Philippe


