The Data Mining Forum                             open-source data mining software data mining conferences Data Science for Social and Behavioral Analytics DSSBA 2022 data science journal
IMPORTANT: This is the old Data Mining forum.
I keep it online so that you can read the old messages.

Please post your new messages in the new forum: https://forum2.philippe-fournier-viger.com/index.php
 
SPMF 0.89 - CFPGrowth algorithm for mining itemsets with multiple support threshold
Date: August 26, 2012 08:37PM

Hi everyone,

This is a quick message to let you know that a new version of SPMF has been released. It is version 0.89 and it offers an implementation of the CFPGrowth algorithm provided by Azadeh Soltani.

SPMF now offers two algorithms for mining itemsets with multiple minimum supports: MSApriori and CFPGrowth.

Best,
Philippe

Options: ReplyQuote
Re: SPMF 0.89 - CFPGrowth algorithm for mining itemsets with multiple support threshold
Posted by: Manperta Negara
Date: March 03, 2014 10:32PM

how can i generate ass.rule from the cfp-growth algorithm? i mean the source code for ass.rule with cfp-growth, cz i've tried the agrawal but it's not running, any advice please

Options: ReplyQuote
Re: SPMF 0.89 - CFPGrowth algorithm for mining itemsets with multiple support threshold
Date: March 04, 2014 04:10AM

Hello,

This has not been implemented. But it would be possible to do it by modifying the source code a little bit. I will explain this.

The implementation of FPGrowth (ca.pfv.spmf.algorithms.frequentpatterns.fpgrowth.AlgoFPGrowth) can save to a file or save the result to memory in a variable "itemsets". Then, the code for association rule generation use these itemsets into memory to generate the association rules.

For the implementation of CFPGrowth the result is always saved to a file. If you want to use the result to generate association rule with the existing code, the only modification would be to modify CFPGrowth so that it can keep the result into memory just like how it is done in the implementation of FPGrowth. You would need to modify AlgoCFPGrowth.

If it is really important for you and you cannot do it by yourself, I could help to do it. It should not be complicated.

Philipe

Options: ReplyQuote
Re: SPMF 0.89 - CFPGrowth algorithm for mining itemsets with multiple support threshold
Posted by: Manperta Negara
Date: March 04, 2014 04:00PM

oh i see sir, but i can't modify it. Could you please help me, it's for my final project, I am very grateful for your kindness.

Options: ReplyQuote
Re: SPMF 0.89 - CFPGrowth algorithm for mining itemsets with multiple support threshold
Date: March 04, 2014 06:22PM

Hello,

I had some free time tonight so I have done it.

You can download the source code or JAR file again from the download section of the website.

If you are using the source code, you will notice new examples for generating association rules with CFPGrowth

MainTestAllAssociationRules_CFPGrowth_saveToMemory
MainTestAllAssociationRules_CFPGrowth_saveToMemory_withLift
... etc.

If you are using the graphical interface, you will notice new algorithms:

CFPGrowth_association_rules_with_lift
CFPGrowth_association_rules

I will update the documentation on the website to add some examples later.



Edited 1 time(s). Last edit at 03/04/2014 06:23PM by webmasterphilfv.

Options: ReplyQuote
Re: SPMF 0.89 - CFPGrowth algorithm for mining itemsets with multiple support threshold
Posted by: Manperta Negara
Date: March 05, 2014 12:32AM

thank you sir for the spmf update version smiling smiley but i've got another problem, when i run my data transaction.txt with my MIS.txt, the output isn't correct, i've got many rules that some of them doesn't have confidence count, and also some of the rules were duplicate rules

The example of my output.txt below :

126 ==> 133 #SUP: 2 #CONF: ? <----------(doesn't have confidence count)
107 ==> 133 #SUP: 2 #CONF: 0.25
160 54 ==> 162 #SUP: 1 #CONF: 1
160 14 ==> 162 #SUP: 1 #CONF: 0.33333
159 111 ==> 160 #SUP: 1 #CONF: ? <----------(doesn't have confidence count)
159 70 ==> 160 #SUP: 1 #CONF: 0.25
143 114 ==> 160 #SUP: 1 #CONF: 0.5
143 86 ==> 160 #SUP: 1 #CONF: ? <-----------(doesn't have confidence count)
130 29 ==> 160 #SUP: 1 #CONF: 0.33333
130 14 ==> 160 #SUP: 1 #CONF: 0.25
127 21 ==> 160 #SUP: 1 #CONF: 0.2
127 118 ==> 160 #SUP: 1 #CONF: 0.5
125 95 ==> 160 #SUP: 1 #CONF: 0.2
125 69 ==> 160 #SUP: 1 #CONF: 0.25
125 54 ==> 160 #SUP: 1 #CONF: ? <----------(doesn't have confidence count)
111 21 ==> 160 #SUP: 1 #CONF: 0.2
111 134 ==> 160 #SUP: 1 #CONF: 0.33333
111 70 ==> 160 #SUP: 1 #CONF: 0.25
103 54 ==> 160 #SUP: 1 #CONF: ? <-----------(doesn't have confidence count)

{ Below is duplication rules, output.txt has been prune due to space threshold}

73 125 ==> 137 #SUP: 1 #CONF: 1 <-----(duplicate rules)
73 125 ==> 137 #SUP: 1 #CONF: 1 <-----(duplicate rules)

Maybe you can explain about this problem sir, thank you very much.

Options: ReplyQuote
Re: SPMF 0.89 - CFPGrowth algorithm for mining itemsets with multiple support threshold
Date: March 05, 2014 02:49AM

Does your input file has duplicate items in transactions?

Are your transactions ordered?

Maybe you could send me your input file.

I think that the problem may be the input file

Philippe

Options: ReplyQuote
Re: SPMF 0.89 - CFPGrowth algorithm for mining itemsets with multiple support threshold
Posted by: Manperta Negara
Date: March 05, 2014 05:46AM

Yes sir, maybe the problem is my input file, i've send my input file: transaction.txt and MIS.txt in your email, thank you sir.

Options: ReplyQuote
Re: SPMF 0.89 - CFPGrowth algorithm for mining itemsets with multiple support threshold
Date: March 05, 2014 10:58AM

Hi,

The problem was that combining CFPGrowth with the algorithm of Agrawal for association rule generation made some new scenarios that were not expected by the algorithm for generating association rule.

The problem was that if a rule X --> Y was such that X was infrequent according to CFPGrowth than the algorithm assume that the support of X is zero (even if it may not be zero) and thus the confidence was equal to infinity. This problem does not occur when the minimum support is fixed. But when there are multiple minimum supports such as with CFPGrowth, this problem can occur because some itemsets become infrequent although they should not if the support was always the same for all items.. For example, with CFPGrowth, an itemset 126, 127 128 could be frequent although 126 could be infrequent, which would not be ok if there was only a minimum support threshold. But the association rule generation algorithm assumes that if an itemset is frequent, all its subsets should be frequent.

For example, one rule was 126 --> ... and 126 was infrequent according to CFPGrowth. Therefore, when this rule was generated, it had a confidence of zero.

The easiest solution to fix this problem is to remove all such rules where the left side of the rule is infrequent. This is what I have done. I have updated the source code and jar file on the website to implement this solution.

An alternative solution would be to keep these rules and scan the database to calculate the support of their left side, and thus their confidence exactly. But it would be more complicated. Furthermore, in the CFPGrowth paper, the authors does not define how to generate the rules from the itemsets found, So they do not say how to handle this case. For this reason, I have taken the easiest solution which is simply to remove all such rules where the left side is infrequent.

Best,

Philippe



Edited 2 time(s). Last edit at 03/05/2014 11:02AM by webmasterphilfv.

Options: ReplyQuote
Re: SPMF 0.89 - CFPGrowth algorithm for mining itemsets with multiple support threshold
Posted by: Manperta Negara
Date: March 05, 2014 11:32PM

Now I see, sir. I am very happy with your explanation smiling smiley At first I'm wondering whether it affects the results of the calculation of association rules. But i'm finally realised that the infrequent items (ex. 126) indeed shouldn't be included in calculation of ass.rule, because the items(like 126) is less than MIN_Freq. Correct me if i'm mistaken, thanks again sir smiling smiley

Options: ReplyQuote
Re: SPMF 0.89 - CFPGrowth algorithm for mining itemsets with multiple support threshold
Date: March 06, 2014 02:33AM

You are welcome.

Options: ReplyQuote
Re: SPMF 0.89 - CFPGrowth algorithm for mining itemsets with multiple support threshold
Posted by: Manperta Negara
Date: April 16, 2014 06:22PM

i have another question that bothering me this several weeks, sir. I realize that the function of multiple minimum support is to keep the infrequent pattern which above MIN value (MIN = the smallest MIS).
Example from reference paper :
there are 4 items A,B,C,D with MIS (A)=5%. MIS (cool smiley=15%, MIS(C)=30%, MIS(D)=40%.
If A.supp=3%, B.supp=20%, C.supp=25%, D.supp=50% Then set of frequent patterns = {B,C,D}.

from example above, C is infrequent item, because C.supp=25% whereas MIS(C)=30%. But C keep in the set of frequent item cz MIS(C) > MIN

note : MIN = min[MIS(A), MIS(cool smiley, MIS(C), MIS(D)]= 5%.

Question : Is it ok if cfp-growth ass.rule from spmf doesn't have the output of pattern C, sir?
because reference paper from Han said that we must keep those infrequent item in case if maybe those infrequent item have strong association rules combining with other frequent item.

Options: ReplyQuote
Re: SPMF 0.89 - CFPGrowth algorithm for mining itemsets with multiple support threshold
Date: April 17, 2014 09:12AM

I have checked the paper of CFPGrowth.

Suppose that:

MIS(A) = 5%
MIS( B ) = 15%
MIS ( C ) = 30 %
MIS ( D ) = 40 %

and that:

A.supp=3%,
B.supp=20%,
C.supp=25%,
D.supp=50%

The algorithm outputs frequent itemsets. A frequent itemset is defined as an itemset such that its support is higher or equal to MIN.

Therefore, according to this definition, C should not be output because it is infrequent (it support is less than MIN).

However, the algorithm still needs to consider C as an item that can appear in larger itemsets such as BC that may be frequent. Suppose that BC has a support of 15 %. Then BC would b frequent and would be output.

I think that the fact that C should not be output but that it can still appear in larger frequent itemsets is what is confusing you.



Edited 3 time(s). Last edit at 04/17/2014 09:15AM by webmasterphilfv.

Options: ReplyQuote
Re: SPMF 0.89 - CFPGrowth algorithm for mining itemsets with multiple support threshold
Posted by: Manperta Negara
Date: April 30, 2014 06:43AM

maybe you can fix this problem sir.

sample dataset from Uday & Reddy, 2011 (20 transactions) :http://www.edbt.org/Proceedings/2011-Uppsala/papers/edbt/a3-kiran.pdf

1 2
1 5 6
3 4
1 2 8
3 4
1 3
1 2
5 6
3 4 7
1 2
1 2
1 3
1 2
2 5 6 7
3 4
1 2 4
3 4
1 3
1 2 5
3 4

mis :

1 10
3 10
2 8
4 6
5 3
6 3
7 3
8 2

the paper said that freq 1-pattern = { (1: 12),(2: 9),(4: 7),(5: 4),(6: 3)}
freq 2-patterns = {(5,6: 3), (3,4: 6), (1,2: 8)}

That's excactly what you've told me about infrequent pattern that maybe frequent in higher order (3 infrequent in 1-pattern,but frequent in 2-pattern). Happily The output of spmf.jar results the same, sir.

The problem is when spmf result the ass.rule(minconf is set 0.1), which the result is just like this :
5 ==> 6 #SUP: 3 #CONF: 0,75
1 ==> 2 #SUP: 8 #CONF: 0,66667

Please sir, maybe you can tell me why spmf ass.rule cfp-growth++ doesn't have another rule output below

6->5, sup: 3 conf: 3/3 = 100%
2->1, sup: 8 conf: 8/9 = 0,889
4->3, sup: 6 conf: 6/7 = 0,857
3->4, sup: 6 conf: ?----->(this is the problem that you solved which remove the infrequent item on the left side, because reference paper didn't say how to fix this problem, i accept that)

But what about another 3 rules that should also be the output of ass.rule, sir?
If it's just my mistake to understand, I'm really pleased to know the answer.

Many many many thanks, sir smiling smiley

Options: ReplyQuote
Re: SPMF 0.89 - CFPGrowth algorithm for mining itemsets with multiple support threshold
Posted by: Manperta Negara
Date: April 30, 2014 07:15AM

notice that i'm using the old version of cfp-growth++ ass.rule from spmf.jar (before v0.96b - 2014-04-24), and support of each item when i count manually is :
1, s:12
2, s:9
3, s:9
4, s:7
5, s:4
6, s:3
7, s:2
8, s:1

and from Uday Kiran paper, the item from database sample above is (a,b,c,d,e,f,g,h) which i have transformed to (1,2,3,4,5,6,7,8).

Options: ReplyQuote
Re: SPMF 0.89 - CFPGrowth algorithm for mining itemsets with multiple support threshold
Date: April 30, 2014 02:19PM

Hi,

There was a bug. The problem was that itemsets were not sorted and thus the algorithm for generating association rules was not working correctly.

To fix the problem, edit the file AlgoCFPGrowth.java.

Replace this:

private void writeItemsetToFile(int[] itemset, int lastItem, int support)
			throws IOException {
		// increase the number of frequent itemsets found
		itemsetCount++;



by :

private void writeItemsetToFile(int[] itemset, int lastItem, int support)
			throws IOException {
		// increase the number of frequent itemsets found
		itemsetCount++;
		
		Arrays.sort(itemset);


After that, it will produce the following results:

------- ASSOCIATION RULES -------
  rule 0:  5  ==> 6 support :  0.15 (3/20) confidence :  0.75
  rule 1:  6  ==> 5 support :  0.15 (3/20) confidence :  1.0
  rule 2:  1  ==> 2 support :  0.4 (8/20) confidence :  0.6666666666666666
  rule 3:  2  ==> 1 support :  0.4 (8/20) confidence :  0.8888888888888888
 --------------------------------

I will update also the code on the SPMF website.

If you want that I add your name to the contributors list of SPMF under "bug report" to acknowledge that you reported a bug, just tell me your name here or by e-mail. ;-)

Thanks,

Options: ReplyQuote
Re: SPMF 0.89 - CFPGrowth algorithm for mining itemsets with multiple support threshold
Posted by: Manperta Negara
Date: April 30, 2014 09:56PM

smiling smiley My name is Manperta Negara Situmorang.Thank you for the reward, sir smiling smiley

how about rule : 3 ==> 4 support : 0.3 (6/20) confidence : 0.6666666667 (6/9) and rule : 4 ==> 3 support : 0.3 (6/20) confidence : 0.8571428571 (6/7), sir?

Is it possible if the association rule generation algorithm can be modified so the algorithm assume that item '3' is still included in the calculation of association rules?

For example, one rule was X --> Y and X was infrequent according to CFPGrowth. Therefore, when this rule was generated, it had a confidence of zero.....

.....(Is it possible to make the algorithm thinks that although x is not frequent, x is still included in the calculation of association rules, so that the algorithm doesn't think that it's confidence is zero , sir?)

From the Uday Kiran paper that i've read many times, i'm pretty sure that if support of each item is found, then from it's support we can calculate the rule. So the result of the example data above is including rule : 3 ==> 4 support : 0.3 (6/20) confidence : 0.6666666667 (6/9) and rule : 4 ==> 3 support : 0.3 (6/20) confidence : 0.8571428571 (6/7)

-because formula of rule 3 ==> 4 = number of transactions containing (3,4) / number of transactions containing (3) = 6/9
-because formula of rule 4 ==> 3 = number of transactions containing (4,3) / number of transactions containing (4) = 6/7
- Because the paper about rare item problem is to find the frequent item including rare pattern, and it should also apply to the association rules, i think.

forgive me if I ask too many questions, sir. I hope you don't mind :')

Options: ReplyQuote
Re: SPMF 0.89 - CFPGrowth algorithm for mining itemsets with multiple support threshold
Date: May 01, 2014 07:05AM

In the current version of the source code, association rules are generated using the Agrawal (1994) algorithm by using the frequent itemsets.

The algorithm by Agrawal generate a rule X --> Y - X by combining pairs of frequent itemsets X and Y such that X is a subset of Y.

So using this algorithm, if we want to find the rule 3 ==> 4, then the itemsets {4, 3} and {3} should be frequent. But {3} is not frequent using CFPGrowth. That is the reason why this rule is not generated.

Now, how could this be modified to generate rule 3 ==> 4 ? This is not explained in the paper describing CFPGrowth or MISApriori. Therefore, these authors did not solve the problem of using the itemsets to generate the rules. Perhaps that some other authors after them have solve this problem though. But I did not search for that on Google.

An idea to solve the problem is to keep the itemset {3} as you have mentionned. However, if we do that, then we probably don't need to use CFPGrowth. We could just use FPGrowth with minsup = MIS, perhaps... Moreover, this solution would not be efficient because it would require to keep all frequent itemsets with a support >= MIS. So there would be no advantage of using CFPGrowth instead of FPGrowth.

In my opinion, modifying the algorithm to generate 3 ==>4 is not trivial and requires to think about it for a while, because it may require to design a new algorithm instead of using the algorithm of Agrawal for rule generation or to keep all itemsets with a support >= MIS (therefore we use FPGrowth?). Perhaps that some researchers have already addresed this problem. I don't know. You may also search on Google for "mining association rules with multiple support thresholds". Perhaps that you could find some ideas about that.

Otherwise, if we don't care about the performance, a simple solution is to generate all the association rules and then eliminate some rules by post-processing. But it would not be efficient.

Moreover, another problem may be how to define what is an association rules with multiple minimum support? If we have a rule X ==> Y, should we consider that X need to have a support >= MIS or Y? or X U Y ? I think that various definitions may be possible. Once again, perhaps that some researchers have proposed a definition for this but I did not search for papers about that.

That is my opinion about that.



Edited 4 time(s). Last edit at 05/01/2014 07:12AM by webmasterphilfv.

Options: ReplyQuote
Re: SPMF 0.89 - CFPGrowth algorithm for mining itemsets with multiple support threshold
Posted by: Manperta Negara
Date: April 30, 2014 10:13PM

and also i've add Arrays.sort(itemset); to file AlgoCFPGrowth.java and runs it both with the newest source code version and newest jar version but the result is still the same as before, sir. Just 2 rules generated.

Options: ReplyQuote
Re: SPMF 0.89 - CFPGrowth algorithm for mining itemsets with multiple support threshold
Date: May 01, 2014 04:17AM

You are right, there was still a problem.

Here is the correct version of the method "writeItemsetToFile" for AlgoCFPGrowth.java:

/**
	 * Write a frequent itemset that is found to the output file.
	 * @param itemset  an itemset
	 * @param lastItem an item that should be appended to the itemset
	 * @param support the support of "itemset" + "item".
	 */
	private void writeItemsetToFile(int[] itemset, int lastItem, int support)
			throws IOException {
		// increase the number of frequent itemsets found
		itemsetCount++;
		
		// if the result should be saved to a file
		if(writer != null){
			// Create a string buffer
			StringBuffer buffer = new StringBuffer( );
			// write the items of the itemset
			for(int i=0; i< itemset.length; i++){
				buffer.append(itemset);
				buffer.append(' ');
			}
			buffer.append(lastItem);
			// Then, write the support
			buffer.append(" #SUP: "winking smiley;
			buffer.append(support);
			// write to file and create a new line
			writer.write(buffer.toString());
			writer.newLine();
		}// otherwise the result is kept into memory
		else{
			// concatenate the last item to the itemset
			int [] itemsetWithLastItem = new int[itemset.length+1];
			System.arraycopy(itemset, 0, itemsetWithLastItem, 0, itemset.length);
			itemsetWithLastItem[itemset.length] = lastItem;
			
			Arrays.sort(itemsetWithLastItem); // ADDED TO FIX ASSOCIATION RULE BUG FOR CFPGROWTH+
			
			// create an object Itemset and add it to the set of patterns 
			// found.
			Itemset itemsetObj = new Itemset(itemsetWithLastItem);
			itemsetObj.setAbsoluteSupport(support);
			patterns.addItemset(itemsetObj, itemsetObj.size());
		}
	}

For the other question, I will answer you later.



Edited 2 time(s). Last edit at 05/01/2014 04:18AM by webmasterphilfv.

Options: ReplyQuote


This forum is powered by Phorum and provided by P. Fournier-Viger (© 2012).
Terms of use.