Re: Analysis of mining Frequent sequential pattern algorithm's

The Data Mining Forum

open-source data mining software

data mining conferences

Data Science for Social and Behavioral Analytics DSSBA 2022

data science journal

IMPORTANT: This is the old Data Mining forum.
I keep it online so that you can read the old messages.

Please post your new messages in the new forum: https://forum2.philippe-fournier-viger.com/index.php

Goto Topic: Previous•Next

Goto: Forum List•Message List•New Topic•Search•Log In•Print View

Analysis of mining Frequent sequential pattern algorithm's

Posted by: vivek basati

Date: March 22, 2012 08:37AM

hi everyone.

In frequent pattern mining algorithms like prefix-span or SPAM or

among any other algorithms which one would be efficient??

If you cant say directly, can you clarify me which one would be efficient

considering criteria's like time,memory,no. of counts etc...??

Ill be glad if anyone help me in this analysis. smiling smiley

Edited 1 time(s). Last edit at 03/24/2012 12:29PM by webmasterphilfv.

Options: Reply•Quote

Re: Analysis of mining Frequent sequenctial pattern algorithm's

Posted by: vivek basati

Date: March 22, 2012 09:19AM

I came across of algorithm SPADE in this mining process...

post me criterion based analysis...among these algorithms.

Options: Reply•Quote

Re: Analysis of mining Frequent sequenctial pattern algorithm's

Posted by: tisonet

Date: March 22, 2012 09:58AM

HI, I have implemented SPAM, LAPIN-SPAM, PrefixSpan and BIDE algorithm in C#.
SPAM and LAPIN-SPAM have better performance for mining DENSE data sets, that means data which has relative few unique items (4-1000) and long patterns like DNA and protein sequences.
PrefixSpan and BIDE is more efficient for mining SPARSE data sets. That is lots of items and short sequences. Typical example are web logs. Lots of product pages at e-shop and every customer visit a few pages in one session.

Options: Reply•Quote

Re: Analysis of mining Frequent sequenctial pattern algorithm's

Posted by: webmasterphilfv

Date: March 22, 2012 03:50PM

Very good answer. In addition to this, there is some experiments in the papers. For example, in the SPAM paper, they compare SPAM with PrefixSpan. Sometimes experiments in papers are not fair because the authors choose some datasets where their algorithm performs better. But it can give a rough idea about their relative performance.

Options: Reply•Quote

Re: Analysis of mining Frequent sequenctial pattern algorithm's

Posted by: vivek basati

Date: March 22, 2012 06:58PM

thank you tisonet and philippe.

can i get normal large and small data sets so that i can use them directly with out need to change the code in SPMF framework ??

Because i don't have so good hand-on experience in doing java code.

so i await for this favor in doing my analysis part between prefix-span and SPAM.

Options: Reply•Quote

Re: Analysis of mining Frequent sequenctial pattern algorithm's

Posted by: webmasterphilfv

Date: March 22, 2012 08:26PM

Hello Vivek,

I converted 3 datasets for you:
- BMS
- Kosarak (contains only the first 70 000 sequences)
- Snake

For Snake, I will tell you by private message because it is not a public dataset so I don't want to post it here.

Philippe

Edited 4 time(s). Last edit at 03/22/2012 09:00PM by webmasterphilfv.

Options: Reply•Quote

Re: Analysis of mining Frequent sequenctial pattern algorithm's

Posted by: tisonet

Date: March 23, 2012 03:32AM

Nice work.
Do you have also these data sets in binary (IBM GEN.) format?

Options: Reply•Quote

Re: Analysis of mining Frequent sequenctial pattern algorithm's

Posted by: webmasterphilfv

Date: March 23, 2012 05:01AM

Thanks tisonet.

I tried to make some binary version for you:
BMS
Kosarak70K (70 000 first lines of Kosarak)
Snake (will send to your private message).

I think it should work. But i did not have time to test it.

Philippe

Edited 1 time(s). Last edit at 03/23/2012 05:01AM by webmasterphilfv.

Options: Reply•Quote

Re: Analysis of mining Frequent sequenctial pattern algorithm's

Posted by: Dvijesh88

Date: March 23, 2012 05:11AM

one question i have

if we work with the binary dataset than algorithm behave different? means take less time or memory. if no than why we use it. because we cant see it properly.

general idea say yes it should work fast because computer deal with the binary data

Options: Reply•Quote

Re: Analysis of mining Frequent sequenctial pattern algorithm's

Posted by: tisonet

Date: March 23, 2012 05:17AM

No, mining time should be the same. Maybe sequence database is a little faster loaded from source file. But it doesnt affect mining.
I had testing program which can read data from binary file, so i dont wanna create new one for different type of source data. It is the only reason.

Options: Reply•Quote

Re: Analysis of mining Frequent sequenctial pattern algorithm's

Posted by: Dvijesh88

Date: March 23, 2012 05:21AM

oh ok

i just ask. because i want to convert dataset from the binary dataset, which is generated by IBM data generator, to the dataset which used in my work. so i want to know that should i deal with binary if it give faster output.

thanks

Options: Reply•Quote

Re: Analysis of mining Frequent sequenctial pattern algorithm's

Posted by: Dvijesh88

Date: March 22, 2012 08:29PM

Well as far as i know that

Prefixspan more efficient than the Span if you really working on the large dataset as tisonet told. agree with him

but i read two or more articles, in that they mention that prefixspan is better than the SPAM. even my guide also told me that not go for the Bit-wise SPAM. find some thing else to improve the performance of current system. becz SPAM is not much efficient.

and if we work for the research work i think our algorithm should work for the large dataset as well.

this is only my thought. i can be wrong.

i also like the SPAM and i also feel the performance increase but for the large dataset prefixspan work even better than the SPAM.

bottom line : it depends of which dataset and for which application, you want the algorithms.

Options: Reply•Quote

Re: Analysis of mining Frequent sequenctial pattern algorithm's

Posted by: vivek basati

Date: March 22, 2012 10:15PM

Hello philiphe,

i tried to execute the prefix span using the dataset which u gave i.e.BMS converted to text but its not giving the result and showing as
Total time ~ 228 ms
Frequent sequences count : 0
Max memory (mb) : 0.00
with minsup as 0.5

any solution?

Options: Reply•Quote

Re: Analysis of mining Frequent sequenctial pattern algorithm's

Posted by: Dvijesh88

Date: March 22, 2012 10:46PM

run the program with the less support. like 0.2

Options: Reply•Quote

Re: Analysis of mining Frequent sequenctial pattern algorithm's

Posted by: vivek basati

Date: March 23, 2012 12:04AM

Its showing same result as count and memory 0, with all varying support..

Options: Reply•Quote

Re: Analysis of mining Frequent sequenctial pattern algorithm's

Posted by: webmasterphilfv

Date: March 23, 2012 04:08AM

Then you try again with a lower minimum support like 0.01 and you will get some patterns.

For 0.01 i get 78 patterns with PrefixSpan

0.01 means 1%.

If you use a lower support, you will get even more patterns.

Options: Reply•Quote

Re: Analysis of mining Frequent sequenctial pattern algorithm's

Posted by: vivek basati

Date: March 23, 2012 12:42AM

In SPAM algorithm if i increment BIT_PER_SECTION i was able to get Frequent sequences count upto 1053 for kosarat datasets(converted to text) but coming to prefix-span im getting just count 6.

can i know any changes i have to do in prefixspan too for kosarat-datasets?

Options: Reply•Quote

Re: Analysis of mining Frequent sequenctial pattern algorithm's

Posted by: webmasterphilfv

Date: March 23, 2012 04:38AM

Hello,

There is no changes to do for PrefixSpan.

The only change to do is for SPAM. By default the implementation assumes that there is no sequences longuer than 32 itemsets.

In BMS there is 59601 sequences. The longest sequence is 267 itemsets.

In the part of Kosarak that I gave to you, there is 70 000 sequences and the longest sequence is 796 itemsets

So for BMS, BIT_PER_SECTION should be set higher than 267
and for Kosarak, BIT_PER_SECTION should be set higher than 796.

If you do that, the two algorithms should give the same results.

But be aware that PrefixSpan takes a percentage as input (for example 0.01 of the sequences) whereas SPAM takes an integer as input (for example 59601 * 0.01 = 596 sequences). So to compare, you would need to calculate

Options: Reply•Quote

Re: Analysis of mining Frequent sequenctial pattern algorithm's

Posted by: vivek basati

Date: March 23, 2012 07:21AM

Thank you philippe.

But while executing SPAM i'm getting exception as
Exception in thread "AWT-EventQueue-0" java.lang.OutOfMemoryError: Java heap space

So its memory problem, how can i solve it??

Options: Reply•Quote

Re: Analysis of mining Frequent sequenctial pattern algorithm's

Posted by: webmasterphilfv

Date: March 23, 2012 11:38AM

Hi,

For memory there is two ways:
- change the parameters of the algorithm
- increase the memory that the Java virtual machine can use ( see here for instructions about how to do this: http://www.philippe-fournier-viger.com/spmf/index.php?link=FAQ.php#memory )

Options: Reply•Quote

Re: Analysis of mining Frequent sequenctial pattern algorithm's

Posted by: vivek basati

Date: March 23, 2012 04:43PM

can you tell how to do it in netbeans to increase java virtual machine memory?

Options: Reply•Quote

Re: Analysis of mining Frequent sequenctial pattern algorithm's

Posted by: webmasterphilfv

Date: March 23, 2012 05:44PM

I don't have netbeans on my computer. But I have searched on Google and some people say:

"In NetBeans, you can add command line options using the Properties of the Project, the Run option. There is an option for the JVM command line there. Look at the -Xms and -Xmx options."

More information here about how to do it:

http://www.codemiles.com/java/here-how-to-increase-java-heap-size-t663.html

Options: Reply•Quote

Re: Analysis of mining Frequent sequenctial pattern algorithm's

Posted by: Dvijesh88

Date: March 23, 2012 09:38PM

hello sir,

yes you are right. i am using netbeans and Memory for the JVM will be set as you told.

right click on the project go to properties than choose the Run and write -Xmx 1024m so JVM will use the 1gb memory

Options: Reply•Quote

Re: Analysis of mining Frequent sequenctial pattern algorithm's

Posted by: webmasterphilfv

Date: March 24, 2012 05:50AM

Thanks! I will add this to the FAQ on the website. smiling smiley

Options: Reply•Quote

Re: Analysis of mining Frequent sequenctial pattern algorithm's

Posted by: vivek basati

Date: March 24, 2012 12:43AM

i was able to set but which MainClass should i select for executing SPAM algo in your SPMF framework??

i was able to select only the MainTestSPAM class in tests.if i select this class also i am getting same error...

Options: Reply•Quote

Re: Analysis of mining Frequent sequenctial pattern algorithm's

Posted by: webmasterphilfv

Date: March 24, 2012 05:48AM

For running the test for SPAM, yes it is the class MainTestSPAM.

If you still run out of memory with 1 GB, it might be because the dataset is too large for SPAM with the value that you set for BIT_PER_SECTION. If you have enough memory you could try 2GB. Or you could try with loading only half of the dataset (to do that you would to modify a little bit the code).

Another reason may be that you use a too small value for minsup. So you may try to use a higher value of minsup.

Options: Reply•Quote

Re: Analysis of mining Frequent sequenctial pattern algorithm's

Posted by: Dvijesh88

Date: March 24, 2012 12:04PM

i think you take too much small support.

because 1Gb is enough for the billions of sequences. but yes as philfv said if u have 2Gb ram than you can utilize the full memory resource.

can you tell me the size of your dataset? (in mb)

Options: Reply•Quote

Re: Analysis of mining Frequent sequenctial pattern algorithm's

Posted by: vivek basati

Date: March 24, 2012 06:51PM

hello,

my dataset is of 1.51mb.

problem is ,i want to compare prefixspan and spam for same large dataset with same minsupport but prefix want 1% to 5% minsup to get more patterns whereas spam is running out of memory with that minsup...

so im not able to make analysis for this 2 algorithms takin large datasets.i want any medium datasets for my analysis??

Options: Reply•Quote

Re: Analysis of mining Frequent sequenctial pattern algorithm's

Posted by: webmasterphilfv

Date: March 24, 2012 07:33PM

I think it is because of the BIT_PER_SECTION. In my implementation of SPAM, I use the same amount of memory for each sequence and it may be a problem because there is some very long sequences in BMS and Kosarak.

I send you some smaller dataset. I have removed the longest sequences in them.

BMS_10k_smaller_than30.txt : the first 10k lines of BMS without sequence longer than 30 itemsets.
BMS_smaller_than30.txt : BMS without sequence longer than 30 itemsets.
Kosarak_20k_smaller_than_30.txt : the first 20k lines of Kosarak without sequence longer than 30 itemsets.
Kosarak_10k_smaller_than_30.txt : the first 10k lines of Kosarak without sequence longer than 30 itemsets.

IMPORTANT: When you use these datasets with SPAM, you should set BIT_PER_SECTION to 32. It would be better to save memory.

Options: Reply•Quote

Re: Analysis of mining Frequent sequenctial pattern algorithm's

Posted by: webmasterphilfv

Date: April 13, 2012 06:39AM

I have posted a new version of my SPAM implementation (SPMF v0.81).

I have removed the BIT_PER_SECTION variable. Now, the number of bit per sequence is variable.

The implementation is therefore more memory efficient and faster.

Options: Reply•Quote

Re: Analysis of mining Frequent sequenctial pattern algorithm's

Posted by: tisonet

Date: March 23, 2012 03:49AM

You can find compare of algorithms in a taxonomy of sequential patterns (Fig. 7).

(Fig. 8) shows density vs. algorithm execution time of algortihms. But iam not sure if these graphs are correct, because execution times are very similar for all level of density and that is weird.

The best (closed) sequential pattern mining algorithm should be BIDE but is difficult to implemented it for general sequences (itemset with any items), iam fighting with him now.

Options: Reply•Quote

Re: Analysis of mining Frequent sequenctial pattern algorithm's

Posted by: webmasterphilfv

Date: March 23, 2012 05:33AM

Thanks tisonet. Very interesting article.

I agree BIDE is probably the fastest for closed sequential patterns. There is an article about an algorith named COBRA that claimed to be faster than BIDE. But it was published in a small conference so i'm not sure if it is true or not.

By the way, if some of you are interested, I have a set of papers on sequential pattern mining that I have collected ( > 55 articles). You can download it from here.

Options: Reply•Quote

Re: Analysis of mining Frequent sequenctial pattern algorithm's

Posted by: tisonet

Date: March 23, 2012 09:59AM

thumbs up

That is amazing collection. Thanks so much.

Options: Reply•Quote

Re: Analysis of mining Frequent sequenctial pattern algorithm's

Posted by: Baxter

Date: March 25, 2012 04:52AM

Thanks !

Options: Reply•Quote

Re: Analysis of mining Frequent sequenctial pattern algorithm's

Posted by: Dvijesh88

Date: March 23, 2012 10:14AM

that sound gr8

downloading now. than will give comment on it

Options: Reply•Quote

Re: Analysis of mining Frequent sequenctial pattern algorithm's

Posted by: tisonet

Date: April 13, 2012 10:26AM

I did few experiments.

We can see that BIDE can be the most effective, but also the most ineffective.

Check graphs here.

Options: Reply•Quote

Re: Analysis of mining Frequent sequenctial pattern algorithm's

Posted by: webmasterphilfv

Date: April 13, 2012 11:22AM

Hi Tisonet.

Very interesting. Thanks for sharing your results. thumbs up

thumbs up

We can also observe than BIDE is always faster than PrefixSpan and that LAPIN-SPAM is always faster than SPAM in your experiment.

Philippe

Options: Reply•Quote

Re: Analysis of mining Frequent sequenctial pattern algorithm's

Posted by: Dvijesh88

Date: April 14, 2012 09:27AM

really good observation sir,

i really like it.

Options: Reply•Quote

Re: Analysis of mining Frequent sequential pattern algorithm's

Posted by: vivek basati

Date: April 21, 2012 06:57PM

hi,

can any one explain me why we used relative minsup for prefix-span and obsolete

minsup for SPAM in frame work??

Options: Reply•Quote

Re: Analysis of mining Frequent sequential pattern algorithm's

Posted by: webmasterphilfv

Date: April 21, 2012 07:29PM

Hello Vivek,

To use an absolute minsup for SPAM was just an implementation decision that I made. The reason why I did that was to avoid an extra database scan and thus to make the algorithm faster.

But in the latest version of SPMF that you can download from the website, I have improved the SPAM implementation and it now use a relative minsup just like the PrefixSpan implementation.

Hope this helps,

Philippe

Options: Reply•Quote

Re: Analysis of mining Frequent sequential pattern algorithm's

Posted by: vivek basati

Date: April 21, 2012 09:27PM

thank you sir.

one more doubt sir,

You have used Candidate generation in SPAM, We could not understand the S-step

process which you have used in it.

could you please explain us on what basis your selected the k bit , i.e. the index

and how are you assigning 0's and 1's to the transormed bitmap?

Please explain the process using the figure 4 in the base paper of SPAM you provided for reference...

Options: Reply•Quote

Re: Analysis of mining Frequent sequential pattern algorithm's

Posted by: webmasterphilfv

Date: April 22, 2012 06:10AM

Hello Vivek,

I tried to follow the paper for implementing SPAM. In my implementation there is a few differences due to optimization and some design decisions. But the main idea of the S step is the same.

I will explain to you the createNewBitmapSStep method in the Bitmap class, which performs the main part of the S-Step.

The S-STEP is shown in figure 4 of the paper. It consists of doing a kind of logical AND (not exactly a logical AND) with two bitmaps : the bitmap corresponding to the current pattern (e.g. {a}) and an itemset that we want to add to the current pattern (e.g. {b}). The goal is to calculate the bitmap corresponding to the resulting pattern (e.g. {a}, {b}).

So how to do that? Let's look at the createNewBitmapSStep method.

First I create a new bitmap to store the result.

BitSet newBitset = new BitSet(lastBitIndex); 
		Bitmap newBitmap = new Bitmap(newBitset);

Then, to perform the AND, the createNewBitmapSStep method first do a loop on the bits of the bitmap corresponding to the itemset that we want to add (e.g. {b}). The index bitK represents the index of a bit that is set to 1 in this bitmap. The loop will loop on all the bit set to 1 on the bitmap.

for (int bitK = bitmap.nextSetBit(0); bitK >= 0; bitK = bitmap.nextSetBit(bitK+1)) {

For each bit set to 1, bitK is the index in the bitmap. We need to determine to what sequence this bit correspond in the original sequence database and what is the last bit that corresponds to this sequence. This is done with this:

			int sid = bitToSID(bitK, sequencesSize);
 			int lastBitOfSID = lastBitOfSID(sid, sequencesSize, lastBitIndex);

Then, we check the other bitmap (e.g. {a}) to see if there is a bit set to 1 after the position bitK until the last bit of the sequence.

for (int bit = bitmapItem.bitmap.nextSetBit(bitK+1); bit >= 0 && bit <= lastBitOfSID; bit = bitmapItem.bitmap.nextSetBit(bit+1)) {
				newBitmap.bitmap.set(bit);
				match = true;
			}

If there is, that means then we set the bit to 1 in the new bitmap

newBitmap.bitmap.set(bit);
				match = true;

and then I do some optimizations like increasing the support count, and storing the last bit set to 1 for the new bitmap (lastSID). These are for optimizations purposes:

if(match){
				// update the support
				if(sid != newBitmap.lastSID){
					newBitmap.support++;
				}
				newBitmap.lastSID = sid;
			}
			bitK = lastBitOfSID; // to skip the bit from the same sequence

The idea of these optimisations it to keep track of the support to avoid always counting the bits and to keep the position of the last bit set to 1 to avoid scanning the bitmap completely.

I think that I have explained the main idea of the S-Step.

Actually it is not very complicated. You take two bitmap (e.g. {a} and {b}( and do a kind of AND to generate a new bitmap (e.g. {a},{b})

Hope this helps,

Philippe

Edited 4 time(s). Last edit at 04/22/2012 06:16AM by webmasterphilfv.

Options: Reply•Quote

Re: Analysis of mining Frequent sequential pattern algorithm's

Posted by: roooooooooop

Date: April 19, 2013 07:43PM

I have implemented Parallel Prefix Span Algorithm. Can anybody suggest me what more can i do n how can i improve the efficiency more.

Options: Reply•Quote

Re: Analysis of mining Frequent sequential pattern algorithm's

Posted by: webmasterphilfv

Date: April 20, 2013 05:32AM

Hi rooooooop,

Great. Congratulation.

For optimizations, I Will give you a few ideas:
- do you use pseudo-projection (as described in the PrefixSpan paper)?
- at the beginning of the algorithm, do you delete infrequent items from the database? (because infrequent item cannot be part of a frequent sequential pattern).
- What kind of data structure do you use? If you use array-list or vectors, could you use arrays instead? or HashSet? or HashMap? choosing appropriate data structures is always important.
- how do you represent the items? if them items are represented as strings, it will be less efficient. If you represent them as integer, it would be more efficient. If you represent them as short, it may be more efficient.
- ...

Philippe

Options: Reply•Quote

Re: Analysis of mining Frequent sequential pattern algorithm's

Posted by: Digvj.patel

Date: April 23, 2014 08:49PM

Hey roop,
even i am working on a project implementing parallel prefixspan algo.i am using sockets and the java executorservice. i am able to split the larger dataset into a smaller one but i am having issues with invoking all the clients at once over the network. could you plz help me getting around with it or mail your code to me?

digzou@gmail.com

Options: Reply•Quote

Re: Analysis of mining Frequent sequential pattern algorithm's

Posted by: vivek basati

Date: April 21, 2012 11:06PM

when i tried to execute SPAM in netbeans, with contextprefixspan.txt as input, im getting exception as

Exception in thread "AWT-EventQueue-0" java.lang.NoSuchMethodError: ca.pfv.spmf.sequentialpatterns.spam_saveToFile.AlgoSPAM.runAlgorithm(Ljava/lang/String;Ljava/lang/String;D)V
at ca.pfv.spmf.gui.MainWindow$4.actionPerformed(MainWindow.java:270)

plz help me out...

But in gui version im able to execute,problem executing in net beans

Options: Reply•Quote

Re: Analysis of mining Frequent sequential pattern algorithm's

Posted by: webmasterphilfv

Date: April 22, 2012 05:49AM

Hello Vivek,

I tried in Netbeans and it works on my computer.

I created a new projects and then i copied all the files of spmf081.zip into the src folder of the project.

Then, in Netbeans i right click on MainWindow to launch the GUI from Netbeans and I selected SPAM and the contextPrefixSpan.txt file.

It worked.

Maybe you did not copy all the files in netbeans. The "no such method exception" is perhaps because you did not copy all the files in your netbeans project. The solution is maybe to copy the files again to your netbeans project.

Hope this helps to solve your problem,

Philippe

Options: Reply•Quote

Re: Analysis of mining Frequent sequential pattern algorithm's

Posted by: malsoru

Date: May 05, 2014 05:10PM

Sir, Good Morning.
Please, mail Snake dataset.

Options: Reply•Quote

This forum is powered by Phorum and provided by P. Fournier-Viger (© 2012).
Terms of use.