The Data Mining Forum
IMPORTANT: This is the old Data Mining forum.
I keep it online so that you can read the old messages.

Please post your new messages in the new forum: https://forum2.philippe-fournier-viger.com/index.php
 
PrefixSpan:
Posted by: saiph
Date: June 16, 2013 12:28PM

Hi,

I am trying prefixSpan like this:

$ java -jar spmf.jar run PrefixSpan_with_strings complete-operations.txt output.txt 10% 10000

But it takes forever with this file complete-operations.txt: http://pastebin.mozilla.org/2532554

Am I doing something wrong?

Thanks.

Re: PrefixSpan:
Date: June 16, 2013 05:47PM

Hi,

No, you are using the software correctly.

Actually, the performance of this kind of algorithm depends a lot on the data and the parameters.

Your input file is not very large, but the sequences are long and dense (many items are shared by several sequences), so there may be a LOT of patterns, and that is why the algorithm becomes slow.

To find a suitable minsup value, it is recommended to start with a high value and then lower it slowly. I tried this with your data. For minsup = 0.18, the algorithm returns 17 patterns. But for minsup = 0.175, the algorithm suddenly becomes very slow. There seem to be so many patterns at this value that the algorithm slows down.
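The "start high, then go down" procedure can be sketched in Java roughly as follows. This is only an illustration, not part of SPMF: the class MinsupDescent, the method lowestAffordableMinsup and the countPatterns callback are hypothetical names, and the callback stands in for whatever actually runs the miner and counts the patterns in the output file. A real callback would also need a time limit, since a run at too low a minsup may not finish.

```java
import java.util.function.DoubleToIntFunction;

/** Sketch: lower minsup step by step and stop before the pattern count explodes. */
final class MinsupDescent {

    /**
     * Tries start, start - step, start - 2*step, ... down to floor.
     * Returns the lowest value tried whose pattern count stayed within maxPatterns,
     * or NaN if even the starting value produced too many patterns.
     * countPatterns is a hypothetical callback: it runs the miner, returns the count.
     */
    static double lowestAffordableMinsup(DoubleToIntFunction countPatterns,
                                         double start, double step, double floor,
                                         int maxPatterns) {
        double best = Double.NaN;
        // Number of downward steps that fit between start and floor.
        int steps = (int) Math.floor((start - floor) / step + 1e-9);
        for (int i = 0; i <= steps; i++) {
            double minsup = start - i * step; // avoids accumulating rounding error
            int count = countPatterns.applyAsInt(minsup);
            if (count > maxPatterns) {
                break; // too many patterns: keep the previous value
            }
            best = minsup;
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up stand-in for a real run: few patterns above a threshold, very many below.
        DoubleToIntFunction fakeMiner = m -> m > 0.176 ? 17 : 1_000_000;
        double minsup = lowestAffordableMinsup(fakeMiner, 0.20, 0.005, 0.15, 1000);
        System.out.printf("Lowest affordable minsup: %.3f%n", minsup);
    }
}
```

The pattern budget (maxPatterns) is an arbitrary choice; the point is only to stop as soon as the count jumps instead of letting the run go on indefinitely.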

How can you solve this problem? You could use the Hirate-Yamana algorithm in the latest release of SPMF, which allows length constraints and gap constraints, for example. If you add these constraints, it will improve the speed. Another possibility is to preprocess your data to remove information that is not important and that may slow down the algorithm. The more similar the sequences are to each other, the slower the algorithm will be. So sometimes it makes sense to do some preprocessing to prepare your data before applying the algorithm!

Hope this helps!

Philippe

Re: PrefixSpan:
Posted by: saiph
Date: June 19, 2013 08:50AM

I got an out-of-memory error after running PrefixSpan for a long time with the command I described above (I will try to increase the heap size on the next execution).


Hirate-Yamana does not support strings yet, right?

«$ java -jar spmf.jar run HirateYamana hy-complete-operations.txt output.txt 0.20 0 100 0 100
java.lang.NumberFormatException: For input string: "DISTRICT:2:1:k"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:481)
at java.lang.Integer.parseInt(Integer.java:514)
at ca.pfv.spmf.algorithms.sequentialpatterns.fournier2008.SequenceDatabase.processSequence(SequenceDatabase.java:141)
at ca.pfv.spmf.algorithms.sequentialpatterns.fournier2008.SequenceDatabase.loadFile(SequenceDatabase.java:70)
at ca.pfv.spmf.gui.MainWindow.runAlgorithm(MainWindow.java:1278)
at ca.pfv.spmf.gui.MainWindow.commandLine(MainWindow.java:2040)
at ca.pfv.spmf.gui.MainWindow.main(MainWindow.java:190)»

Thanks

Re: PrefixSpan:
Date: June 19, 2013 10:06AM

Hi Saiph,

If it runs out of memory, it means that the search space is too large. One reason may be that there are too many patterns or that the patterns are too long.

Increasing the heap size is a good idea. Besides, there are several ways to reduce the number of patterns and thus improve the performance:
  • Increase the minsup parameter from 0.2 to something larger. You can try a high value like 0.9 and then decrease it until you find a good value.
  • Lower the "maximum whole interval" parameter from 100 to something smaller. For example, you could use 5 or 10. Maybe that would be enough.
  • Lower the "maximum item interval" parameter from 100 to something smaller.
  • Do some preprocessing on your data to eliminate unnecessary information, or split your data in half, for example.

Yes, the Hirate-Yamana implementation does not accept strings yet. I'm a little busy right now, but this summer I will probably add a tool to convert text to a sequence database. By the way, I have added a few new datasets in the datasets section of the website (note that some of them are large and may not work well with some algorithms)!
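Until such a tool exists, one workaround for the NumberFormatException above is to replace every string item with an integer before running the algorithm. Here is a minimal sketch of that idea. It is not an SPMF class: ItemEncoder and its methods are hypothetical names, and it assumes that items contain no spaces, that -1 and -2 are the itemset and sequence separators, and that timestamps are written like <0>. Please check these assumptions against your own file.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: map string items to integer ids, leaving separators and timestamps as they are. */
final class ItemEncoder {
    // Insertion-ordered so ids can be listed in order of first appearance.
    private final Map<String, Integer> ids = new LinkedHashMap<>();

    /** Rewrites one sequence line, replacing each string item with a positive integer id. */
    String encodeLine(String line) {
        StringBuilder out = new StringBuilder();
        for (String token : line.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // blank line
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(isStructural(token) ? token : String.valueOf(idOf(token)));
        }
        return out.toString();
    }

    /** The item-to-id table, needed later to translate the mined patterns back. */
    Map<String, Integer> mapping() {
        return ids;
    }

    // Assumed structural tokens: -1 (end of itemset), -2 (end of sequence), <n> (timestamp).
    private static boolean isStructural(String token) {
        return token.equals("-1") || token.equals("-2") || token.matches("<\\d+>");
    }

    private int idOf(String item) {
        Integer id = ids.get(item);
        if (id == null) {
            id = ids.size() + 1; // ids start at 1
            ids.put(item, id);
        }
        return id;
    }

    public static void main(String[] args) {
        ItemEncoder encoder = new ItemEncoder();
        // "OTHER" is a made-up item used only for this illustration.
        String line = "<0> DISTRICT:2:1:k -1 <1> OTHER DISTRICT:2:1:k -1 -2";
        System.out.println(encoder.encodeLine(line)); // prints "<0> 1 -1 <1> 2 1 -1 -2"
        System.out.println(encoder.mapping());
    }
}
```

Keep the mapping table, because the patterns in the output file will contain the integer ids and you will need it to read them.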

Best,
Philippe



Edited 1 time(s). Last edit at 06/19/2013 10:07AM by webmasterphilfv.

out of memory
Posted by: Dhyanesh Parmar
Date: February 02, 2014 12:08PM

While running the PrefixSpan algorithm, it ran out of memory.
How can I increase the heap size in Java? I am using NetBeans IDE 6.9.1 and I tried the VM Options field in the Run tab of NetBeans, but it is not working. So how can I increase it?

Re: out of memory
Date: February 02, 2014 12:20PM

Hi,

I don't have NetBeans on my machine (I work with Eclipse), so I searched a little bit on Google and found this:

To increase the heap size for running a program in NetBeans to 1024 MB:
1. Right-click on your project and choose "Properties".
2. Select the "Run" category.
3. Enter your arguments (-Xmx1024m) in the "VM Options" text box.

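Another way:
1. Go to the project Properties window and set the additional compiler option in the Build > Compiling tab to -Xmx1024m. (This is a compiler setting, so I am less sure that it changes the heap available at run time.)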

By the way, I'm not very familiar with NetBeans. But in Eclipse we can associate a heap size with each Java file that is run within a project. Therefore, I think that you should also check that you are increasing the heap size for the project or the file that you are actually running.
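A simple way to check whether the setting is really applied is to print the maximum heap from inside the program you run. The small class below is only a sketch (HeapCheck is a hypothetical name); Runtime.getRuntime().maxMemory() is the standard Java call that reports the maximum amount of memory the JVM will try to use.

```java
/** Sketch: report the maximum heap of the JVM that is running this code. */
final class HeapCheck {

    /** Maximum memory the JVM will attempt to use, in megabytes. */
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMb() + " MB");
    }
}
```

If you run it from the same project with -Xmx1024m in "VM Options", the value should be close to 1024 (it can be slightly lower depending on the JVM). If it still shows a much smaller number, the option is probably not reaching the run configuration that you are using.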

Lastly, it is possible that the algorithm will run out of memory even if you increase the heap size. This depends on the parameters that you use and on your data. The lower minsup is set, the more patterns are found. Moreover, the more sequences there are and the longer they are, the more memory PrefixSpan may use. Performance depends a lot on the data and parameters for this kind of algorithm.
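To get a rough idea of how large and how long your sequences are before choosing parameters, you could compute a few basic statistics on the input file. This is a sketch under assumptions, not an SPMF tool: SequenceStats is a hypothetical name, and it assumes one sequence per line with items separated by spaces and -1 / -2 used as separators.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch: number of sequences, distinct items and average sequence length of a file. */
final class SequenceStats {
    final int sequences;
    final int distinctItems;
    final double averageLength; // average number of items per sequence

    private SequenceStats(int sequences, int distinctItems, double averageLength) {
        this.sequences = sequences;
        this.distinctItems = distinctItems;
        this.averageLength = averageLength;
    }

    static SequenceStats of(List<String> lines) {
        Set<String> items = new HashSet<>();
        int sequences = 0;
        long totalItems = 0;
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            sequences++;
            for (String token : trimmed.split("\\s+")) {
                if (token.equals("-1") || token.equals("-2")) {
                    continue; // assumed separators, not items
                }
                items.add(token);
                totalItems++;
            }
        }
        double average = sequences == 0 ? 0.0 : (double) totalItems / sequences;
        return new SequenceStats(sequences, items.size(), average);
    }

    public static void main(String[] args) throws IOException {
        // Usage: java SequenceStats <input file>
        SequenceStats stats = of(Files.readAllLines(Paths.get(args[0])));
        System.out.println("Sequences: " + stats.sequences);
        System.out.println("Distinct items: " + stats.distinctItems);
        System.out.println("Average length: " + stats.averageLength);
    }
}
```

Few distinct items combined with long sequences usually means that many items are shared between sequences, which is the dense situation where the number of patterns grows quickly.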

Best,

Philippe



This forum is powered by Phorum and provided by P. Fournier-Viger (© 2012).