Re: Pattern Mining for very large sequences
Date: July 24, 2013 05:54PM
Hi,
Welcome to the forum. I'm glad that you like the software.
For your application, I think that SPAM would be a better algorithm than PrefixSpan. It produces the same output, but in general SPAM is faster than PrefixSpan on dense datasets. I think that your dataset is dense because you probably have a small number of items that appear very frequently in each sequence.
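If you want a rough check of whether your data is dense in this sense, here is a minimal Python sketch. It is not part of SPMF; the function name and the toy sequences are made up for illustration. It computes, for each item, the fraction of sequences that contain it. If a handful of items have a fraction close to 1, the dataset is probably dense.

```python
from collections import Counter

def item_sequence_coverage(sequences):
    """Return {item: fraction of sequences that contain the item at least once}."""
    if not sequences:
        return {}
    counts = Counter()
    for seq in sequences:
        # set() so that an item is counted once per sequence
        counts.update(set(seq))
    n = len(sequences)
    return {item: c / n for item, c in counts.items()}

# Toy data (hypothetical): a few items repeated many times in every sequence
sequences = [
    ["a", "b", "a", "c", "a", "b"],
    ["b", "a", "a", "b", "d"],
    ["a", "b", "b", "a", "a", "c"],
]
coverage = item_sequence_coverage(sequences)
for item, frac in sorted(coverage.items()):
    print(item, round(frac, 2))
```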
Besides changing algorithms, you can try:
- Raising the minsup parameter. The higher minsup is set, the faster the algorithm will be. You can start with a high value and then lower it gradually.
- If some of the data is not important, you could preprocess the input to filter out the unnecessary data, and the algorithm will be faster.
- Also, in the latest version of SPMF (from last month), there is a max length parameter for sequential pattern mining with SPAM and PrefixSpan. If you set it to a smaller value, the algorithm will be faster. You can start with a small value such as 2 and then increase it to find longer patterns.
- Also, would it make sense to split your very long sequences? If you could split them, or just consider a subset, it could make the algorithms faster.
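The preprocessing and splitting ideas above could be sketched in Python as follows. This is only a sketch under some assumptions: the helper names, the threshold and the window size are made up, and it assumes the SPMF sequence format as I understand it, where items are positive integers, -1 ends each itemset and -2 ends each sequence (please check the SPMF documentation for your version). Note that splitting changes the results: a pattern that crosses a chunk boundary is lost, and each chunk is counted as a separate sequence for the support.

```python
from collections import Counter

def filter_rare_items(sequences, min_fraction):
    """Drop items that occur in fewer than min_fraction of the sequences."""
    n = len(sequences)
    counts = Counter()
    for seq in sequences:
        counts.update(set(seq))  # count each item once per sequence
    keep = {item for item, c in counts.items() if c / n >= min_fraction}
    return [[item for item in seq if item in keep] for seq in sequences]

def split_sequence(seq, window):
    """Cut one long sequence into consecutive chunks of at most `window` items."""
    return [seq[i:i + window] for i in range(0, len(seq), window)]

def to_spmf_line(seq):
    """Format a sequence of single items: '-1' after each item, '-2' at the end."""
    return " ".join(f"{item} -1" for item in seq) + " -2"

# Toy data (hypothetical): items are positive integers
sequences = [
    [1, 2, 1, 3, 1, 2, 9],
    [2, 1, 1, 2, 4, 1],
]
filtered = filter_rare_items(sequences, min_fraction=1.0)  # keep items present in every sequence
chunks = [c for seq in filtered for c in split_sequence(seq, window=3)]
lines = [to_spmf_line(c) for c in chunks]
print("\n".join(lines))
```

The printed lines can be saved to a text file and used as the input file for SPAM or PrefixSpan.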
Those are a few ideas that I can think of now and that you can try.
Hope that this helps,
Philippe
Edited 2 time(s). Last edit at 07/24/2013 05:56PM by webmasterphilfv.