The Data Mining Forum                             open-source data mining software data mining conferences Data Science for Social and Behavioral Analytics DSSBA 2022 data science journal
IMPORTANT: This is the old Data Mining forum.
I keep it online so that you can read the old messages.

Please post your new messages in the new forum: https://forum2.philippe-fournier-viger.com/index.php
 
What algorithm performs better for large sequences?
Posted by: vrodriguezf
Date: December 20, 2018 01:55AM

Hi,

I am analyzing a sequence database composed of 1134 sequences. The total number of symbols is 74, and the average number of distinct symbols per sequence is 35.69, with a standard deviation of 2.65. I think that the most important feature in this dataset is the length of the sequences. The average length is 1466.60, with an standard deviation of 2172.60. The maximum length of a sequence is 9980.

Currently I am working in R, using the package TraMineR. I have tried a function of that package to extract frequent sub-sequences that relies on the algorithm proposed in [1], but the computational time is huge. It take days to compute patterns with a just a subsample of the database.

I just came across this framework and this forum, and I wonder if there is a sequential pattern mining algorithm known for working specially well with large sequences.

Many thanks in advance!
Víctor

----------------

[1] Masseglia, F., Cathala, F., & Poncelet, P. (1998, September). The PSP approach for mining sequential patterns. In European Symposium on Principles of Data Mining and Knowledge Discovery (pp. 176-184). Springer, Berlin, Heidelberg.

Options: ReplyQuote
Re: What algorithm performs better for large sequences?
Date: December 23, 2018 02:11AM

Hi,

Welcome to this forum and the SPMF software.

Yes, different algorithms will have different performance. PSP is a somewhat old algorithm. You could definitely try some newer algorithms offered in SPMF.

But I would like to give you some advices:
- You do not have that many sequences but the sequences are indeed very long. If the sequences are also very similar, it is possible that the search space is huge. A good way to reduce the search space and obtain better performance is to add some constraints such as a maximum length for patterns, or a maximum gap between event of a pattern. If you set such parameters, the algorithm can become much faster! For example, if you say that you only want patterns with no more than 3 items and no gap, it should be very fast.
In SPMF, algorithms such as CM-SPAM will let you add several constraints.
- How the parameters are set is also very important. For the minsup parameter, you should start from a high value and slowly decrease it until you have enough patterns.
- Another possibility is to transform the database to remove some events in your sequences that are actually unimportant. If you reduce the number of items, then the search space is reduced.

In SPMF, some algorithms will also let you specify that some items must appear in the patterns. For example, you can search for all patterns that must contain a given item. This will also reduce the search space. Algorithms like CM-SPAM and TKS provide this type of constraints.

Some algorithms in SPMF uses more or less memory. If it run out of memory, you have a few algorithms that you can try and adjust the paramters.

Hope this helps.

Best regards,

Options: ReplyQuote
Re: What algorithm performs better for large sequences?
Posted by: vrodriguezf
Date: January 08, 2019 03:45AM

Thank you for your reply!

As you said, the only way to make to execute an algorithm from SPMF using this sequence database without crashing was to introduce constraints.

One extra thing that I would like to ask, which is specially relevant for large sequences: Is it possible to change the way in which a pattern is counted? I am executing TKS over this database with 114 sequences, and I see that the top-K patterns found always achieve a support of 113-114. I guess that means that they appear at least once in each sequence.

But, is there a way to take into account not only if they appear or not appear, but also the number of occurrences?

Best!

Options: ReplyQuote
Re: What algorithm performs better for large sequences?
Date: January 08, 2019 08:16AM

Hi,

Currently, the number of occurrences in each sequence is not considered. If we consider the number of occurrences in each sequence, the problem become more complex and there could be multiple way of counting depending for example on if we allow occurrences to overlap, or keep only the minimal or maximal occurrences... It can get a bit complicated. For example, if you have a pattern ABC appearing in a sequence ABCDEFC we could say that ABC is appearing twice if overlapping occurrences are allowed but if we count this has twice, maybe it is not fair because there is a huge overlap between the two occurrences while in another sequence like ABCEFABC, the occurrences are not overlapping.

So currently most sequential pattern mining and sequential rule mining algorithm only consider whether a pattern appear or not in a sequence but do not count the number of occurrences. Adding this feature is certainly possible but not so simple and maybe require a few days of programming and also to make a clear definition about how to count. Also whether it can be computed efficiently using existing algorithms is another issue. Several algorithm will not really track all the information for counting many occurrences but just ensure that there is at least one.

Another way, could be to apply a post-processing step to calculate all occurrences after finding the patterns..

Best regards,

Options: ReplyQuote
Re: What algorithm performs better for large sequences?
Posted by: vrodriguezf
Date: January 09, 2019 06:30AM

Hi,

Yes, I understand that this is a complex issue indeed. A good paper that talks about different counting methods for sequential patterns is [1]. I saw this paper while using a popular R package for sequence analysis, TraMineR. In that package, the function seqefsub allows to search for frequent sequential patterns using different counting methods (Function documentation). Unfortunately, I tested it with my dataset and it crashed due to the relying underlying algorithm is quite old (I think it is PrefixSpan). That is why I think that the inclusion of more counting methods in SPMF could be of great help for some cases.

Best!

[1] Joshi, Mahesh V., George Karypis, and Vipin Kumar. "A universal formulation of sequential patterns." Proceedings of the KDD’2001 workshop on Temporal Data Mining. 1999.

Options: ReplyQuote
Re: What algorithm performs better for large sequences?
Date: January 09, 2019 04:50PM

Thanks for the reference to the paper. It is an interesting paper. Yes, that is what I was thinking by different counting methods for the occurrences. For some algorithms, I think it may not be easy such as SPAM. But for PrefixSpan, I think it would be possible but would require quite some work. Basically PrefixSpan only keep the first occurrence in each sequence. To modify PrefixSpan one would have to change how projected databases are constructed. If a pattern appears in a sequence, rather than creating a projected sequence, one would have to create a projected sequence for each occurrence. I think this would work but increase the memory quite a bit to track all the occurrences.

Options: ReplyQuote
Re: What algorithm performs better for large sequences?
Posted by: vrodriguezf
Date: January 10, 2019 01:17AM

Ok,

For now, since my idea is to use sequential pattern mining in order to reduce the size and dimensionality of my sequence database, I think I will go for compressing algorithms such as SeqKrimp and GoKrimp. Do these algorithms rely on a traditional counting method too, or just in the MDL principle?

Best!

Options: ReplyQuote
Re: What algorithm performs better for large sequences?
Date: January 10, 2019 06:55AM

Hi,

That is great. I am not sure. In SPMF, I have implemented many algorithms but the GoKrimp/SeqKrimp algorithms have not been implemented by me. I think they work quite differently from other algorithms but I have never read the details of the paper for those algorithms.

If you have question, you may contact the main author of the paper (Hoang Thanh Lam). He is quite friendly and I guess that he would answer your questions :-) Or you may also check the paper.

Best regards,



Edited 1 time(s). Last edit at 01/10/2019 06:55AM by webmasterphilfv.

Options: ReplyQuote


This forum is powered by Phorum and provided by P. Fournier-Viger (© 2012).
Terms of use.