Pattern Mining for very large sequences


Posted by: Thomas Shippey

Date: July 24, 2013 05:06PM

Hi

I very much like your work on sequential pattern mining, and your tool has been a great help to me! =)

I am currently trying to mine very large sequences. These sequences are like DNA sequences, only instead of being built around 4 base pairs, they could be made up of 92 'base pairs'! The sequences can vary from 1 to 1000s of transactions in length.

I have been trying to use PrefixSpan; however, it is proving to be very slow.

Is there a better sequential pattern mining algorithm that I should be using? Sorry if I am being a bit naive in this subject, my main area of research is in defect prediction.

Thanks for your help in this matter.

Thomas Shippey


Re: Pattern Mining for very large sequences

Posted by: webmasterphilfv

Date: July 24, 2013 05:54PM

Hi,

Welcome to the forum. I'm glad that you like the software.

For your application, I think that SPAM would be a better algorithm than PrefixSpan. It produces the same output, but in general SPAM is faster than PrefixSpan on dense datasets. I think that your dataset is dense because you probably have a small number of items that appear very frequently in each sequence.
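One rough way to check whether a dataset is dense in this sense is to compute, for each item, the fraction of sequences that contain it. This is only a minimal sketch: the function name `item_support` and the toy data are made up for illustration, and it assumes each sequence has already been loaded as a plain list of item ids.

```python
from collections import Counter

def item_support(sequences):
    """Return {item: fraction of sequences containing the item at least once}."""
    counts = Counter()
    for seq in sequences:
        # set() so that an item is counted only once per sequence
        counts.update(set(seq))
    n = len(sequences)
    return {item: c / n for item, c in counts.items()}

# Toy data (hypothetical): each sequence is a list of item ids
sequences = [
    [1, 2, 1, 3],
    [1, 2, 2],
    [1, 3, 4],
    [2, 1],
]
support = item_support(sequences)
for item, frac in sorted(support.items()):
    print(item, frac)
```

If most items show up in a large share of the sequences, the dataset is probably dense.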

Besides changing algorithms, you can try:
- Raising the minsup parameter. The higher minsup is, the faster the algorithm will be. You can start with a high value and then lower it gradually.
- If some data is not important, you could do some preprocessing to filter out the unnecessary data, and the algorithm will be faster.
- In the latest version of SPMF (from last month), there is a max length parameter for sequential pattern mining with SPAM and PrefixSpan. If you set it to a smaller value, the algorithm will be faster. You can start with a small value such as 2 and then increase it to find longer patterns.
- Would it make sense to split your very long sequences? If you could split them, or just consider a subset, it could make the algorithms faster.
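The preprocessing and splitting ideas above could be sketched roughly as follows. This is not part of SPMF: the helper names `filter_infrequent_items` and `split_sequence` are hypothetical, and the sketch assumes the sequences are plain lists of item ids with minsup given as a fraction of the sequences. Removing items that are below minsup does not lose any frequent pattern, since a pattern cannot be more frequent than the items it contains. Splitting is different: patterns that cross a chunk boundary are lost, and support is then counted per chunk, so it changes the results.

```python
from collections import Counter

def filter_infrequent_items(sequences, minsup):
    """Remove items that occur in fewer than minsup (a fraction) of the sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(set(seq))  # count each item once per sequence
    threshold = minsup * len(sequences)
    keep = {item for item, c in counts.items() if c >= threshold}
    # Sequences that become empty are kept, so the number of sequences
    # (and therefore the meaning of a relative minsup) does not change.
    return [[item for item in seq if item in keep] for seq in sequences]

def split_sequence(seq, max_len):
    """Cut one long sequence into consecutive chunks of at most max_len items."""
    return [seq[i:i + max_len] for i in range(0, len(seq), max_len)]

# Toy data (hypothetical)
sequences = [[1, 2, 9], [1, 2], [2, 1, 1]]
filtered = filter_infrequent_items(sequences, minsup=0.6)
chunks = split_sequence([1, 2, 3, 4, 5], max_len=2)
print(filtered)
print(chunks)
```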

Those are a few ideas that I can think of for now.

Hope that this helps,

Philippe


Re: Pattern Mining for very large sequences

Posted by: Thomas Shippey

Date: July 25, 2013 07:07AM

Thank you for your swift reply!

I will try out your suggestions. I will let you know how I get on.

Regards

Thomas


Re: Pattern Mining for very large sequences

Posted by: Thomas Shippey

Date: August 08, 2013 08:39AM

Hi Philippe

I have been trying out your suggestions and it seems the simplest is best.

I changed to SPAM, and that reduced the running time drastically compared with PrefixSpan, even with a very low minsup.

I did some analysis on the most common patterns, and no pattern longer than about 12 appears in enough of the sequences to give a significant result. So I was able to limit the max pattern length to 12, which has helped to make my future analysis more efficient.
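For anyone wanting to do a similar check, a length analysis like this could be sketched as below. It is only an illustration, not the exact script used here. It assumes the patterns were saved one per line in the form `1 -1 2 3 -1 #SUP: 4`, where `-1` ends an itemset and the number after `#SUP:` is the support, and it assumes "length" means the number of items. The function names are hypothetical.

```python
from collections import Counter

def pattern_length(line):
    """Number of items in one pattern line such as '1 -1 2 3 -1 #SUP: 4'."""
    pattern_part = line.split("#SUP:")[0]
    # -1 marks the end of an itemset; every other token is an item
    return sum(1 for tok in pattern_part.split() if tok != "-1")

def length_histogram(lines):
    """Count how many patterns there are of each length, skipping blank lines."""
    return Counter(pattern_length(line) for line in lines if line.strip())

# Toy output lines (hypothetical)
lines = [
    "1 -1 #SUP: 4",
    "1 -1 2 -1 #SUP: 3",
    "1 -1 2 3 -1 #SUP: 2",
    "",
]
hist = length_histogram(lines)
longest = max(hist)
print(dict(hist), longest)
```

The longest length that still has enough patterns is then a candidate for the max pattern length setting.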

Thanks for your great help

Thomas
PhD Student
University of Hertfordshire


Re: Pattern Mining for very large sequences

Posted by: webmasterphilfv

Date: August 08, 2013 08:47AM

Hi Thomas,

I'm glad to know that you are having better results now. Thanks for letting me know!

Best,

Philippe
