choosing minsup in SPM

The Data Mining Forum

IMPORTANT: This is the old Data Mining forum.
I keep it online so that you can read the old messages.

Please post your new messages in the new forum: https://forum2.philippe-fournier-viger.com/index.php

Goto Topic: Previous•Next

Goto: Forum List•Message List•New Topic•Search•Log In•Print View

choosing minsup in SPM

Posted by: jbd

Date: June 11, 2020 08:25AM

Hello,

Concerning your paper: "Fast Vertical Mining of Sequential Patterns Using Co-occurrence Information", I would like to ask you on what basis did you make your choice of the values of minsup for each dataset ?

Thank you,
jbd

Options: Reply•Quote

Re: choosing minsup in SPM

Posted by: webmasterphilfv

Date: June 11, 2020 04:02PM

Hi,

In that paper, for the experiments, the minsup threshold was varied so as to show the performance gap between the algorithms. I started with a high minsup threshold and lower it down until it became very slow and i could see clearly the difference between the algorithms in terms of performance.

Generally, the minsup threshold will be different on each dataset. On some datasets, a minsup of 0.8 may results in finding 0 patterns (e.g. the BMS dataset I think) and on other datasets like Snake, you may find perhaps millions of patterns for the same threshold value. It really depends on the characteristics of the datasets (whether the sequences are long, whether the sequences are very similar or not, if there are many different items...)

From a practical perspective, if you use these algorithms in an application, I would suggest to start from a high minsup value and decrease it until when you find enough patterns.

As minsup is decreased, the number of frequent patterns may increase exponentially, and the speed or memory usage of algorithms may also increase in a similar way because as minsup is decrease the part of the search space to be explored may also increase exponentially.

If you dont want to set the minimum support, there exists a more intuitive way, which is to find the top-k sequential patterns. Here, instead of using minsup, you just say that you want to find for example the top-500 most frequent patterns. An algorithm like TKS will do that. Internally, it will automatically adjust the minsup threshold to find exactly k patterns (e.g. the top 500 patterns). TKS is very similar to the CM-SPAM algorithm in the PAKDD 2014 that you have read. The main difference is that TKS ask the user to set k the number of patterns to be found instead of the minsup threshold. This is in my opinion more intuitive.

Hope this helps.

Best regards,

Philippe

Edited 1 time(s). Last edit at 06/11/2020 04:03PM by webmasterphilfv.

Options: Reply•Quote

Re: choosing minsup in SPM

Posted by: jbd

Date: June 11, 2020 11:41PM

This is very helpful. Thank you a lot for your detailed response.

I have also read your blog post yesterday concerning the equation that you propose for choosing minsup, and I find it very useful.

jbd

Options: Reply•Quote

Re: choosing minsup in SPM

Posted by: webmasterphilfv

Date: June 12, 2020 09:42AM

Hi again,

You are welcome!

Yes, that equation is another solution to that problem of how to set the minsup!

Best regards,

Philippe

Options: Reply•Quote