The Data Mining Forum                             open-source data mining software data mining conferences Data Science for Social and Behavioral Analytics DSSBA 2022 data science journal
IMPORTANT: This is the old Data Mining forum.
I keep it online so that you can read the old messages.

Please post your new messages in the new forum: https://forum2.philippe-fournier-viger.com/index.php
 
ask for help the output of TopSeqRules algorithm
Posted by: xiaowei
Date: April 03, 2020 02:42AM

Hi everyone,

I am trying to use the TopSeqRules algorithm to analyze travel diaries to identify effective sequential rules.

I have constructed a sequence database and I want to follow the example explaining how to run the TopSeqRules algorithm(http://www.philippe-fournier-viger.com/spmf/TopSeqRules.php) using the graphical interface. But when I click "Run algorithm", the output is empty and the sequential rule count is 0. I don't find the reason, could you give me some advice?

One itemset is included in a sequence in my input file,
eg., 1 2 3 -1 -2
1 2 3 -1 -2
1 2 3 -1 -2
2 3 -1 -2 the output is empty.

And I have tried to change the first sequence,

1 2 3 -1 3 -1 -2
1 2 3 -1 -2
1 2 3 -1 -2
2 3 -1 -2 the output is normal and not empty.

Why is that? does it have to include at least two itemsets in one sequence?

Options: ReplyQuote
Re: ask for help the output of TopSeqRules algorithm
Date: April 03, 2020 04:58AM

Hi Xiaowei,

Thanks for using SPMF and TopSeqRules! The topic of analyzing this data is interesting.

I will provide some explanation. The input of a sequential rule mining algorithm is a sequence database. A sequence database is a set of sequences. Each sequence is a list of itemsets ordered by time. An itemset contains one more more items that are considered to appear at the same time. For example, a sequence:

1 -1 2 -1 -2

means that the item 1 appeared and was then followed by 2.

Another example, if you have a sequence:

1 3 -1 2 -1 -2

It means that item 1 and 3 appeared at the same time, and were then followedy by item 2.

If you have a sequence with a single itemset like this:

1 2 -1 -2

it means that the items (1 and 2) appeared at the same time.


The goal of finding sequential rules is to find rules of the form X --> Y, which are interpreted as if some items X appear, then they will be followed by the items of Y. Thus, according to that definition a sequential rule indicates that something (Y) will happen after something else (X).

If you have a sequence with a single itemset, then it is impossible to find a sequential rule because all items in that sequence are assumed to be simultaneous. If you have at least two itemsets, then you may find some rules.
For example, a sequence:

1 -1 2 -1 -2

contains the rule 1 --> 2 because 1 appears before 2.

Hope that this is clear!

So in other words, to find sequential rules, you need to have some event(s) that appear before some other event(s).

If you don't care about the time ordering, then you could look at other algorithms like association rule mining. Those algorithms do not care about the time ordering. The input format is a little different.

Best regards

Options: ReplyQuote
Re: ask for help the output of TopSeqRules algorithm
Posted by: xiaowei
Date: April 03, 2020 05:34AM

Thanks a lot.

Your explanations are very clear, and I suddenly understand that my sequences are converted in the text file incorrectly.

Also, could you tell me how to set the number of K and delta before running the TNS algorithm? what is the set experience or criteria?


Looking forward to your reply.

Kind regards,
xiaowei

Options: ReplyQuote
Re: ask for help the output of TopSeqRules algorithm
Date: April 03, 2020 08:38AM

Hi again,

You are welcome. Glad it is clear.

I will explain for the parameters. First, I will explain for TopSeqRules, and then for TNS.

The TopSeqRules algorithm will find the top-k most frequent sequential rules. The parameter k means the number of rules that you want to find. Typically in my experiments for performance, I have used values of k up to a few thousands. But realistically, if you are going to interpret these rules to learn something about your data, thousands of rules might be too much. I would recommend to start perhaps with something like k = 100 and then if the rules do not appear to be very interesting, you may increase the parameter k to have more rules.

If you set k = 100, the algorithm will give you the top 100. If you set k = 1000, it will give you the top 1000, etc.

The TopSeqRules algorithm is an exact algorithm, which means that it will find all the rules meeting the constraints set by the k and minconf parameters.

Now a problem with TopSeqRules is that it may find some rules that appear to have some kind of redundancy. For example, two rules A -> B and A-> B C may have exactly the same support and confidence, and thus someone may not be interested in finding these two rules for that reason.

So to avoid this problem, we have proposed the TNS algorithm, which finds the top-k most frequent non redundant sequential rules. The algorithms works exactly in the same way as TopSeqRules with the difference that TNS eliminates some rules that are said to be redundant.

Because eliminating redundant rules is not easy, the TNS algorithm is an approximate algorithm. It tries to find the top-k most frequent "non redundant rules but it is possible that it misses some rules. To avoid missing rules, we have added the "delta" parameter, which act as a kind of buffer. I will explain it in a simple way.

If you set k = 500 and delta = 0, TNS will try to find the top-500 rules and return what it has found.

If you set k = 500 and delta = 300, TNS will try to find the top-800 rules and return the 500 best rules that are non redundant.

Thus, no matter how you set delta, TNS will try to return k rules. But the difference is that if you increase delta, you will increase the probability that you will not miss any rules.

So, in my opinion, you can perhaps set delta to twice the value of k. That could give some good results.

Generally,
- if you set k or delta to higher values, the algorithm may run for longer time, and use more memory. Thus, it is recommended to start with relatively small values and then increase if not enough patterns are found.
- if you set k higher, you will find more patterns
- if you set delta higher, you are more likely to find the exact result

That is the main idea! Hope you find some good patterns in your data!

Best regards,

Philippe

Options: ReplyQuote
Re: ask for help the output of TopSeqRules algorithm
Posted by: xiaowei
Date: April 03, 2020 06:50PM

Thanks for your detailed and timely answers. It is very clear and I learn a lot from it. Thank you very much again.

Options: ReplyQuote
Re: ask for help the output of TopSeqRules algorithm
Posted by: tassieTom
Date: April 04, 2020 05:29PM

I had an additional question in this post, but solved it--ignore my post here please.
Tom Berger



Edited 1 time(s). Last edit at 04/04/2020 11:01PM by tassieTom.

Options: ReplyQuote
Re: ask for help the output of TopSeqRules algorithm
Date: April 05, 2020 05:50AM

Hi Tom,

Interesting to know that you are using the same algorithm!

TNS is designed to find the top-k non redundant sequential rules

But because TNS is an approximate algorithm, it means that TNS can sometimes miss some rules because of its search space pruning strategies. So, if you set k = 5, TNS should show you 5 rules but perhaps that these rules will not be the top rules. Maybe it will miss some of the top rules.

So this is the reason why I have introduced the delta parameter. It is to increase the chance that you will not miss any top rules. The larger you will increase delta, the less likely that you will miss some top-k rules.

If you set k = 8 and delta = 0, TNS will maybe give you 8 rules but maybe those are not exactly the top-8. Maybe some of these rules are not from the top-8 but almost in the top-8.

Now, if you set k =8 and delta = 1000, you will be very likely to get the top-8 rules because internally, TNS will try to find the top 1008 rules, and then among those, it will give you the best 8.

Using the parameter delta also decrease the performance because TNS needs to search for more rules. So it may use more memory and have a longer runtime for large delta values.

So, if the algorithm is fast enough, I would recommend to increase delta. It should give you a result that is more likely to be the real top-k rules.

About consistency, do you mean consistency accross datasets? or consistency in terms of runtime?

Best regards,

Philippe



Edited 1 time(s). Last edit at 04/05/2020 07:23PM by webmasterphilfv.

Options: ReplyQuote
Re: ask for help the output of TopSeqRules algorithm
Posted by: tassieTom
Date: April 05, 2020 05:41PM

Hi Philippe,

Thanks for taking the time to answer this.
Your response has been very helpful.
By consistent I meant the same output with the same input.
It looks a bit like a monte carlo run where there is variation in output, but you explained it very well.

I will post what I am using it for a little later today, then you will see what I mean.
Regards,
Tom Berger

Options: ReplyQuote
Re: ask for help the output of TopSeqRules algorithm
Date: April 10, 2020 07:17AM

Hi Tom,

I answered your other message but did not answer that one yet. So here I am.

Yes, there can be a bit of randomness in the results. It is because the TNS algorithm uses some data structure internally that is a RedBlackTree to decide which patterns to explore first, and the order may change in that tree from an execution to another if some patterns have exactly the same support. It is because when two patterns have the same support and confidence, we use the hashcode() of each pattern to compare them and select which one to explore first. But the hashcode() method may give different result from an execution to another. I think that this is the main reason.

Maybe it could be fixed by not using hashcode() but defining another order on the rules instead.

Best regards,

Philippe

Options: ReplyQuote


This forum is powered by Phorum and provided by P. Fournier-Viger (© 2012).
Terms of use.