Re: ask for help the output of TopSeqRules algorithm
Date: April 03, 2020 08:38AM
Hi again,
You are welcome. Glad it is clear.
I will explain the parameters, first for TopSeqRules and then for TNS.
The TopSeqRules algorithm finds the top-k most frequent sequential rules. The parameter k is the number of rules that you want to find. Typically, in my performance experiments, I have used values of k up to a few thousand. But realistically, if you are going to interpret these rules to learn something about your data, thousands of rules might be too many. I would recommend starting with something like k = 100, and then if the rules do not appear to be very interesting, you can increase k to get more rules.
If you set k = 100, the algorithm will give you the top 100. If you set k = 1000, it will give you the top 1000, etc.
The TopSeqRules algorithm is an exact algorithm, which means that it will find all the rules meeting the constraints set by the k and minconf parameters.
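To make the roles of k and minconf concrete, here is a toy Python sketch. It is not the TopSeqRules implementation (the real algorithm searches for rules in a sequence database, it does not rank a ready-made list). The `Rule` tuple, the `top_k_rules` helper and all the support and confidence values are hypothetical, made up only for illustration.

```python
from typing import NamedTuple

class Rule(NamedTuple):
    antecedent: frozenset
    consequent: frozenset
    support: int        # made-up count of sequences containing the rule
    confidence: float   # made-up confidence value

def top_k_rules(rules, k, minconf):
    """Keep rules with confidence >= minconf, then return the k with highest support."""
    valid = [r for r in rules if r.confidence >= minconf]
    valid.sort(key=lambda r: r.support, reverse=True)
    return valid[:k]

# Hypothetical rules, as if they had already been extracted from some data.
rules = [
    Rule(frozenset("a"), frozenset("b"), 40, 0.80),
    Rule(frozenset("a"), frozenset("c"), 35, 0.40),  # fails minconf = 0.5
    Rule(frozenset("b"), frozenset("c"), 30, 0.90),
    Rule(frozenset("c"), frozenset("d"), 10, 0.95),
]

top2 = top_k_rules(rules, k=2, minconf=0.5)
for r in top2:
    print(sorted(r.antecedent), "->", sorted(r.consequent), r.support, r.confidence)
```

In this toy data, the rule with support 35 is skipped because its confidence is below minconf, so the two rules kept are those with support 40 and 30. Raising k simply lets more of the valid rules through.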
Now a problem with TopSeqRules is that it may find some rules that appear to be redundant. For example, two rules A -> B and A -> B C may have exactly the same support and confidence, and in that case someone may not be interested in seeing both of them.
So to avoid this problem, we have proposed the TNS algorithm, which finds the top-k most frequent non-redundant sequential rules. The algorithm works in the same way as TopSeqRules, with the difference that TNS eliminates rules that are considered redundant.
Because eliminating redundant rules is not easy, the TNS algorithm is an approximate algorithm. It tries to find the top-k most frequent non-redundant rules, but it is possible that it misses some rules. To avoid missing rules, we have added the "delta" parameter, which acts as a kind of buffer. I will explain it in a simple way.
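As a rough illustration of what "redundant" can mean here, the sketch below assumes the following reading: a rule is redundant with respect to another rule if both have the same support and confidence, the other rule's antecedent is a subset of its antecedent, and its consequent is a subset of the other rule's consequent. This definition, and therefore which of the two rules gets dropped, is an assumption made for the example. The `is_redundant` helper and the numbers are hypothetical.

```python
def is_redundant(rule, other):
    """Return True if `rule` is redundant with respect to `other`.

    Each rule is a tuple (antecedent, consequent, support, confidence),
    where antecedent and consequent are frozensets of items.
    Assumed definition: same support and confidence, other's antecedent is a
    subset of rule's antecedent, and rule's consequent is a subset of other's.
    """
    x, y, sup, conf = rule
    x2, y2, sup2, conf2 = other
    return (rule != other
            and sup == sup2 and conf == conf2
            and x2 <= x and y <= y2)

# The example from the text: A -> B and A -> B C with identical measures.
r1 = (frozenset({"A"}), frozenset({"B"}), 20, 0.75)
r2 = (frozenset({"A"}), frozenset({"B", "C"}), 20, 0.75)

print(is_redundant(r1, r2))  # A -> B is covered by A -> B C
print(is_redundant(r2, r1))  # the larger rule is not covered by the smaller one
```

Under this assumed definition, A -> B is the rule that would be eliminated, since A -> B C says at least as much with the same support and confidence.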
If you set k = 500 and delta = 0, TNS will try to find the top-500 rules and return what it has found.
If you set k = 500 and delta = 300, TNS will try to find the top-800 rules and return the 500 best rules that are non-redundant.
Thus, no matter how you set delta, TNS will try to return k rules. The difference is that if you increase delta, you increase the probability that no rules are missed.
So, in my opinion, you can perhaps set delta to twice the value of k. That could give some good results.
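The buffer idea can be sketched in a few lines of Python. This is only a simplified picture following the explanation above (take the top k + delta rules, drop the redundant ones, return at most k). The actual TNS algorithm does this during its search, not as a post-processing step. The helpers `is_redundant` and `top_k_non_redundant`, the redundancy definition, and the rule values are all hypothetical assumptions for this example.

```python
def is_redundant(rule, other):
    """Assumed definition: same support/confidence, other's antecedent is a
    subset of rule's antecedent, rule's consequent is a subset of other's."""
    x, y, sup, conf = rule
    x2, y2, sup2, conf2 = other
    return (rule != other and sup == sup2 and conf == conf2
            and x2 <= x and y <= y2)

def top_k_non_redundant(rules, k, delta):
    """Rank by support, keep k + delta candidates, filter redundant, return up to k."""
    ranked = sorted(rules, key=lambda r: r[2], reverse=True)
    candidates = ranked[:k + delta]
    kept = [r for r in candidates
            if not any(is_redundant(r, o) for o in candidates)]
    return kept[:k]

# Hypothetical rules: (antecedent, consequent, support, confidence)
r1 = (frozenset("A"), frozenset("B"), 20, 0.75)
r2 = (frozenset("A"), frozenset("BC"), 20, 0.75)  # makes r1 redundant
r3 = (frozenset("B"), frozenset("C"), 15, 0.60)
r4 = (frozenset("C"), frozenset("D"), 10, 0.90)
rules = [r1, r2, r3, r4]

no_buffer = top_k_non_redundant(rules, k=2, delta=0)
with_buffer = top_k_non_redundant(rules, k=2, delta=1)
print(len(no_buffer), len(with_buffer))
```

With delta = 0, only two candidates are kept and one of them is redundant, so a single rule comes back and a valid rule is missed. With delta = 1, the extra candidate fills the gap and two non-redundant rules are returned.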
Generally,
- if you set k or delta to higher values, the algorithm may run for a longer time and use more memory. Thus, it is recommended to start with relatively small values and then increase them if not enough patterns are found.
- if you set k higher, you will find more patterns.
- if you set delta higher, you are more likely to find the exact result.
That is the main idea! Hope you find some good patterns in your data!
Best regards,
Philippe