Re: Input file in GoKrimp
Date: October 21, 2017 12:20AM
Hello,
- Current version of GoKrimp does not work for a sequence of itemsets but for list of items.
Does it mean that 1 -1 1 2 3 -1 1 3 -1 4 -1 3 6 -1 -2 ISNT supported, but 1 -1 123 -1 13 -1 4 -1 36 -1 -2 IS OK??
I will explain what this means. A sequence can be viewed as a sequence of events. For example, a sequence 1 -1 2 3 -1 4 -1 -2 means that the event 1 appeared, was followed by events 2 and 3 simultaneously, and those were then followed by event 4.
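To make the format concrete, here is a minimal Python sketch that reads one line in this format and groups the items into itemsets. It is only an illustration, not code from SPMF; the function name parse_sequence is made up for this example.

```python
def parse_sequence(line):
    """Parse one line in SPMF sequence format into a list of itemsets."""
    itemsets = []
    current = []
    for token in line.split():
        value = int(token)
        if value == -1:      # -1 closes the current itemset
            itemsets.append(current)
            current = []
        elif value == -2:    # -2 closes the sequence
            break
        else:
            current.append(value)
    return itemsets


sequence = parse_sequence("1 -1 2 3 -1 4 -1 -2")
print(sequence)  # [[1], [2, 3], [4]]
# An itemset with more than one item is a group of simultaneous events.
print(any(len(itemset) > 1 for itemset in sequence))  # True
```

Any sequence for which the last check prints True contains simultaneous events.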
Some algorithms in SPMF can handle sequences of this type, where several events are simultaneous.
However, the GoKrimp algorithm does not allow simultaneous events. This means that with GoKrimp you cannot have a sequence where the events (items) 2 and 3 are simultaneous.
Of course, you can try to work around this limitation by transforming your data. However, the result may then not be exactly what you want.
For example, as you said, you could transform a sequence such as:
1 -1 1 2 3 -1 -2
to:
1 -1 123 -1 -2
But then the algorithm will consider that there is an event (item) called "123". It will treat "123" as a single event rather than as a set of simultaneous events, and this can distort the results. For example, the algorithm would consider that the pattern 1 -1 1 does not appear in the above sequence, because it has no way of knowing that "123" includes the item (event) "1".
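The effect can be checked with a small Python sketch. It uses an ordinary subsequence test on lists of itemsets, which is an assumption made for illustration and not GoKrimp's own matching code; the name contains is made up for this example.

```python
def contains(sequence, pattern):
    """Return True if pattern occurs in sequence, itemset by itemset, in order."""
    position = 0
    for wanted in pattern:
        # Move forward to the next itemset that includes every wanted item.
        while position < len(sequence) and not set(wanted) <= set(sequence[position]):
            position += 1
        if position == len(sequence):
            return False
        position += 1
    return True


original = [[1], [1, 2, 3]]   # 1 -1 1 2 3 -1 -2
merged = [[1], [123]]         # 1 -1 123 -1 -2
pattern = [[1], [1]]          # 1 -1 1

print(contains(original, pattern))  # True
print(contains(merged, pattern))    # False
```

The pattern is found in the original sequence but not in the transformed one, because 123 is just another item label.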
Besides, if some of your items have values of 10 or more (two or more digits), you would have another problem. For example, if you transform
1 -1 1 11 12 -1 -2
to:
1 -1 11112 -1 -2
then how can you know which items appear in 11112? It could be read as 1, 11 and 12, but just as well as 11 and 112, or as 111 and 12. You will thus lose some information.
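To see how ambiguous the merged label is, this small Python sketch lists every way of cutting the digit string back into item labels. It assumes that items are positive integers written without leading zeros; the name possible_splits is made up for this example.

```python
def possible_splits(text):
    """List every way to cut a digit string into consecutive item labels."""
    if not text:
        return [[]]
    splits = []
    for cut in range(1, len(text) + 1):
        head = text[:cut]
        # Combine this first label with every split of the remaining digits.
        for rest in possible_splits(text[cut:]):
            splits.append([int(head)] + rest)
    return splits


candidates = possible_splits("11112")
print(len(candidates))            # 16
print([1, 11, 12] in candidates)  # True: the itemset we started from
print([11, 112] in candidates)    # True: an equally valid reading
```

Nothing in the merged label tells you which of these readings was the original itemset.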
So basically, it is a limitation of the GoKrimp algorithm that it cannot process sequences containing simultaneous events. As said above, you can transform your data to try to avoid this limitation, but you will still lose some information by doing this.
- It requires that the event occurs at least in N=25 sequences to perform the test properly.
I am not sure I understand this part, does it mean that it should be a pattern which is at least 25 times in the sequence database?
I am not sure about this either. I am the founder of SPMF, but I am not the one who designed the GoKrimp algorithm. I think it means that each item must appear in at least 25 sequences. But you would need to read the paper about GoKrimp or contact its authors to be sure.
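If that reading is right (and it is only my guess), you could check your data with something like the Python sketch below. It counts in how many sequences each item appears and reports the items below the threshold. The function name and the min_sequences parameter are made up for this example and are not part of SPMF.

```python
from collections import Counter


def items_below_threshold(lines, min_sequences=25):
    """Return items that occur in fewer than min_sequences sequences."""
    counts = Counter()
    for line in lines:
        # A set counts each item once per sequence; -1 and -2 are separators.
        items = {int(tok) for tok in line.split() if int(tok) >= 0}
        counts.update(items)
    return {item: n for item, n in counts.items() if n < min_sequences}


database = ["1 -1 2 -1 3 -1 -2", "1 -1 3 -1 -2", "2 -1 3 -1 1 -1 -2"]
print(items_below_threshold(database, min_sequences=3))  # {2: 2}
```

Here item 2 appears in only 2 of the 3 sequences, so it is reported when the threshold is 3. With a real database you would keep the threshold at 25.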
Best regards,
Philippe