The Data Mining Forum
IMPORTANT: This is the old Data Mining forum.
I keep it online so that you can read the old messages.

Please post your new messages in the new forum: https://forum2.philippe-fournier-viger.com/index.php
 
Input file in GoKrimp
Posted by: rogelio andrade
Date: October 20, 2017 09:47AM

Hello,

I am trying to use the GoKrimp algorithm, but I don't get any output for my sequence database. Here is what I understand from the documentation on the webpage:

- Current version of GoKrimp does not work for a sequence of itemsets but for list of items.
Does it mean that 1 -1 1 2 3 -1 1 3 -1 4 -1 3 6 -1 -2 is NOT supported, but 1 -1 123 -1 13 -1 4 -1 36 -1 -2 is OK?

- It requires that the event occurs at least in N=25 sequences to perform the test properly.
I am not sure I understand this part. Does it mean that a pattern must occur at least 25 times in the sequence database?

My sequence database contains itemsets (I have tried GSP, SPAM, SPADE, etc. successfully), so I am simply deleting the spaces between the integers in each itemset (as I showed above).

Any advice to help me understand how I should structure my sequence database is appreciated!

Re: Input file in GoKrimp
Date: October 21, 2017 12:20AM

Hello,

- Current version of GoKrimp does not work for a sequence of itemsets but for list of items.
Does it mean that 1 -1 1 2 3 -1 1 3 -1 4 -1 3 6 -1 -2 is NOT supported, but 1 -1 123 -1 13 -1 4 -1 36 -1 -2 is OK?


I will explain what this means. A sequence can be viewed as a sequence of events. For example, a sequence 1 -1 2 3 -1 4 -1 -2 means that the event 1 appeared, was followed by events 2 and 3 simultaneously, and those were then followed by event 4.
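To make the notation concrete, here is a minimal Python sketch that reads one line in this format, assuming that -1 closes an itemset and -2 closes the sequence. The function name parse_spmf_sequence is only for illustration and is not part of SPMF.

```python
def parse_spmf_sequence(line):
    """Split one sequence line into a list of itemsets.

    Assumption: -1 closes an itemset and -2 closes the sequence.
    """
    itemsets, current = [], []
    for token in line.split():
        if token == "-2":
            break
        if token == "-1":
            itemsets.append(current)
            current = []
        else:
            current.append(int(token))
    return itemsets


# Event 1, then events 2 and 3 together, then event 4.
print(parse_spmf_sequence("1 -1 2 3 -1 4 -1 -2"))  # [[1], [2, 3], [4]]
```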

Some algorithms in SPMF can handle this type of sequence, where several events are simultaneous.

However, the GoKrimp algorithm does not allow simultaneous events. This means that, with that algorithm, you cannot have a sequence where the events (items) 2 and 3 are simultaneous.
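If you want to check whether a line of your file contains simultaneous events before trying GoKrimp, a small Python sketch like the following could help. It is only an illustration (the function name simultaneous_itemsets is made up) and it assumes the same -1 / -2 separators as above.

```python
def simultaneous_itemsets(line):
    """Return the itemsets of one sequence line that hold more than one item."""
    found, current = [], []
    for token in line.split():
        if token == "-2":
            break
        if token == "-1":
            if len(current) > 1:
                found.append(current)
            current = []
        else:
            current.append(token)
    return found


# The sequence from the question has three itemsets with simultaneous events.
print(simultaneous_itemsets("1 -1 1 2 3 -1 1 3 -1 4 -1 3 6 -1 -2"))
# A sequence of single items has none.
print(simultaneous_itemsets("1 -1 2 -1 3 -1 -2"))
```

An empty list for every line of the file would mean that the database contains only single events.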

Of course, you can use some tricks by transforming your data to try to avoid this limitation. However, by doing this, it may not work exactly as you want.

For example, as you said, you could transform a sequence such as:

1 -1 1 2 3 -1 -2

to:

1 -1 123 -1 -2

But then the algorithm will consider that there is an event (item) called "123". It will treat it as a single event rather than as a set of simultaneous events, which can cause problems in the results. For example, the algorithm would consider that the pattern 1 -1 1 does not appear in the above sequence, because it would not know that "123" includes the item (event) "1".
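The effect can be illustrated with a small Python sketch. This is not how GoKrimp is implemented; it is just a plain subsequence test (the helper contains is made up for this example) applied to the original sequence and to the transformed one.

```python
def contains(sequence, pattern):
    """True if each itemset of `pattern` is included in a distinct itemset
    of `sequence`, in the same order."""
    pos = 0
    for wanted in pattern:
        # Skip itemsets that do not include all the wanted items.
        while pos < len(sequence) and not wanted <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


original = [{"1"}, {"1", "2", "3"}]  # 1 -1 1 2 3 -1 -2
merged = [{"1"}, {"123"}]            # 1 -1 123 -1 -2
pattern = [{"1"}, {"1"}]             # 1 -1 1

print(contains(original, pattern))  # True
print(contains(merged, pattern))    # False: "123" is a different item than "1"
```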

Besides, if some of your items have values of 10 or more (more than one digit), then you would have a problem. For example, if you transform

1 -1 1 11 12 -1 -2

to:

1 -1 11112 -1 -2

then how can you know which items appear in 11112? Is it 1, 11, 111, 112, or 12? You thus lose some information.

So basically, it is a limitation of the GoKrimp algorithm that it cannot process sequences containing simultaneous events. As said above, you can transform your data to try to avoid this limitation, but you will still lose some information by doing this.

- It requires that the event occurs at least in N=25 sequences to perform the test properly.
I am not sure I understand this part. Does it mean that a pattern must occur at least 25 times in the sequence database?


I am not sure about this either. I am the founder of SPMF, but I am not the one who designed the GoKrimp algorithm. I think it means that each item must appear in at least 25 sequences. But you would need to read the paper about GoKrimp or contact the author to be sure about that.
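If that interpretation is right (and it is only my guess), you could check your data with a small Python sketch like this one. The function name items_below_threshold and the tiny example database are made up for illustration; the threshold of 25 comes from the documentation quoted above.

```python
from collections import Counter


def items_below_threshold(lines, min_sequences=25):
    """Return {item: number of sequences containing it} for the items that
    appear in fewer than `min_sequences` sequences."""
    counts = Counter()
    for line in lines:
        # Use a set so that each item is counted at most once per sequence.
        items = {tok for tok in line.split() if tok not in ("-1", "-2")}
        counts.update(items)
    return {item: n for item, n in counts.items() if n < min_sequences}


# Hypothetical three-sequence database, checked with a threshold of 3.
database = [
    "1 -1 2 -1 3 -1 -2",
    "1 -1 3 -1 -2",
    "2 -1 1 -1 -2",
]
print(items_below_threshold(database, min_sequences=3))
```

On a real file, you would pass the lines of the file and keep the default threshold of 25.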

Best regards,

Philippe



This forum is powered by Phorum and provided by P. Fournier-Viger (© 2012).