The Data Mining Forum
IMPORTANT: This is the old Data Mining forum.
I keep it online so that you can read the old messages.

Please post your new messages in the new forum: https://forum2.philippe-fournier-viger.com/index.php
 
something wrong with my input file?
Posted by: stefan
Date: April 25, 2019 07:20AM

Hi all,

I have a question about my input file. For some reason, I can't seem to get any decent output.

It is a huge input file, so I tried to mimic what it looks like:

@CONVERTED_FROM_TEXT
@ITEM=98=20205AutoStopopen
@ITEM=5=10012Safetyguardstopstate
@ITEM=163=40538MaxTimeExpired
@ITEM=114=20314Enable2supervisionfault
@ITEM=126=20481SCOVRactive
@ITEM=85=20074Notallowedcommand
...
etc
@ITEM=-1=|

211 211 211 211 211 211 211 39 24 25 50 211 39 50 25 24 -2
39 50 25 24 39 5 98 3 24 4 50 25 39 25 50 24 39 24 50 25 39 5 98 3 25 50 24 -2
58 59 60 67 66 58 59 60 39 98 5 3 4 25 50 24 460 460 460 460 460 -2
....
etc etc
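
For reference, the header lines in the sample above can be read back into an id-to-label table. This is only a minimal Python sketch, assuming every mapping line has the form @ITEM=<id>=<label> as in the sample; the function name parse_item_header is hypothetical and not part of SPMF.

```python
def parse_item_header(lines):
    """Collect the id -> label mapping from '@ITEM=<id>=<label>' lines."""
    labels = {}
    for line in lines:
        line = line.strip()
        if not line.startswith("@ITEM="):
            continue  # e.g. '@CONVERTED_FROM_TEXT' or a sequence line
        # Split at most twice so a label that itself contains '=' stays whole.
        _, item_id, label = line.split("=", 2)
        labels[int(item_id)] = label
    return labels


# Lines copied from the sample above.
header = [
    "@CONVERTED_FROM_TEXT",
    "@ITEM=98=20205AutoStopopen",
    "@ITEM=5=10012Safetyguardstopstate",
    "@ITEM=-1=|",
]
labels = parse_item_header(header)
print(labels)
```

Because the result is a dictionary keyed by id, the order in which the @ITEM lines appear in the file does not affect the lookup.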

I have 491 integers, all mapped to these string values. Some things I spotted that may be causing 0 sequences to be found:

- the int --> strings are not in order at the top of my dataset
- I have to have splits in my sequence, marked with -1, like:
1 -1 1 2 3 -1 1 3 -1 4 -1 3 6 -1 -2

But I don't have those splits in my dataset. Each line is one huge sequence.

- I am using the wrong algorithm (PrefixSpan, CM_SPAM)

My goal is to find closed sequential patterns. What am I doing wrong? :-)

Re: something wrong with my input file?
Date: April 27, 2019 05:19PM

Hi,

Thanks for using SPMF.

Here are my comments:


- the int --> strings are not in order at the top of my dataset

This should not be a problem.

- I have to have splits in my sequence, marked with -1, like:
1 -1 1 2 3 -1 1 3 -1 4 -1 3 6 -1 -2


Yes, the missing -1 is most likely the reason why you do not get any patterns. If I remember correctly, the algorithm adds the items to the sequence only when it reaches a -1. If there is no -1, the line is treated as an empty sequence, so no pattern is found. Thus, it is important to fix this.

Besides that, within each itemset of a sequence, the items should follow one consistent order, such as alphabetical order. Otherwise, the algorithm may calculate patterns incorrectly. For example, this is OK:

1 2 3 -1 1 2 -1 -2 because every itemset follows the same order (1 before 2 before 3)

But if you have a sequence like this:

1 2 3 -1 2 1 -1 -2

it is not correct because the first itemset uses the order 1, 2, ... while the second itemset uses the order 2, 1.

Another potential problem is that in each itemset, it is not allowed to have the same item twice. For example:

1 1 1 1 1 -1 2 -1 -2

is incorrect because in the first itemset, the item 1 appears five times!
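
The three checks above (a missing -1, inconsistent order inside an itemset, and a repeated item inside an itemset) can be sketched as a small Python script. This is not SPMF code, and the function names are hypothetical. It also assumes that the "consistent order" is ascending numeric order of the item ids, which is only one possible choice.

```python
def parse_sequence(line):
    """Split one line into closed itemsets plus any items not closed by -1."""
    itemsets, current = [], []
    for token in line.split():
        value = int(token)
        if value == -2:          # end of the sequence
            break
        if value == -1:          # end of the current itemset
            itemsets.append(current)
            current = []
        else:
            current.append(value)
    return itemsets, current     # 'current' holds items never followed by -1


def check_line(line):
    """Return a list of human-readable problems found in one sequence line."""
    problems = []
    itemsets, dangling = parse_sequence(line)
    if dangling:
        problems.append(f"items {dangling} are not followed by -1")
    for index, itemset in enumerate(itemsets):
        if len(set(itemset)) != len(itemset):
            problems.append(f"itemset {index} repeats an item: {itemset}")
        elif itemset != sorted(itemset):
            # Assumption: ascending numeric order is the chosen item order.
            problems.append(f"itemset {index} is not in ascending order: {itemset}")
    return problems


for sample in ["1 2 3 -1 1 2 -1 -2",
               "1 2 3 -1 2 1 -1 -2",
               "1 1 1 1 1 -1 2 -1 -2",
               "211 211 39 24 -2"]:
    print(sample, "->", check_line(sample) or "OK")
```

Running such a check over the whole file before mining should show quickly which of the three problems applies.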

That is all. :-)

For the algorithms, it is fine. But you need to fix the input file to get some results.

Best regards,

Philippe

Re: something wrong with my input file?
Posted by: Stefan
Date: April 28, 2019 03:05AM

Hi Philippe,

Thanks for the response, I really appreciate the work that you put into this software!

Unfortunately, there are no -1 splits in my sequences. It is also natural for this data that the same event occurs more than once, and the ordering cannot be altered. Each line is one long sequence without stops (some lines have 50,000+ events). So will I not be able to use this software? My goal is to find patterns that occur in several sequences, for example:
1 2 3 1 2 3 2 3 2 3 4 5 6 4 5 5 6 6 6 5 -2
3 2 1 1 2 3 2 3 1 2 3 4 5 6 4 5 6 5 5 4 -2

would result in the sequential rules:
1 2 3
4 5 6
5 6
4 5
3 4
etc...

Otherwise I think I will have to stick with the arulesSequences package in R. However, it does not provide closed sequential pattern mining, and your software does! The problem is that I receive 2700 rules from R, and almost all of them are subsets of other rules...

Thanks in advance for the help!
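
To illustrate what "closed" removes: a pattern is closed if no longer pattern that contains it has the same support. The Python sketch below is only a post-filter over an already mined list, not how a closed-pattern mining algorithm works internally. It assumes that each pattern is a tuple with one event per element and that the list holds all frequent patterns with their supports. The function names and the sample supports are made up for illustration.

```python
def is_subsequence(small, big):
    """True if 'small' occurs inside 'big' in the same order (gaps allowed)."""
    remaining = iter(big)
    # 'item in remaining' consumes the iterator, so order is respected.
    return all(item in remaining for item in small)


def closed_only(patterns):
    """Keep patterns that no longer pattern with equal support contains.

    'patterns' maps a tuple of events to its support count.
    """
    closed = {}
    for pattern, sup in patterns.items():
        absorbed = any(
            len(other) > len(pattern)
            and other_sup == sup
            and is_subsequence(pattern, other)
            for other, other_sup in patterns.items()
        )
        if not absorbed:
            closed[pattern] = sup
    return closed


# Hypothetical mining output: pattern -> number of sequences containing it.
patterns = {
    (1, 2, 3): 2,
    (1, 2): 2,      # same support as (1, 2, 3), so it is dropped
    (2, 3): 3,      # higher support than (1, 2, 3), so it is kept
    (4, 5, 6): 2,
    (5, 6): 2,      # same support as (4, 5, 6), so it is dropped
}
print(closed_only(patterns))
```

The comparison is quadratic in the number of patterns, which is acceptable for a few thousand patterns but is not meant for very large outputs.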

Re: something wrong with my input file?
Date: April 29, 2019 07:11AM

Hi Stefan,

I just want to clarify some concepts of sequential pattern mining, because there may be some confusion.

A sequence is a list of itemsets, where each itemset is a set of items (events).

In a sequence, the itemsets are ordered (e.g. by time). But the items in an itemset are not ordered and are considered to be simultaneous. However, many implementations, such as those in SPMF, still require the items within an itemset to be listed in some order, purely for the purpose of processing (it allows some implementation optimizations). The order within an itemset can be any order, such as alphabetical order, but there has to be some order inside each itemset to respect the input format.

So basically, a sequence is an ordered list of itemsets (sets of events), and the items within an itemset represent simultaneous events. Thus, using this model, you can have events that are simultaneous or that follow each other.

In the format used by SPMF, the -1 is used to separate the itemsets. If you have a sequence like this in SPMF:

1 2 3 -1 -2

It means that events 3 2 and 1 are simultaneous!

But if you want to say that event 1 was followed by 2 and then by 3, you would have this sequence:

1 -1 2 -1 3 -1 -2

Or if you want to say that event 1 and 2 occurred at the same time and were followed by 4, then you would encode this as:

1 2 -1 4 -1 -2

But in SPMF, it is not allowed to have the same event twice within the same itemset. Thus this is not allowed:

1 1 -1 2 -1 -2 (this sequence would mean that 1 and 1 occurred at the same time and were followed by 2)

I think that what you are looking for may simply require inserting -1 between your events.
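
As a rough illustration, such a conversion could look like the following Python sketch. It assumes that every data line is a space-separated list of event ids ending with -2 and containing no -1, and that lines starting with @ are header lines to be copied unchanged. The function names and file paths are hypothetical.

```python
def convert_line(line):
    """Put every event in its own itemset: 'a b c -2' -> 'a -1 b -1 c -1 -2'."""
    events = [token for token in line.split() if token != "-2"]
    parts = []
    for event in events:
        parts.append(event)
        parts.append("-1")   # close the single-event itemset
    parts.append("-2")       # end of the sequence
    return " ".join(parts)


def convert_file(src_path, dst_path):
    """Rewrite a whole file, one line at a time, leaving header lines as-is."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            stripped = line.strip()
            if not stripped or stripped.startswith("@"):
                dst.write(line)
            else:
                dst.write(convert_line(stripped) + "\n")


print(convert_line("211 211 39 24 25 50 -2"))
```

Since each event ends up alone in its own itemset, a repeated event such as 211 211 no longer violates the rule about duplicates within one itemset, and the original order of the events is preserved.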

Best regards,

Philippe



Edited 2 time(s). Last edit at 04/29/2019 07:13AM by webmasterphilfv.


