The Data Mining Forum
IMPORTANT: This is the old Data Mining forum.
I keep it online so that you can read the old messages.

Please post your new messages in the new forum: https://forum2.philippe-fournier-viger.com/index.php
 
SPAM: sequential mining on string data / with sequence labels
Posted by: z
Date: December 24, 2014 06:59PM

Hi,
I am trying to mine sequential patterns. It seems that in SPMF, sequential pattern mining algorithms only work with integers?
How can I mine sequential patterns from strings?
Is encoding the strings as integers the only way to do it?

Thanks.



Edited 1 time(s). Last edit at 01/06/2015 06:59AM by webmasterphilfv.

Re: SPAM: sequential mining on string
Date: December 25, 2014 03:36AM

Hi,

Yes, it is currently the only way to do it.

Algorithms in SPMF use integers for efficiency reasons (it is much faster and uses less memory). I have often thought about adding a pre-processing tool to convert texts to sequences of integers. It is on the "to do" list, but I have not had time to do it yet.
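
In the meantime, the conversion can be done by hand with a small dictionary. What follows is only a minimal sketch and not part of SPMF: the class StringSequenceEncoder and its methods are hypothetical names, each step is assumed to be a single item, and the lines are written in the basic SPMF sequence format, where -1 ends an itemset and -2 ends a sequence. Algorithms with timestamps, such as the one in Example 69, use a slightly different input format, so check the documentation of the algorithm you choose.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of SPMF): maps string items to positive integers.
class StringSequenceEncoder {
    // Each distinct string gets the next free integer, in order of first appearance.
    private final Map<String, Integer> itemToId = new LinkedHashMap<>();
    private final List<String> idToItem = new ArrayList<>();

    // Returns the integer for an item, assigning a new one if the item is unseen.
    int encode(String item) {
        Integer id = itemToId.get(item);
        if (id == null) {
            id = itemToId.size() + 1; // start at 1 so that items stay positive
            itemToId.put(item, id);
            idToItem.add(item);
        }
        return id;
    }

    // Returns the original string for an integer produced by encode().
    String decode(int id) {
        return idToItem.get(id - 1);
    }

    // Writes one sequence of single-item steps as "id -1 id -1 ... -2".
    String encodeSequence(List<String> steps) {
        StringBuilder line = new StringBuilder();
        for (String step : steps) {
            line.append(encode(step)).append(" -1 ");
        }
        return line.append("-2").toString();
    }

    public static void main(String[] args) {
        StringSequenceEncoder encoder = new StringSequenceEncoder();
        // prints "1 -1 2 -1 1 -1 -2"
        System.out.println(encoder.encodeSequence(Arrays.asList("login", "search", "login")));
        // prints "2 -1 3 -1 -2"
        System.out.println(encoder.encodeSequence(Arrays.asList("search", "buy")));
        // prints "buy"
        System.out.println(encoder.decode(3));
    }
}
```

Keeping the same encoder object for the whole input file matters: the dictionary it builds is what lets you decode the integers in the mined patterns back to the original strings.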

Best,

Philippe

Re: SPAM: sequential mining on string
Posted by: zsoh
Date: December 26, 2014 07:19AM

Hi,
Thanks a lot for your quick answer.
By the way, I would like to know if there is any way to label the input sequences and to know in which sequences a pattern occurs. For example, after the support, the output could list the labels of the sequences in which the pattern occurs.

Thanks,
Zéphyrin

Re: SPAM: sequential mining on string
Date: December 26, 2014 04:10PM

Hi,

Yes, most of the algorithms internally use labels. However, I do not output them to avoid generating very large output files. Which algorithm are you using?

If you tell me which algorithm you are using, I can tell you how to modify the source code to show the labels, or add the feature.

Best,

Re: SPAM: sequential mining on string
Posted by: zsoh
Date: December 27, 2014 06:29AM

Hi,
Thanks a lot for your support.
I started with SPAM to become familiar with the tool.
But my goal is to mine closed sequential patterns, on the one hand without time constraints or dimensions (BIDE, ClaSP), and on the other hand with multiple dimensions (SeqDim_(BIDE+AprioriClose)).
Thank you for your map of data mining algorithms ... I follow it :-)

By the way, any advice on the most appropriate algorithm? For example, I have subjects following a set of steps (and/or activities) to do something. I put the steps/activities they followed in sequences and I would like to know if some subjects follow a pattern of steps/activities.

Thank you in advance.
Zéphyrin

Re: SPAM: sequential mining on string
Posted by: zsoh
Date: December 30, 2014 11:45AM

Hi,
I am using Fournier08-Closed+time (Example 69 in the documentation).
Could you please let me know how to add sequence labels in the output?

Thanks,
Zéphyrin

Re: SPAM: sequential mining on string
Date: December 30, 2014 03:08PM

Sure. Just give me about 24 hours to answer you. I am currently abroad and about to fly from China to Canada, which takes about 24 hours of airport and airplane time.

Sorry about the delay.

Philippe

Re: SPAM: sequential mining on string
Date: January 01, 2015 11:42PM

Hi Zéphyrin,

Sorry again about the delay.

Here is the answer to how to add the labels.

Let's say that we consider Example #69 of the documentation as you have requested.

To run this example from the source code, we can use MainTestSequentialPatternMining2_saveToFile.

The output looks like this:

<0> 2 -1 <1> 1 3 -1  #SUP: 3 
<0> 2 -1 <1> 1 2 -1  #SUP: 3 
<0> 2 -1 <1> 1 -1  #SUP: 4 
<0> 1 -1 <1> 1 2 -1  #SUP: 3 
<0> 1 2 -1 <1> 1 -1  #SUP: 3 
<0> 1 2 3 -1  #SUP: 3 
<0> 1 2 -1  #SUP: 4


Now, if you want to show the labels indicating in which sequences these patterns appear, all you need to do is modify the file AlgoFournierViger08 located in the package ca.pfv.spmf.algorithms.sequentialpatterns.fournier2008_seqdim so that these lines are uncommented:

// print the list of IDs of the sequences that contain this pattern.
if (prefix.getSequencesID() != null) {
	r.append(" #SID: ");
	for (Integer id : prefix.getSequencesID()) {
		r.append(id);
		r.append(' ');
	}
}

After you have done this, you can execute the example again, and the labels will be in the output file:

<0> 2 -1 <1> 1 3 -1  #SUP: 3 #SID: 0 1 3 
<0> 2 -1 <1> 1 2 -1  #SUP: 3 #SID: 1 2 3 
<0> 2 -1 <1> 1 -1  #SUP: 4 #SID: 0 1 2 3 
<0> 1 -1 <1> 1 2 -1  #SUP: 3 #SID: 0 1 2 
<0> 1 2 -1 <1> 1 -1  #SUP: 3 #SID: 0 1 2 
<0> 1 2 3 -1  #SUP: 3 #SID: 0 1 3 
<0> 1 2 -1  #SUP: 4 #SID: 0 1 2 3

Here, for example, the first line indicates "#SID: 0 1 3 ", which means that the pattern on this line appears in sequence 0, sequence 1 and sequence 3 (sequences are numbered 0, 1, 2, 3).
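
If you want to post-process the output file, each line can be split on the #SUP: and #SID: markers. This is only a rough sketch, not SPMF code: the class PatternLine is a hypothetical name, and it assumes every line has exactly the layout shown above, with #SUP: followed by #SID:.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical holder (not part of SPMF) for one line of the output file.
class PatternLine {
    final String pattern;            // e.g. "<0> 2 -1 <1> 1 3 -1"
    final int support;               // value after "#SUP:"
    final List<Integer> sequenceIds; // values after "#SID:"

    PatternLine(String pattern, int support, List<Integer> sequenceIds) {
        this.pattern = pattern;
        this.support = support;
        this.sequenceIds = sequenceIds;
    }

    // Parses a line of the form "<pattern>  #SUP: n #SID: id id ...".
    static PatternLine parse(String line) {
        int supAt = line.indexOf("#SUP:");
        int sidAt = line.indexOf("#SID:");
        if (supAt < 0 || sidAt < supAt) {
            throw new IllegalArgumentException("Unexpected line: " + line);
        }
        String pattern = line.substring(0, supAt).trim();
        int support = Integer.parseInt(line.substring(supAt + 5, sidAt).trim());
        List<Integer> ids = new ArrayList<>();
        for (String token : line.substring(sidAt + 5).trim().split("\\s+")) {
            if (!token.isEmpty()) {
                ids.add(Integer.parseInt(token));
            }
        }
        return new PatternLine(pattern, support, ids);
    }

    public static void main(String[] args) {
        PatternLine p = PatternLine.parse("<0> 2 -1 <1> 1 3 -1  #SUP: 3 #SID: 0 1 3 ");
        System.out.println(p.pattern);     // prints "<0> 2 -1 <1> 1 3 -1"
        System.out.println(p.support);     // prints "3"
        System.out.println(p.sequenceIds); // prints "[0, 1, 3]"
    }
}
```

For this algorithm the number of IDs after #SID: should equal the support, which gives a quick sanity check on the parsed lines.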

Hope this helps,

Philippe

Re: SPAM: sequential mining on string
Posted by: zsoh
Date: January 04, 2015 10:10PM

Dear Philippe,
Thank you so much for your help.
I am trying to understand how sequences are numbered (0,1,2,3,...).
In SequenceDatabase.processSequence(), when you instantiate a sequence, it seems that you define the ID as the size of the sequence: new Sequence(sequences.size())?
But, to my surprise, in the output the sequences are numbered 0,1,2 ...

My goal is to see if I can customize the sequence IDs. For example, add the ID at the beginning of each sequence, i.e., seqID: itemset1 -1 itemset2 -1 ...

Could you please clarify about sequence IDs?

Thank you,

Zéphyrin

Re: SPAM: sequential mining on string
Date: January 05, 2015 01:15AM

Hi,

Sequences are numbered according to their order in the input file.

The first one is 0. The second one is 1 and so on...

In the code, you have seen this line of code:

Sequence sequence = new Sequence(sequences.size());

But I think that you have confused the "sequence" variable with the "sequences" variable. The "sequences" variable contains all the sequences that have been read until now. And the "sequence" variable contains the current sequence that is being read.

Thus, if you look closely at the previous line, the id is not the size of the sequence. It is the number of sequences that have been read until now. So for the first sequence, sequences.size() will return 0 because no sequences have been read before. For the second sequence, sequences.size() will return 1 since 1 sequence has been read previously, etc. Thus, the ids will be 0,1,2...
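
To make this concrete, here is a simplified illustration of the same idea. It is not the actual SPMF code: the classes SequenceIdDemo and Sequence below are stripped-down stand-ins written only for this explanation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified illustration (not the actual SPMF classes).
class SequenceIdDemo {
    // Stand-in for a sequence: only its ID and its raw text are kept.
    static final class Sequence {
        final int id;
        final String content;

        Sequence(int id, String content) {
            this.id = id;
            this.content = content;
        }
    }

    // Reads all lines; each new sequence gets the number of sequences already read as its ID.
    static List<Sequence> readAll(List<String> lines) {
        List<Sequence> sequences = new ArrayList<>();
        for (String line : lines) {
            // sequences.size() is 0 for the first line, 1 for the second, and so on
            Sequence sequence = new Sequence(sequences.size(), line);
            sequences.add(sequence);
        }
        return sequences;
    }

    public static void main(String[] args) {
        List<Sequence> all = readAll(Arrays.asList("1 -1 2 -1 -2", "2 -1 3 -1 -2", "1 -1 -2"));
        for (Sequence s : all) {
            System.out.println(s.id + " => " + s.content);
        }
        // prints:
        // 0 => 1 -1 2 -1 -2
        // 1 => 2 -1 3 -1 -2
        // 2 => 1 -1 -2
    }
}
```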

I think that it would not be too hard to replace these ids with something else. The only requirement is that each sequence has a distinct id, otherwise the algorithm will not work properly. But the id could be an integer, a string or whatever you want, as long as you follow this rule.
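
One way to get custom labels without touching the algorithm is to keep them in a side table and translate the #SID values afterwards. The following is only a sketch under several assumptions: the class SequenceLabels is a hypothetical name, the "label: items" input layout is the one suggested in the question above and is not an SPMF format, and the stripped lines must be written to the SPMF input file in the same order in which they were registered.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical side table (not part of SPMF): position in the list = default sequence ID.
class SequenceLabels {
    private final List<String> labels = new ArrayList<>();
    private final Set<String> seen = new HashSet<>();

    // Splits "label: items", remembers the label, and returns the items part
    // that can be written to the SPMF input file.
    String register(String labeledLine) {
        int colon = labeledLine.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("Missing label: " + labeledLine);
        }
        String label = labeledLine.substring(0, colon).trim();
        if (!seen.add(label)) {
            // each sequence must have a distinct id
            throw new IllegalArgumentException("Duplicate label: " + label);
        }
        labels.add(label);
        return labeledLine.substring(colon + 1).trim();
    }

    // Translates the integer IDs found after "#SID:" into the custom labels.
    List<String> labelsFor(List<Integer> sequenceIds) {
        List<String> result = new ArrayList<>();
        for (int id : sequenceIds) {
            result.add(labels.get(id));
        }
        return result;
    }

    public static void main(String[] args) {
        SequenceLabels table = new SequenceLabels();
        // prints "1 -1 2 -1 -2"
        System.out.println(table.register("alice: 1 -1 2 -1 -2"));
        table.register("bob: 2 -1 3 -1 -2");
        table.register("carol: 1 -1 -2");
        // prints "[alice, carol]"
        System.out.println(table.labelsFor(Arrays.asList(0, 2)));
    }
}
```

This keeps the integer ids inside the algorithm unchanged, so the distinct-id rule is still respected, and the labels only appear when reading the results.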

Best,

Philippe



Edited 1 time(s). Last edit at 01/05/2015 01:17AM by webmasterphilfv.

Re: SPAM: sequential mining on string
Posted by: zsoh
Date: January 05, 2015 04:38AM

Thank you very much,
I was confused. Your explanation makes it clear now.

Regards,
Zéphyrin

Re: SPAM: sequential mining on string
Date: January 05, 2015 04:57AM

You are welcome.

By the way, I think that you speak French just like me. So if you have any more questions, you may also ask me in French if you prefer to use French or we may use English. No problem for me ;-)



Edited 1 time(s). Last edit at 01/05/2015 04:58AM by webmasterphilfv.

Re: SPAM: sequential mining on string
Posted by: zsoh
Date: January 05, 2015 07:55AM

Thanks/Merci!
I will let you know (possibly in French :-)) if I face other problems.

Thank you in advance for your availability!
Zéphyrin



This forum is powered by Phorum and provided by P. Fournier-Viger (© 2012).