The Data Mining Forum                             open-source data mining software data mining conferences Data Science for Social and Behavioral Analytics DSSBA 2022 data science journal
IMPORTANT: This is the old Data Mining forum.
I keep it online so that you can read the old messages.

Please post your new messages in the new forum: https://forum2.philippe-fournier-viger.com/index.php
 
Mining patterns in a single sequence
Posted by: Enrique
Date: December 23, 2020 01:21AM

Hi all,

I am a researcher in another field (computational fluid dynamics), but in a certain project in which I participate I need to find repeated patterns in a given sequence. My data structure is as follows: given a certain alphabet (e.g. {A, B, C, D, E}), I have a SINGLE sequence (e.g. "ABCCEEADEBCEABDCCEEBBDEAECEBCCEEBDE" ).I want to find the closed patterns in this sequence with a minimum threshold support (e.g. more than 2 repetitions), and whose length is greater or equal to lMin (e.g. 3) and less than lMax (e.g. 10). The desirable result in this example will be "CCEE", together with the positions of of the repetitions within the sequence.

I have been reading pattern mining literature for a couple of weeks (I am a complete novice), until I found this library and this forum. I have some questions that maybe you can help me to solve:

- The input format for the algorithms in this library is a "transaction database"; that is, several sequences where patterns that are repeated among them will be searched. Why is this database format so studied? According to my intuition, the format of my problem (a single data sequence) seems more general, am I wrong?

- Is there a way to use the algorithms for transaction databases in "single sequences" (I don't know if this is the technical name)? Or is there a particual algorithm that would be easily adaptable to my case? Or perhaps am I looking at the inappropriate literature?

Thank you very much in advance

Options: ReplyQuote
Re: Mining patterns in a single sequence
Date: December 25, 2020 03:31AM

Enrique Wrote:
-------------------------------------------------------
> Hi all,
>
> I am a researcher in another field (computational
> fluid dynamics), but in a certain project in which
> I participate I need to find repeated patterns in
> a given sequence. My data structure is as follows:
> given a certain alphabet (e.g. {A, B, C, D, E}), I
> have a SINGLE sequence (e.g.
> "ABCCEEADEBCEABDCCEEBBDEAECEBCCEEBDE" ).I want to
> find the closed patterns in this sequence with a
> minimum threshold support (e.g. more than 2
> repetitions), and whose length is greater or equal
> to lMin (e.g. 3) and less than lMax (e.g. 10). The
> desirable result in this example will be "CCEE",
> together with the positions of of the repetitions
> within the sequence.
>
> I have been reading pattern mining literature for
> a couple of weeks (I am a complete novice), until
> I found this library and this forum. I have some
> questions that maybe you can help me to solve:
>

>

>
> Thank you very much in advance


Hi,

Nice to read your message, and to know that you are interested in applying pattern mining in your field.

I will try to provide some answers to your questions.

> - The input format for the algorithms in this
> library is a "transaction database"; that is,
> several sequences where patterns that are repeated
> among them will be searched. Why is this database
> format so studied? According to my intuition, the
> format of my problem (a single data sequence)
> seems more general, am I wrong?

The idea of "transaction database" comes from the application of analyzing customer transactions data. In that context, a transaction database is a set of transactions made by customers. A transaction indicates a set of items purchased by a customer and may be ordered by the time.

Having said that, the concept of transaction database is not restricted to that application. Actually a transaction database is just a list of sets of symbols.

You can think of a sequence as a set of transactions. For example, this sequence:

ABCCEEADEBCEABDCCEEBBDEAECEBCCEEBDE

Can be viewed as:

Transaction 1: A
Transaction 2: B
Transaction 3: C
Transaction 4: C
Transaction 5: E
...

Then you can apply the algorithm for finding patterns in transaction database on your data.

It is to be noted however that not all algorithms are designed to consider the time or the ordering in transaction databases. Most itemset mining or association rule mining algorithms do not care about the order between transactions.

But there are a few algorithms that do! For example, you could check the episode mining algorithms such as TKE in SPMF. The TKE algorithms takes as input a transaction database or we can say a sequence of events (it is the same thing). Then it will find the top-k most frequent episodes in the sequence. An episode is a subsequence that appear many times in the input sequence.

I think that episode mining is the closest to what you want to do.

I see that you have also some restrictions for your application such that perhaps that you want to only find patterns that are contiguous (don't skip some events in your sequence). Perhaps that the TKE or similar algorithm do not have this option but maybe it could be added.

Another possibility is to split your sequence into multiple sequences and then to apply a sequential pattern mining algorithm to find patterns common to multiple sequences. There are some algorithms that have many parameters like to let you control the gap size, the maximal or minimum length of patterns etc. But I think that splitting your sequence is perhaps not a good idea so perhaps that episode mining is the most suitable for your problem.

> - Is there a way to use the algorithms for
> transaction databases in "single sequences" (I
> don't know if this is the technical name)? Or is
> there a particual algorithm that would be easily
> adaptable to my case? Or perhaps am I looking at
> the inappropriate literature?

I think I answered the question above ;-)

Best regards and merry Christmas & happy new year,

Philippe

Options: ReplyQuote


This forum is powered by Phorum and provided by P. Fournier-Viger (© 2012).
Terms of use.