The Data Mining Forum
IMPORTANT: This is the old Data Mining forum.
I keep it online so that you can read the old messages.

Please post your new messages in the new forum: https://forum2.philippe-fournier-viger.com/index.php
 
Is this data too large for mining?
Posted by: doubleL
Date: November 04, 2014 01:42AM

Hi everybody, I have a sequential database converted from a smart home database. Each item represents the state of a sensor. The file is here:

https://www.dropbox.com/s/6rc0ca6fe5pc8dg/check2.txt?dl=0

I tried several algorithms, such as BIDE and CloSpan, but when I set the support below 95%, those algorithms just run for hours and then give me a "java.lang.OutOfMemoryError: GC overhead limit exceeded".
It seems that each sequence is too long for the mining algorithms. If so, what is the ideal sequence length?

Re: Is this data too large for mining?
Date: November 04, 2014 06:53AM

Hi,

When these algorithms run out of memory, it is because the search space is too large.

There can be several reasons:
- minsup is set too low.
- If the sequences are long, very long patterns are more likely, and thus the search space is larger.
- If sequences are very similar, there will be a lot of patterns to consider.
- If the number of distinct items is small, sequences are more likely to be similar, and there will be more patterns.
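
To make the point about similar sequences concrete, here is a small Python sketch. It is only an illustration under stated assumptions: it is not part of SPMF and it is not BIDE, CloSpan or any other algorithm named in this thread. It assumes each itemset holds a single item, treats minsup as 100% (a pattern must occur in every sequence), and the function names `is_subsequence` and `patterns_in_all` are made up for the example.

```python
from itertools import combinations


def is_subsequence(pattern, sequence):
    """True if the items of `pattern` appear in `sequence` in the same order."""
    remaining = iter(sequence)
    return all(item in remaining for item in pattern)


def patterns_in_all(database):
    """Distinct non-empty patterns that occur in every sequence (minsup = 100%).

    Any such pattern must be a subsequence of the first sequence, so the
    candidates are enumerated from it by brute force (toy sizes only).
    """
    first = database[0]
    candidates = {
        tuple(first[i] for i in positions)
        for length in range(1, len(first) + 1)
        for positions in combinations(range(len(first)), length)
    }
    return {
        p for p in candidates
        if all(is_subsequence(p, seq) for seq in database[1:])
    }


# Same items and same lengths; only the ordering of one sequence differs.
similar = ["abcdefgh", "abcdefgh", "abcdefgh"]
dissimilar = ["abcdefgh", "hgfedcba", "abcdefgh"]

print(len(patterns_in_all(similar)))     # 255 = 2^8 - 1
print(len(patterns_in_all(dissimilar)))  # 8 (only the single items)
```

With identical sequences, every subsequence is a pattern. When one sequence is reversed, no pattern of two or more items survives, even though the sequences have the same length and the same items.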

In general, there is no maximum length that these algorithms can support. It only depends on your data and its characteristics, as explained above. The real problem is not the length of the sequences, but how many patterns there are and the size of the search space. If there are too many patterns and the search space is too large, the algorithms become slower, use more memory, and may eventually run out of memory.

Even with a single sequence:

(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)(l)(m)(n)(o)(p)(q)(r)(s)(t)(u)(v)

if you set minsup = 0, the algorithms could fail to terminate because there would be (2^22)-1 patterns, that is, more than 4 million patterns from one sequence of 22 items.
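
The count for this single sequence can be checked with a few lines of Python. This is only a sketch: it assumes, as in the example sequence, that all items are distinct and that each itemset holds one item, so every non-empty choice of positions gives a different pattern. The function names are made up for the illustration.

```python
from itertools import combinations


def count_by_formula(n):
    """Number of non-empty subsequences of a sequence of n distinct items."""
    return 2 ** n - 1


def count_by_enumeration(sequence):
    """Same count, obtained by listing every subsequence (small inputs only)."""
    return sum(
        1
        for length in range(1, len(sequence) + 1)
        for _ in combinations(sequence, length)
    )


# Check the formula against brute force on a short sequence.
small = list("abcde")
assert count_by_enumeration(small) == count_by_formula(len(small))

# The sequence from the message: (a)(b)...(v)
letters = [chr(code) for code in range(ord("a"), ord("v") + 1)]
print(len(letters))                    # 22
print(count_by_formula(len(letters)))  # 4194303
```

Each extra item doubles the count, which is why long sequences become a problem so quickly when minsup is low.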

To reduce the size of the search space and solve your problem, there are a few solutions:
- One of the simplest solutions is to add some constraints. For example, you can use the SPAM algorithm to set a maximum pattern length of 4. This means that only patterns having no more than 4 items will be found. Using this constraint and your input file, if I set minsup = 0.5, then I find about 10,000 patterns. I think that the problem is that patterns are too long in your file. And you probably don't need very long patterns anyway.
- You could use other types of constraints, for example by applying the Hirate-Yamana algorithm, which lets you specify gap constraints. In general, the more constraints you apply, the fewer patterns will be found and the faster the mining will be.
- Alternatively, you could pre-process your file to reduce the length of the sequences, remove irrelevant items that you don't care about, or define your sequences differently.
- You could run the algorithms with more memory. But in your case, the problem is not really the amount of memory. It is that there are too many patterns. So I think that adding constraints is the solution.
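
The effect of a maximum pattern length can be seen on a toy example. The Python sketch below is a brute-force, level-by-level search written only for illustration. It is not SPAM and not SPMF code, it assumes one item per itemset, it uses an absolute minsup (a number of sequences) rather than a ratio, and the names `mine` and `max_length` are hypothetical.

```python
def is_subsequence(pattern, sequence):
    """True if the items of `pattern` appear in `sequence` in the same order."""
    remaining = iter(sequence)
    return all(item in remaining for item in pattern)


def mine(database, minsup, max_length=None):
    """Brute-force, level-by-level search for frequent sequential patterns.

    `minsup` is an absolute count of sequences; `max_length` (hypothetical
    parameter name) caps the number of items in a pattern.
    """
    def support(pattern):
        return sum(is_subsequence(pattern, seq) for seq in database)

    items = sorted({item for seq in database for item in seq})
    frequent_items = [i for i in items if support((i,)) >= minsup]

    found = {}
    level = [(i,) for i in frequent_items]  # patterns with one item
    while level:
        for pattern in level:
            found[pattern] = support(pattern)
        if max_length is not None and len(level[0]) >= max_length:
            break  # the constraint prunes every longer pattern
        # Grow each frequent pattern by one item and keep the frequent results.
        level = [
            p + (i,)
            for p in level
            for i in frequent_items
            if support(p + (i,)) >= minsup
        ]
    return found


# Two identical sequences of 10 distinct items.
database = [list("abcdefghij"), list("abcdefghij")]

print(len(mine(database, minsup=2)))                # 1023 = 2^10 - 1
print(len(mine(database, minsup=2, max_length=4)))  # 385
```

On this toy database, the length limit removes about two thirds of the patterns. The gap grows quickly with sequence length, since the unconstrained count doubles with every additional item.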

Hope that this helps.

Best,

Philippe



Edited 1 time(s). Last edit at 11/04/2014 07:06AM by webmasterphilfv.

Re: Is this data too large for mining?
Posted by: doubleL
Date: November 04, 2014 07:00AM

Hi Philippe, thanks for your very helpful reply. I will try your suggestions and post an update if I get new results.

Re: Is this data too large for mining?
Date: November 04, 2014 07:07AM

Hi, you are welcome. Just to let you know, I have updated the message above to make it clearer.

Best.



This forum is powered by Phorum and provided by P. Fournier-Viger (© 2012).