Re: Is this data too large for mining?
Date: November 04, 2014 06:53AM
Hi,
When these algorithms run out of memory, it is because the search space is too large.
There can be several reasons:
- minsup is set too low,
- If the sequences are long, very long patterns are more likely, and thus the search space is larger.
- If the sequences are very similar, there will be a lot of patterns to consider.
- If the number of distinct items is small, sequences are more likely to be similar, and there will be more patterns.
In general, there is no maximum length that these algorithms can support. It only depends on your data and its characteristics, as said above. The real problem is not the length of the sequences, but how many patterns there are and the size of the search space. If there are too many patterns and the search space is too large, the algorithms become slower, use more memory, and may run out of memory or storage space.
Even with a single sequence such as:
(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)(l)(m)(n)(o)(p)(q)(r)(s)(t)(u)(v)
if you set minsup = 0, the algorithms could fail to terminate, because there would be (2^22)-1 patterns (about 4.2 million).
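To make the arithmetic concrete, here is a small Python sketch that counts the patterns for this one sequence, with and without a cap on pattern length. It assumes that every itemset holds a single distinct item, as in the example, so each non-empty subset of positions gives exactly one pattern. The helper name count_up_to is only for illustration, and this is plain counting, not how the mining algorithms work internally.

```python
from itertools import combinations
from math import comb

# The single sequence from the example: 22 itemsets, one distinct item each.
sequence = list("abcdefghijklmnopqrstuv")
n = len(sequence)

# Every non-empty subset of positions, read in order, is a distinct subsequence.
total_patterns = 2 ** n - 1


def count_up_to(n_items, max_len):
    """Number of non-empty subsequences having at most max_len items."""
    return sum(comb(n_items, k) for k in range(1, max_len + 1))


# Same sequence, but keeping only patterns of at most 4 items.
capped = count_up_to(n, 4)

# Sanity check of the 2^n - 1 formula by explicit enumeration on a short prefix.
small = sequence[:5]
enumerated = sum(
    1 for k in range(1, len(small) + 1) for _ in combinations(small, k)
)
assert enumerated == 2 ** len(small) - 1

print(total_patterns)  # 4194303
print(capped)          # 9108
```

For this toy sequence, a maximum length of 4 leaves 9,108 candidate patterns instead of about 4.2 million, which is why a length constraint shrinks the search space so much.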
To reduce the size of the search space and solve your problem, there are a few solutions:
- One of the simplest solutions is to add some constraints. For example, you can use the SPAM algorithm and set a maximum pattern length of 4. This means that only patterns having no more than 4 items will be found. Using this constraint and your input file, if I set minsup = 0.5, I find about 10,000 patterns. I think that the problem is that the patterns in your files are too long. And you probably don't need very long patterns anyway.
- You could use other types of constraints, for example by applying the Hirate-Yamana algorithm, which lets you specify gap constraints. In general, the more constraints you apply, the fewer patterns will be found and the faster the mining will be.
- Alternatively, you could perform some pre-processing on your file to reduce the length of the sequences, remove items that you don't care about, or redefine your sequences differently.
- You could run the algorithms with more memory. But in your case, the problem is not really the amount of memory. It is that there are too many patterns. So I think that adding constraints is the solution.
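As an illustration of the pre-processing option, here is a minimal Python sketch. It assumes the sequences are already loaded in memory as lists of itemsets, each itemset being a list of items. The function names drop_items and truncate and the sample data are made up for this example, and reading or writing your actual input file is left out.

```python
def drop_items(sequences, irrelevant):
    """Remove unwanted items, then drop itemsets and sequences left empty."""
    cleaned = []
    for seq in sequences:
        new_seq = []
        for itemset in seq:
            kept = [item for item in itemset if item not in irrelevant]
            if kept:
                new_seq.append(kept)
        if new_seq:
            cleaned.append(new_seq)
    return cleaned


def truncate(sequences, max_itemsets):
    """Keep only the first max_itemsets itemsets of each sequence."""
    return [seq[:max_itemsets] for seq in sequences]


# Hypothetical data: "x" stands for an item that is not of interest.
data = [
    [["a"], ["x"], ["b", "x"], ["c"]],
    [["x"], ["x"]],
    [["a"], ["b"], ["c"], ["d"], ["e"]],
]

cleaned = truncate(drop_items(data, {"x"}), 3)
print(cleaned)
# [[['a'], ['b'], ['c']], [['a'], ['b'], ['c']]]
```

Note that dropping sequences that become empty changes the number of sequences, so a minsup given as a percentage would then be computed over fewer sequences. Whether to keep or drop them depends on what you want the support to mean.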
Hope that this helps.
Best,
Philippe
Edited 1 time(s). Last edit at 11/04/2014 07:06AM by webmasterphilfv.