Re: Scalability of SPMF
Date: May 15, 2014 01:09PM
Hi,
In general, the performance of pattern mining algorithms depends on how many patterns there is in the data for some given parameter values. For example, if minsup = 0, there can be billions of patterns for some datasets and the algorithm will most likely never terminate or run out of memory.
The number of patterns to be output (and the size of the search space) depends on:
- how the parameters are set. . In general the number of patterns increase exponentially when minsup is decreased. If you set minup very high, there may be 0 patterns and the algorithm may terminate immediately. If you set it too low, then there can be billions of patterns and it can become extremely slow or run out of memory or even disk space to store the patterns. So usually, people start from a very high value and then gradually decrease the value until enough patterns are found but not too much
- the length of the transactions (1000 is quite large) . If there are longuer transactions, then there may be longer patterns. For example, if you have patterns of length 20 in your data, then up to 2^20-1 subpaterns may also be frequent.
- the number of items also has some influence
- whether the dataset is sparse or dense. If all the transactions are very similar and very long for example, then there will be a huge amount of patterns. If the transactions are very disimillar, then there will be less patterns.
- ...
There is different ways to improve the performance:
- as i said, starting with a high parameter value, and decreasing slowly until there is enough patterns
- maybe perform some pre-processing on the dataset to remove unecessary items that are not useful for the analysis. Sometimes, modifying the input data will greatly improve the performance.
- some people may do sampling by using only a subset of sequences
- adding some constraints. The more constraints are added, the more the search space can be pruned. For example, if a constraint is added on the length of patterns, it will eliminate a lot of patterns. For sequence patter mining, there is one or two algoriths that offer a length constraint for example. For other algorithm, it would not be too hard to add these kind of constraints or other constraints by modifying the source code.
-...
Edited 1 time(s). Last edit at 05/15/2014 01:14PM by webmasterphilfv.