Scalability of SPMF

Posted by: **Frederic Michaud**

Date: May 15, 2014 12:33PM

Hi,

I'm currently testing SPMF to analyze data from malware analysis, especially API call sequences. To start simple, I tried to mine frequent itemsets using the FP-Growth algorithm.

My dataset has 986 transactions, each transaction has between 50 and 1000 items and I have about 40 000 different items. The dataset file is 5.9 MB.

The algorithm has been running for 40 minutes and I wonder if it will eventually terminate. Is my dataset way too big for this algorithm or the SPMF implementation?

Thanks,

Frederic


Posted by: **webmasterphilfv**

Date: May 15, 2014 01:09PM

Hi,

In general, the performance of pattern mining algorithms depends on how many patterns there are in the data for the given parameter values. For example, if minsup = 0, there can be billions of patterns for some datasets, and the algorithm will most likely never terminate or will run out of memory.

The number of patterns to be output (and the size of the search space) depends on:

- how the parameters are set. In general, the number of patterns increases exponentially as minsup is decreased. If you set minsup very high, there may be 0 patterns and the algorithm may terminate almost immediately. If you set it too low, there can be billions of patterns, and the algorithm can become extremely slow, run out of memory, or even run out of disk space to store the patterns. So usually, people start from a very high value and gradually decrease it until enough patterns are found, but not too many

- the length of the transactions (1000 is quite large). If the transactions are longer, then there may be longer patterns. For example, if a pattern of length 20 is frequent in your data, then up to 2^20 - 1 of its subpatterns may also be frequent.

- the number of distinct items also has some influence

- whether the dataset is sparse or dense. If all the transactions are very similar and very long, for example, then there will be a huge number of patterns. If the transactions are very dissimilar, then there will be fewer patterns.

- ...
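As a concrete illustration of the second point above, the number of subpatterns implied by one long frequent pattern can be counted directly. This is a self-contained sketch (plain Java, no SPMF needed): by the anti-monotonicity of support, every non-empty subset of a frequent itemset is itself frequent, so a single frequent pattern of length L accounts for 2^L - 1 frequent subpatterns.

```java
// Counts the non-empty subsets of a frequent pattern of a given length.
// By anti-monotonicity of support, all of them are frequent too, which
// is why long transactions can make the output explode.
public class SubpatternCount {

    // number of non-empty subsets of a pattern containing `length` items
    static long subpatternCount(int length) {
        return (1L << length) - 1;
    }

    public static void main(String[] args) {
        for (int len : new int[] {10, 20, 30}) {
            System.out.println("length " + len + " -> "
                    + subpatternCount(len) + " subpatterns");
        }
        // length 20 alone already implies 2^20 - 1 = 1,048,575 subpatterns
    }
}
```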

There are different ways to improve the performance:

- as I said, start with a high parameter value and decrease it slowly until there are enough patterns

- maybe perform some pre-processing on the dataset to remove unnecessary items that are not useful for the analysis. Sometimes, modifying the input data will greatly improve the performance.

- some people do sampling, i.e., use only a subset of the sequences

- adding some constraints. The more constraints are added, the more the search space can be pruned. For example, a constraint on the length of patterns will eliminate many patterns. For sequential pattern mining, there are one or two algorithms that offer a length constraint, for example. For other algorithms, it would not be too hard to add this kind of constraint, or other constraints, by modifying the source code.

- ...
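The "start high, decrease slowly" strategy from the first bullet can be sketched as a simple search loop. This is a hypothetical sketch: `mineAt` stands in for an actual call to a mining algorithm (e.g. running FP-Growth and counting the patterns in the output file); the names, step size, and toy miner are illustrative, not part of SPMF.

```java
import java.util.function.IntUnaryOperator;

// Sketch of the "start with a high minsup and decrease" strategy.
// `mineAt` takes a relative minsup in percent and returns how many
// patterns the mining run produced at that threshold.
public class MinsupSearch {

    // returns the highest minsup (as a fraction) whose pattern count
    // falls in [minPatterns, maxPatterns], or -1 if none was found
    static double findMinsup(IntUnaryOperator mineAt,
                             int minPatterns, int maxPatterns) {
        for (int pct = 100; pct >= 1; pct--) {
            int count = mineAt.applyAsInt(pct);
            if (count > maxPatterns) break;   // too low already; stop before it explodes
            if (count >= minPatterns) return pct / 100.0;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Toy stand-in for a real miner: the pattern count roughly
        // doubles for every 5% decrease of minsup.
        IntUnaryOperator toyMiner = pct -> (int) Math.pow(2, (100 - pct) / 5);
        System.out.println(findMinsup(toyMiner, 100, 10_000));
    }
}
```

In practice each call to `mineAt` would be one full mining run, so a coarser step (e.g. 5% or 10%) keeps the search cheap.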

Edited 1 time(s). Last edit at 05/15/2014 01:14PM by webmasterphilfv.


Posted by: **Frederic Michaud**

Date: May 16, 2014 03:49AM

Thank you Philippe for your excellent explanation. Well, after 3 hours the algorithm did terminate and the result file was 420 GB! SPMF never used more than about 500 MB of memory, which is impressive considering the size of the result file.

The support was set at 0.7, but clearly I will have to preprocess the dataset to remove less interesting items and the biggest transactions.
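A pre-processing pass like the one Philippe suggested can be quite small. This is a hypothetical sketch (names are illustrative, not SPMF API): it drops items that appear in almost every transaction, since with minsup = 0.7 those near-universal items generate enormous numbers of trivial patterns, and it discards transactions that remain too long.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of a pre-processing step before frequent itemset mining:
// remove items occurring in more than maxItemFreq of the transactions,
// then drop transactions that are still longer than maxLen.
public class Preprocess {

    static List<List<Integer>> filter(List<List<Integer>> db,
                                      double maxItemFreq, int maxLen) {
        // count in how many transactions each item appears
        Map<Integer, Integer> docFreq = new HashMap<>();
        for (List<Integer> t : db)
            for (int item : new HashSet<>(t))
                docFreq.merge(item, 1, Integer::sum);

        int limit = (int) (maxItemFreq * db.size());
        List<List<Integer>> out = new ArrayList<>();
        for (List<Integer> t : db) {
            List<Integer> kept = t.stream()
                    .filter(i -> docFreq.get(i) <= limit)
                    .collect(Collectors.toList());
            if (!kept.isEmpty() && kept.size() <= maxLen)
                out.add(kept);
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Integer>> db = List.of(
                List.of(1, 2, 3), List.of(1, 2), List.of(1, 4));
        // item 1 appears in all 3 transactions and gets removed
        System.out.println(filter(db, 0.9, 10));
    }
}
```

The same idea applies before sequential pattern mining; only the transaction representation changes.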

Next: sequence pattern mining.

Frederic
