Re: algorithms for mining source code tokens
Date: April 12, 2014 04:09AM
Hello,
Time series algorithms are designed for numeric data, whereas sequence mining is for symbolic data. In your case the tokens are symbols, so I don't think that algorithms for time series will be appropriate.
If you want to find subsequences common to several sequences (lines), then you can apply a sequential pattern mining algorithm.
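To make the idea concrete, here is a minimal brute-force sketch in Python. It is not a real sequential pattern mining algorithm (those, such as PrefixSpan, are far more efficient); it only counts, for every subsequence of up to `max_len` tokens, how many lines contain it. The function name, the parameters and the sample lines are made up for illustration, and gaps between tokens are allowed.

```python
from collections import Counter
from itertools import combinations

def frequent_subsequences(sequences, min_support, max_len=3):
    """Return subsequences (gaps allowed) of up to max_len tokens that
    occur in at least min_support of the input sequences."""
    support = Counter()
    for seq in sequences:
        seen = set()  # a pattern counts at most once per sequence
        for length in range(1, max_len + 1):
            # Index combinations are increasing, so token order is preserved.
            for idx in combinations(range(len(seq)), length):
                seen.add(tuple(seq[i] for i in idx))
        support.update(seen)
    return {p: c for p, c in support.items() if c >= min_support}

# Hypothetical tokenized source lines.
lines = [
    ["int", "x", "=", "0", ";"],
    ["int", "y", "=", "1", ";"],
    ["x", "=", "y", ";"],
]
patterns = frequent_subsequences(lines, min_support=3, max_len=3)
print(patterns)
```

With these three lines, only `=`, `;` and the ordered pair `=` followed by `;` appear in every line. The enumeration grows very quickly with line length, so this is only usable on short lines.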
If you want to find repeating subsequences in a single sequence, then you can use an episode mining algorithm, for example.
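A much simplified sketch of that second task, assuming you only care about contiguous repeats: count the n-grams of a single token sequence and keep those that occur more than once. Real episode mining algorithms are more general (they typically allow gaps inside a sliding window), so treat this as an approximation. The function name and the sample tokens are invented for the example.

```python
from collections import Counter

def repeated_ngrams(tokens, n, min_count=2):
    """Return contiguous n-grams that occur at least min_count times
    in a single token sequence."""
    grams = Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
    return {g: c for g, c in grams.items() if c >= min_count}

# Hypothetical token stream of one file.
tokens = ["a", "=", "b", ";", "c", "=", "d", ";", "a", "=", "b", ";"]
repeats = repeated_ngrams(tokens, n=2)
print(repeats)
```

Here the bigrams `a =`, `= b` and `b ;` each occur twice; every other bigram occurs once and is dropped.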
There are also several variations of these tasks, such as mining sequential rules. However, I don't know of any algorithm that finds both subsequences common to several sequences AND repeating subsequences within a single sequence. Perhaps one exists, but I'm not aware of it.
If you don't care about the ordering of tokens, then you may consider frequent itemset mining, which finds sets of tokens that are common to several lines without considering their order. Another alternative is association rule mining, if you want a confidence value, that is, an estimate of the probability that the items appear together.
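For the unordered case, a small sketch of both ideas: each line is treated as a set of tokens, itemsets up to a fixed size are counted by brute force (an actual miner such as Apriori or FP-Growth would prune the search), and the confidence of a rule A -> B is computed as support(A and B together) / support(A). All names and the sample data are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=2):
    """Return token sets of up to max_size items that occur in at least
    min_support transactions (token order is ignored)."""
    support = Counter()
    for t in transactions:
        items = sorted(set(t))  # drop duplicates and ordering
        for size in range(1, max_size + 1):
            support.update(frozenset(c) for c in combinations(items, size))
    return {s: c for s, c in support.items() if c >= min_support}

def confidence(itemsets, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent.
    Both the antecedent and the combined itemset must be frequent."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    return itemsets[both] / itemsets[a]

# Hypothetical tokenized source lines.
lines = [
    ["int", "x", "=", "0", ";"],
    ["int", "y", "=", "1", ";"],
    ["x", "=", "y", ";"],
]
itemsets = frequent_itemsets(lines, min_support=2)
conf = confidence(itemsets, {"int"}, {"="})
print(itemsets[frozenset({"=", ";"})], conf)
```

In this toy data every line containing `int` also contains `=`, so the rule `int -> =` has confidence 1.0, while the reverse rule `= -> int` would only reach 2/3.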