IMPORTANT: This is the old Data Mining forum.

I keep it online so that you can read the old messages.

algorithms for mining source code tokens

Posted by: **Jerry Li**

Date: April 10, 2014 06:23PM

Dear All,

I am trying to find repeating sets of tokens in a text file that contains a list of token sequences. Each line (one token sequence) holds the tokens of a single tokenized source code file. The whole file combines the tokens of all the source files of a large system such as Linux.

Is there an algorithm suited to this problem? I would greatly appreciate a reply by email or in this forum.

Posted by: **webmasterphilfv**

Date: April 11, 2014 09:28AM

So if I understand correctly, you have a big text file where each line is a list of tokens.

Do you want to find lists of tokens that appear in several lines, or that appear several times in the same line?

Is the order of the tokens important or not?

Edited 1 time(s). Last edit at 04/11/2014 09:28AM by webmasterphilfv.

Posted by: **Jerry Li**

Date: April 11, 2014 08:29PM

Dear Sir, Thank you very much for your reply.

You are right: each line is a token sequence, where the tokens come from one source code file tokenized by the parser.

I want to find all the repeating subsequences (lists of tokens), both within a single line and across several lines.
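
To make the task concrete, here is a minimal Python sketch that assumes "repeating subsequence" means a contiguous run of tokens of a fixed length `n`. The function name `count_ngrams` and the toy input are made up for illustration. It reports, for each run, both the total number of occurrences and the number of lines that contain it.

```python
from collections import Counter

def count_ngrams(lines, n):
    """Count every contiguous run of n tokens.

    Returns (occurrences, line_support):
      occurrences[g]  = total number of times g appears (overlaps included)
      line_support[g] = number of distinct lines that contain g
    """
    occurrences = Counter()
    line_support = Counter()
    for tokens in lines:
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        occurrences.update(grams)        # every occurrence counts
        line_support.update(set(grams))  # each line counts at most once
    return occurrences, line_support

# Hypothetical toy input: one tokenized source file per line.
lines = [
    "int i = 0 ; int j = 0 ;".split(),
    "int k = 0 ;".split(),
]
occ, sup = count_ngrams(lines, 3)
# Keep only the runs that occur at least twice overall.
repeats = {g: (occ[g], sup[g]) for g in occ if occ[g] >= 2}
```

This brute-force count keeps every run in memory, so it is only a way to pin down the definition on small inputs, not something that would scale to a system the size of Linux.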

I think the order is important, but I am not sure. I am also unsure whether I should use frequent sequence mining, time series mining, or some other kind of algorithm.

Looking forward to your reply. Thank you very much.

Posted by: **webmasterphilfv**

Date: April 12, 2014 04:09AM

Hello,

Time series algorithms are for numeric data; sequence algorithms are for symbolic data. I think your data is symbolic, so I don't think time series algorithms are appropriate.

If you want to find subsequences common to several sequences (lines), you can apply a sequential pattern mining algorithm.
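
As a rough illustration of what "common to several sequences" means here, the following Python sketch computes the support of one candidate pattern, i.e. the number of lines that contain its tokens in order (gaps allowed). It is not a mining algorithm such as PrefixSpan; the helper names and the sample lines are invented for the example.

```python
def is_subsequence(pattern, sequence):
    """True if the tokens of pattern occur in sequence in the same order (gaps allowed)."""
    it = iter(sequence)
    # `tok in it` advances the iterator, so each match must come after the previous one.
    return all(tok in it for tok in pattern)

def support(pattern, sequences):
    """Number of sequences (lines) that contain pattern at least once."""
    return sum(is_subsequence(pattern, s) for s in sequences)

# Hypothetical tokenized lines.
sequences = [
    "if ( x ) { return ; }".split(),
    "if ( y ) { a = b ; return ; }".split(),
    "while ( x ) { }".split(),
]
s1 = support(["if", "(", ")", "{", "return"], sequences)
s2 = support(["(", ")", "{"], sequences)
s3 = support([";"], sequences)  # the second line has two ';' but is counted once
```

Note that a line contributes at most 1 to the support, however many times the pattern repeats inside it.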

If you want to find repeating subsequences in a single sequence, you can use an episode mining algorithm, for example.
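
To give an idea of how repetition inside one sequence can be measured, here is a small Python sketch in the spirit of the window-based frequency used in episode mining. It makes two simplifying assumptions: token positions serve as timestamps, and only windows lying fully inside the sequence are counted. The function names and the sample sequence are hypothetical.

```python
def is_subsequence(pattern, sequence):
    """True if the tokens of pattern occur in sequence in the same order (gaps allowed)."""
    it = iter(sequence)
    return all(tok in it for tok in pattern)

def window_frequency(episode, sequence, width):
    """Fraction of sliding windows of `width` tokens that contain `episode` in order."""
    n_windows = len(sequence) - width + 1
    if n_windows <= 0:
        return 0.0
    hits = sum(
        is_subsequence(episode, sequence[i:i + width])
        for i in range(n_windows)
    )
    return hits / n_windows

# Hypothetical single sequence.
seq = "a b c a b d a b c".split()
freq = window_frequency(["a", "b"], seq, 3)  # 5 of the 7 windows contain a ... b
```

An episode would then be called frequent if this fraction reaches a user-chosen threshold.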

There are also several variations of these tasks, such as mining sequential rules. However, I don't know of any algorithm that finds both subsequences common to several sequences AND subsequences that repeat within a sequence. Perhaps one exists, but I'm not aware of it.

If you don't care about the ordering of tokens, you may consider frequent itemset mining, which finds sets of tokens common to several lines without considering their order. Another alternative is association rule mining, if you want a confidence or probability that the items appear together.
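
For the unordered case, here is a minimal Python sketch restricted to pairs of tokens. A real itemset miner handles sets of any size; the function names, the threshold, and the toy lines are assumptions made for the example. Each line is treated as a set, and the confidence of a rule a -> b is computed as support({a, b}) / support({a}).

```python
from collections import Counter
from itertools import combinations

def count_items_and_pairs(lines):
    """Count in how many lines each token, and each unordered pair of tokens, appears."""
    item_count = Counter()
    pair_count = Counter()
    for tokens in lines:
        items = sorted(set(tokens))  # order and repetition within a line are ignored
        item_count.update(items)
        pair_count.update(combinations(items, 2))
    return item_count, pair_count

def confidence(a, b, item_count, pair_count):
    """Confidence of the rule a -> b, assuming a appears in at least one line."""
    pair = tuple(sorted((a, b)))
    return pair_count[pair] / item_count[a]

# Hypothetical lines of tokens.
lines = [
    ["malloc", "free", "if"],
    ["malloc", "if"],
    ["malloc", "free"],
    ["free"],
]
item_count, pair_count = count_items_and_pairs(lines)
min_support = 2
frequent = {p: c for p, c in pair_count.items() if c >= min_support}
conf_if_malloc = confidence("if", "malloc", item_count, pair_count)
conf_malloc_free = confidence("malloc", "free", item_count, pair_count)
```

Here every line that contains `if` also contains `malloc`, so that rule has confidence 1.0, while `malloc -> free` holds in 2 of the 3 lines containing `malloc`.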

Posted by: **Jerry Li**

Date: April 14, 2014 05:08AM

Dear Prof. Fournier-Viger,

Thank you so much for taking the time to reply.

Actually, I have written a Java program that transforms the string tokens into numeric form according to a rule I predefined, so I think time series algorithms could be one way to solve my problem. Right now I am looking for algorithms that can find frequent subsequences in each of these two kinds of token sequence (string and numeric) separately.
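
The exact rule is not important for the question, but as an illustration of such a transformation, here is a minimal sketch (in Python for brevity, not my Java program). It assumes the simplest possible rule: each distinct token gets the next unused integer, in order of first appearance. The function name and the sample lines are made up.

```python
def encode(lines):
    """Replace each token by an integer id assigned in order of first appearance."""
    vocab = {}
    encoded = []
    for tokens in lines:
        # setdefault returns the existing id, or stores and returns the next free one.
        encoded.append([vocab.setdefault(tok, len(vocab)) for tok in tokens])
    return encoded, vocab

# Hypothetical tokenized lines.
lines = [
    ["int", "x", "=", "0", ";"],
    ["int", "y", ";"],
]
encoded, vocab = encode(lines)
# Invert the mapping to recover the original tokens.
inverse = {i: tok for tok, i in vocab.items()}
decoded = [[inverse[i] for i in row] for row in encoded]
```

With this kind of rule the integers are only labels for the tokens, so the original strings can always be recovered.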

Sequential pattern mining is another way to solve my problem. However, I studied the algorithms listed on your website, and almost none of them (PrefixSpan, GSP, SPADE, etc.) can be used for my paper. First, I realized that the order of the items in a subsequence (itemset) is very important, because the items are tokenized source code. Second, a subsequence (itemset) may appear more than once within the same sequence (line or transaction).

I have read two papers that may help me. One is "Discovering Frequent Patterns from Strings" by Jaak Vilo, which provides a way to find frequent substrings in strings. The other is "Periodicity Data Mining in Time Series Using Suffix Arrays" by Konstantinos et al., which provides a way to find repeating patterns in time series. However, I ran into many difficulties when trying to implement the algorithms from these two papers in Java.
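
To show the general suffix-array idea on tokens, here is a naive sketch (in Python for brevity). It is not the algorithm from either paper. It sorts all suffixes of a single token sequence and compares neighbours in that order, since the longest repeated run must be a common prefix of two adjacent suffixes. The function name and the sample input are hypothetical, and the sort is quadratic or worse, so it is only suitable for small inputs.

```python
def longest_repeats(tokens):
    """Return (length, set of longest token runs that occur at least twice)."""
    n = len(tokens)
    # Naive suffix array: suffix start positions sorted by the suffix they start.
    sa = sorted(range(n), key=lambda i: tokens[i:])
    best_len, best = 0, set()
    for a, b in zip(sa, sa[1:]):
        # Length of the common prefix of two suffixes adjacent in sorted order.
        k = 0
        while a + k < n and b + k < n and tokens[a + k] == tokens[b + k]:
            k += 1
        if k > best_len:
            best_len, best = k, {tuple(tokens[a:a + k])}
        elif k == best_len and k > 0:
            best.add(tuple(tokens[a:a + k]))
    return best_len, best

# Hypothetical token sequence.
tokens = "a = b ; c = d ; a = b ;".split()
length, runs = longest_repeats(tokens)
```

The two occurrences found this way may overlap (for example in `a a a`), which may or may not be acceptable depending on the definition of a repeat.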

Do you know of existing algorithms that would be suitable for the problem in my paper? May I email you the text files of both the string and the numeric tokens for your consideration?

Thank you again for your patience and your time.
