The Data Mining Forum
IMPORTANT: This is the old Data Mining forum.
I keep it online so that you can read the old messages.

Please post your new messages in the new forum: https://forum2.philippe-fournier-viger.com/index.php
 
How to generate transaction dataset from program source code?
Posted by: xin
Date: October 12, 2011 12:33AM

I want to mine program source code to get information about programming patterns. Could anyone help me with how I can transform program source code (C++/C#) into itemsets suitable for applying data mining algorithms (association rule mining / sequential pattern mining)?

Re: How to generate transaction dataset from program source code?
Date: October 12, 2011 03:35AM

Hello Xin,

I think that you first need to determine your goal: what kind of patterns do you want to discover, and for what purpose? Setting your goal should help you choose an appropriate algorithm.

If you decide to use association rule mining, then you need to convert the source code into a transaction database.

If you decide to use sequential pattern mining, then you need to convert the source code to a sequence database.

In both cases, you will need to decide what the items represent. For source code, the items could be the names of the methods/functions that are called. A transaction could be the scope of a method/function, so that each transaction contains the functions called inside one function body. This would allow you to find patterns of methods/functions that are often called together, which could give you ideas about how to improve your source code. For example, if you detect that a() and b() are often called together, you could decide to modify the source code to merge a() and b().
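
As a rough illustration of the idea of functions that are "often called together", the following Python sketch counts in how many transactions each pair of function names appears. The transactions and the names a, b, c, d are made up for the example, and plain pair counting only stands in for a real association rule mining algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: each set holds the names of the functions
# called inside one function body.
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "b", "d"},
    {"c", "d"},
]

def pair_support(transactions):
    """Count in how many transactions each unordered pair of items appears."""
    counts = Counter()
    for t in transactions:
        # Sorting gives every pair a canonical (smaller, larger) order.
        for pair in combinations(sorted(t), 2):
            counts[pair] += 1
    return counts

support = pair_support(transactions)
print(support[("a", "b")])  # a() and b() appear together in 3 of 4 transactions
```

A pair with high support, such as (a, b) here, would be a candidate for a closer look.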

In my opinion, you should start with a simple approach and then refine it step by step.

Hope this helps,

Phil




Re: How to generate transaction dataset from program source code?
Posted by: xin
Date: October 12, 2011 07:59PM

Hi Phil
First, I want to find functions that are used together and correlations between variable accesses. The purpose is to find locations where elements that are frequently used together are missing, in order to identify potential bug locations. In this case the scope is the function, and an entire function should map to a transaction.
Second, I want to identify code clones, in order to find bugs related to copy-pasted code and opportunities for code optimization. In this case the scope is the basic block, which is a sequence of consecutive statements. Hence I can obtain multiple sequences to which sequential pattern mining can be applied.

The problem in both cases is how I should map functions / basic blocks to transactions / sequences, because both kinds of algorithms take numbers as input and I do not know how to transform a function into meaningful numbers. If you could give an example by mapping a small function, that would be wonderful.

Thanks

Re: How to generate transaction dataset from program source code?
Date: October 14, 2011 03:04AM

Hello Xin,

I think that you can just assign a unique number to each function name.

For example if you have some code like this:


    void functionA() {
        int a, b = 5;
        int c = 6;

        functionA(a);
        functionB();

        functionC();
        functionD();
    }

You could convert this to a transaction like this:

1 2 3 4

where 1 = functionA, 2 = functionB, 3 = functionC, 4 = functionD

You can create the mapping between numbers and functions dynamically. The actual numbers are not important. What is important is that each function name gets its own unique number and that the same name always maps to the same number.
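
The dynamic mapping described above could be sketched in Python as follows. This is only a minimal illustration: it assumes the names of the called functions have already been extracted from the function body (for example with a parser, which is not shown), and to_transaction and ids are names chosen for this example.

```python
def to_transaction(calls, ids):
    """Map called function names to integer items.

    `ids` is a dict shared across all functions; a new number is assigned
    the first time a name is seen, so the same name always gets the same id.
    """
    items = set()  # a transaction is a set: repeated calls count once
    for name in calls:
        if name not in ids:
            ids[name] = len(ids) + 1
        items.add(ids[name])
    return sorted(items)

ids = {}
# Calls found in the body of the example function above (assumed pre-extracted).
calls = ["functionA", "functionB", "functionC", "functionD"]
transaction = to_transaction(calls, ids)
print(" ".join(map(str, transaction)))  # prints: 1 2 3 4
```

Reusing the same ids dictionary for every function in the program keeps the numbering consistent across the whole transaction database.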

For sequences, you could do something similar. For example, consider this function:

    void functionA() {
        int a, b = 5;
        int c = 6;

        functionA(a);
        functionB();

        functionC();
        functionD();
        functionA();
        functionC();
    }

This could be translated as a sequence such as:

1, 2, 3, 4, 1, 3

where 1 = functionA, 2 = functionB, 3 = functionC, 4 = functionD
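
For sequences, the only difference from a transaction is that the order of the calls and repeated calls are kept. A minimal Python sketch, again assuming the call names were already extracted in source order (to_sequence and ids are names chosen for this example):

```python
def to_sequence(calls, ids):
    """Map call names to integers, keeping call order and repetitions."""
    sequence = []
    for name in calls:
        # setdefault assigns the next free number the first time a name appears.
        sequence.append(ids.setdefault(name, len(ids) + 1))
    return sequence

ids = {}
calls = ["functionA", "functionB", "functionC",
         "functionD", "functionA", "functionC"]
sequence = to_sequence(calls, ids)
print(", ".join(map(str, sequence)))  # prints: 1, 2, 3, 4, 1, 3
```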


For sequences, you could also use the basic blocks of a function to group some items together into itemsets. For example, consider this function:

    void functionA() {
        int a, b = 5;
        int c = 6;

        for (...) {
            functionA(a);
            functionB();
        }

        functionC();
        functionD();

        while (...) {
            functionA();
            functionC();
        }
    }


This could be translated as a sequence such as:

(1 2), 3, 4, (1 3)

where 1 = functionA, 2 = functionB, 3 = functionC, 4 = functionD
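
The grouping by basic block could be sketched like this in Python. The sketch assumes the function body was already split into groups of call names, one group per loop body and one per stand-alone call, which is an assumption about the preprocessing and not something a tool is shown to do here. The names to_itemset_sequence, blocks and ids are chosen for this example.

```python
def to_itemset_sequence(blocks, ids):
    """Turn a list of blocks (lists of call names) into a sequence of itemsets."""
    sequence = []
    for block in blocks:
        itemset = set()  # calls in the same block form one unordered itemset
        for name in block:
            itemset.add(ids.setdefault(name, len(ids) + 1))
        sequence.append(sorted(itemset))
    return sequence

ids = {}
blocks = [
    ["functionA", "functionB"],  # body of the for loop
    ["functionC"],               # stand-alone call
    ["functionD"],               # stand-alone call
    ["functionA", "functionC"],  # body of the while loop
]
sequence = to_itemset_sequence(blocks, ids)
print(sequence)  # prints: [[1, 2], [3], [4], [1, 3]]
```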


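I wrote the sequences like this just to give you some examples. If you use the SPMF source code on my website for sequential pattern mining, the input format for the last sequence would be as follows, where -1 marks the end of an itemset and -2 marks the end of the sequence: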

1 2 -1 3 -1 4 -1 1 3 -1 -2
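
Writing a line in this format from a sequence of itemsets is straightforward. Here is a small Python sketch, where to_spmf_line is a helper name made up for this example and is not part of SPMF:

```python
def to_spmf_line(sequence):
    """Format a sequence of itemsets as one line of an SPMF sequence file.

    Items of an itemset are separated by spaces, -1 ends each itemset
    and -2 ends the sequence.
    """
    parts = []
    for itemset in sequence:
        parts.extend(str(item) for item in sorted(itemset))
        parts.append("-1")
    parts.append("-2")
    return " ".join(parts)

sequence = [[1, 2], [3], [4], [1, 3]]
line = to_spmf_line(sequence)
print(line)  # prints: 1 2 -1 3 -1 4 -1 1 3 -1 -2
```

Each function of the program would then give one such line in the input file.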

Hope this helps,

Phil

Re: How to generate transaction dataset from program source code?
Posted by: xin
Date: October 14, 2011 10:16PM

Hello phil,

Thanks, it helps me a lot.



This forum is powered by Phorum and provided by P. Fournier-Viger (© 2012).