Right approach for problem unclear

Posted by: Gilbert

Date: July 08, 2020 12:39PM

Hi,

First of all, thank you for the great tool and the effort behind it.
I am struggling to find the right approach to the following problem:

I have sequences of nodes that represent applications/tools used by a (human) user to identify characteristics of a "target computer". About 250 tools are available in total.

Each sequence shows the past use of the tools, e.g. A->B->C->D (first use tool A, then B...).

Each tool may or may not yield "valuable" information, meaning that it adds attributes to the target. Example:
After tool A, you get "operating system = Linux", tool B adds "Ubuntu", tool C adds nothing, tool D adds "SSH Version 3.4" etc.
Some tools use the gained attributes as input, others do not.

I want to create a "prediction" / recommendation for which tool to use next based on the previous tool use in the user's current sequence, and a repository of previously observed sequences (tools + gained attributes).

Which approach would be smart here? Frequent pattern mining (rank / compare to antecedent), association rule mining, HMMs? I am slightly overwhelmed with the many potential routes. Thank you!

Re: Right approach for problem unclear

Posted by: webmasterphilfv

Date: July 17, 2020 01:12AM

Hi,

Sorry for the long delay in answering. My schedule has been very busy this week. I tried to answer you earlier, but I lost the message and had to write it again.

Happy that the software is useful. And welcome to the forum.

I think that many algorithms could be applied. The choice depends a bit on what you want to do and on how you prepare the data.

I think that the basic data you have is a set of sequences. You could use the sequence database format, as used by sequential rule mining and sequential pattern mining algorithms.

In SPMF a sequence is an ordered list of itemsets (sets of items), where items in an itemset are considered to appear simultaneously.

So in your case, a sequence could be like this:

<(Tool1), (Tool2, Windows), (Tool3, SSHPortOpen, HTTPPortOpen)>

which means that Tool1 was applied, then Tool2 was applied and we found that the operating system is Windows, and then Tool3 was applied and we found that two ports are open.

To encode this in SPMF for sequential pattern mining, that sequence would look like this:

1 -1 2 3 -1 4 5 6 -1 -2

where 1 would represent Tool1, 2 would represent Tool2, 3 would represent Windows, 4 would represent Tool3, 5 would represent SSHPortOpen and 6 would represent HTTPPortOpen. -1 is a separator that ends each itemset and -2 indicates the end of the sequence.
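If it helps, here is a minimal Python sketch of that encoding step. It is only an illustration: the function name `encode_sequences` and the choice to number items in order of first appearance are my assumptions, not something provided by SPMF. It writes the items of each itemset in ascending order, followed by -1, and ends each sequence with -2.

```python
def encode_sequences(sequences):
    """Turn sequences of itemsets (lists of item names) into SPMF-style lines.

    Hypothetical helper, not part of SPMF. Returns (lines, item_ids).
    """
    item_ids = {}  # item name -> positive integer, in order of first appearance
    lines = []
    for sequence in sequences:
        tokens = []
        for itemset in sequence:
            # Look up each item's id, assigning the next free integer if it is new.
            ids = [item_ids.setdefault(item, len(item_ids) + 1) for item in itemset]
            # Items inside one itemset are written in ascending order, no duplicates.
            tokens.extend(str(i) for i in sorted(set(ids)))
            tokens.append("-1")  # end of itemset
        tokens.append("-2")      # end of sequence
        lines.append(" ".join(tokens))
    return lines, item_ids


sequences = [
    [["Tool1"], ["Tool2", "Windows"], ["Tool3", "SSHPortOpen", "HTTPPortOpen"]],
]
lines, item_ids = encode_sequences(sequences)
print(lines[0])   # 1 -1 2 3 -1 4 5 6 -1 -2
print(item_ids)   # keep this mapping to translate mined patterns back to names
```

You would write one such line per observed sequence into the input file, and keep the mapping so that the integers in the results can be turned back into tool and attribute names.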

So you could have an input file with many sequences like that, and then apply a sequential pattern mining algorithm like TKS to find frequent sequences of tools that people use:

<(Tool1), (Tool2, Windows)> for example
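To make "frequent" concrete, the sketch below counts the support of one given pattern, that is, the number of sequences that contain it in order. This is not how TKS works internally (TKS searches for the top-k patterns by itself); it is just a small check under my own assumptions, with hypothetical function names and a made-up toy database that uses the integer codes from the encoding above.

```python
def contains(sequence, pattern):
    """True if the itemsets of `pattern` appear, in order, inside `sequence`.

    Each pattern itemset must be a subset of a distinct, later itemset.
    """
    pos = 0
    for wanted in pattern:
        # Greedily move to the first remaining itemset that includes `wanted`.
        while pos < len(sequence) and not wanted <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern itemset must match strictly later
    return True


def support(database, pattern):
    """Number of sequences in `database` that contain `pattern`."""
    return sum(contains(seq, pattern) for seq in database)


# Toy database (made up): 1=Tool1, 2=Tool2, 3=Windows, 4=Tool3, 5/6=open ports
database = [
    [{1}, {2, 3}, {4, 5, 6}],
    [{1}, {2, 3}],
    [{2, 3}, {1}],
    [{1}, {4}],
]
print(support(database, [{1}, {2, 3}]))  # 2
```

The third sequence does not count because Tool1 comes after (Tool2, Windows) there, which is exactly the ordering information that sequential pattern mining keeps.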

Or you could apply a sequential rule mining algorithm to find rules like this:

(Tool1) --> (Tool2, Windows)   support: 24   confidence: 60%
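As a rough illustration of how such numbers can be read, here is a small Python sketch. I am assuming the common definition where a rule X --> Y occurs in a sequence if all items of X appear before all items of Y, and where the confidence is the number of sequences containing the rule divided by the number of sequences containing X. The function names and the toy database are hypothetical, and a real sequential rule mining algorithm is far more efficient than this.

```python
def rule_occurs(sequence, antecedent, consequent):
    """True if every item of `antecedent` is seen before every item of `consequent`."""
    seen = set()
    for i, itemset in enumerate(sequence):
        seen |= itemset
        if antecedent <= seen:
            # Earliest point where the antecedent is complete: check what follows.
            later = set().union(*sequence[i + 1:])
            return consequent <= later
    return False


def rule_stats(database, antecedent, consequent):
    """Return (support, confidence) of antecedent --> consequent."""
    with_antecedent = sum(antecedent <= set().union(*seq) for seq in database)
    with_rule = sum(rule_occurs(seq, antecedent, consequent) for seq in database)
    confidence = with_rule / with_antecedent if with_antecedent else 0.0
    return with_rule, confidence


# Toy database (made up): 1=Tool1, 2=Tool2, 3=Windows, 4=Tool3, 5/6=open ports
database = [
    [{1}, {2, 3}, {4, 5, 6}],
    [{1}, {2, 3}],
    [{2, 3}, {1}],
    [{1}, {4}],
]
print(rule_stats(database, {1}, {2, 3}))  # (2, 0.5)
```

Here Tool1 appears in all four sequences, but it is followed by (Tool2, Windows) in only two of them, so the rule has support 2 and confidence 50%.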

Or you could apply one of the sequence prediction algorithms offered in SPMF, such as CPT+, to predict the next tool that someone will use.
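To give an idea of what "predicting the next tool" means, here is a deliberately simple baseline in Python. To be clear, this is not CPT+: it only counts which tool most often followed the last tool used, and it ignores the gained attributes. All names and the toy history are made up for illustration.

```python
from collections import Counter, defaultdict


def train(sequences):
    """Count, for each tool, which tools directly followed it in past sequences."""
    followers = defaultdict(Counter)
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            followers[current][nxt] += 1
    return followers


def predict_next(followers, current_sequence, k=3):
    """Return up to k tools that most often followed the last tool used."""
    if not current_sequence:
        return []
    last = current_sequence[-1]
    counts = followers.get(last, Counter())  # unseen tool -> no prediction
    return [tool for tool, _ in counts.most_common(k)]


# Toy history of past tool sequences (made up)
history = [
    ["A", "B", "C", "D"],
    ["A", "B", "D"],
    ["B", "C"],
    ["A", "C"],
]
model = train(history)
print(predict_next(model, ["A", "B"]))  # ['C', 'D']
```

A baseline like this is useful mainly as a point of comparison: if a more elaborate sequence prediction model cannot beat it on your data, the extra complexity is probably not paying off.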

Or, if you think the sequential order is not important, you could also consider the association rule or itemset mining algorithms, but they don't consider time or ordering.
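For that option, each sequence would have to be flattened into a single transaction (one set of items). A minimal sketch, assuming the usual transaction layout of one line per transaction with items separated by spaces and sorted; the function name is hypothetical:

```python
def to_transaction(sequence):
    """Flatten a sequence of itemsets into one sorted, space-separated line."""
    items = set().union(*sequence)  # merge all itemsets, dropping duplicates
    return " ".join(str(i) for i in sorted(items))


print(to_transaction([{1}, {2, 3}, {4, 5, 6}]))  # 1 2 3 4 5 6
print(to_transaction([{1}, {2, 3}]))             # 1 2 3
print(to_transaction([{2, 3}, {1}]))             # 1 2 3
```

The last two sequences use the tools in a different order but give the same transaction, which shows what is lost when the ordering is dropped.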

Maybe you can also find some other ideas.

Hope this gives you some ideas!

Best regards

Re: Right approach for problem unclear

Posted by: Gilbert

Date: July 22, 2020 09:41AM

Awesome, thank you for the pointers! Once I get some results I will let you know what worked :-)

Re: Right approach for problem unclear

Posted by: webmasterphilfv

Date: July 22, 2020 05:23PM

Ok! Good. Hope you get some good results. :-)
