I have a database table that captures web resource accesses (e.g., web pages) from users' sessions. It is fairly easy to construct sequences of accesses from this table.
I am interested in mining common patterns of web usage from these sequences. Basically, if one represents each web resource as a character, the problem is that of finding commonly occurring substrings.
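To make the goal concrete, here is a minimal brute-force sketch of what I mean by "commonly occurring substrings". It is not an SPMF algorithm; the function name `frequent_substrings` and its parameters are hypothetical, and support is assumed to mean the number of sessions in which a contiguous run appears at least once.

```python
from collections import Counter


def frequent_substrings(sequences, min_support, min_len=2, max_len=5):
    """Count contiguous runs (n-grams) that occur in at least min_support sequences.

    Hypothetical helper for illustration only; each sequence is any
    indexable collection of hashable items (e.g. a string of page codes).
    """
    support = Counter()
    for seq in sequences:
        seen = set()  # count each pattern at most once per sequence
        for n in range(min_len, max_len + 1):
            for i in range(len(seq) - n + 1):
                seen.add(tuple(seq[i:i + n]))
        support.update(seen)
    return {pattern: count for pattern, count in support.items()
            if count >= min_support}


if __name__ == "__main__":
    sessions = ["ABCD", "XABC", "ABXC"]
    for pattern, count in sorted(frequent_substrings(sessions, 2).items()):
        print("".join(pattern), count)
```

This enumerates every window up to `max_len`, so it is only meant to pin down the desired output, not to scale to a large access log.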
SPMF offers many algorithms for mining frequent sequential patterns from input sequences of itemsets. However, most of them seem to have two drawbacks for the particular problem I'm trying to solve:
- The discovered patterns may skip over elements of the original sequences. For example, given the two sequences "ABXC" and "AXBC", the algorithms I'm aware of will report "ABC" as a common sequence, even though it does not occur contiguously in either one. For my work, each element of a discovered pattern must immediately follow the previous element in the input sequences.
- They are designed for itemsets of arbitrary size. In my case, each itemset always consists of a single element. I'm guessing most of these algorithms pay a performance penalty for handling the more general case.
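The difference described in the first point can be sketched as follows. Both helper names (`is_subsequence`, `is_contiguous`) are hypothetical and only illustrate the two matching rules; they are not part of SPMF.

```python
def is_subsequence(pattern, seq):
    """True if pattern's items appear in seq in order, gaps allowed."""
    it = iter(seq)
    # 'item in it' advances the iterator, so order is preserved
    return all(item in it for item in pattern)


def is_contiguous(pattern, seq):
    """True if pattern appears in seq as an unbroken run (what I want)."""
    n = len(pattern)
    return any(list(seq[i:i + n]) == list(pattern)
               for i in range(len(seq) - n + 1))


if __name__ == "__main__":
    for seq in ("ABXC", "AXBC"):
        print(seq, is_subsequence("ABC", seq), is_contiguous("ABC", seq))
```

Under the gap-allowing rule, "ABC" matches both example sequences; under the contiguous rule it matches neither, which is the behavior I'm after.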
Do any of the SPMF algorithms avoid these problems? In particular, I'd like to know of an algorithm that only reports patterns whose elements appear consecutively in the input sequences, rather than skipping over mismatches.