Dear Charles,
Happy new year!
This is a very interesting topic and problem, and I can see that you have put a lot of effort and thought into it.
Since this data is a very long sequence of itemsets, you could definitely apply a frequent episode mining algorithm to find episodes that appear multiple times in your sequence.
In episode mining, if you have data like this:
t0 a c e f g
t1 a c e f g
t2 a d e f h
t3 b d e f g
you could find, for example, that (ac) followed by (g) has two minimal occurrences. The first one is from t0 to t1. The second one is from t1 to t3. Thus, the support of (ac) followed by (g) is 2. This would be the result of a typical algorithm like EMMA or MINEPI+ on the above data. But there are also other variations of episode mining and other ways to calculate the support of episodes.
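To make the counting concrete, here is a small brute-force sketch in Python that enumerates the minimal occurrences of (ac) followed by (g) in the example above. It is only an illustration of the definition, not how EMMA or MINEPI+ work internally. The function name minimal_occurrences and the use of integer timestamps 0..3 for t0..t3 are my own choices for this example.

```python
from typing import List, Set, Tuple

# The example sequence: one itemset per timestamp t0..t3.
sequence: List[Set[str]] = [
    {"a", "c", "e", "f", "g"},  # t0
    {"a", "c", "e", "f", "g"},  # t1
    {"a", "d", "e", "f", "h"},  # t2
    {"b", "d", "e", "f", "g"},  # t3
]


def minimal_occurrences(
    seq: List[Set[str]], first: Set[str], second: Set[str]
) -> List[Tuple[int, int]]:
    """Return the minimal occurrences (start, end) of 'first' followed by 'second'.

    An occurrence is a pair (i, j) with i < j such that 'first' is contained
    in seq[i] and 'second' is contained in seq[j]. An occurrence is minimal
    if no other occurrence lies inside its time interval.
    """
    occurrences = [
        (i, j)
        for i in range(len(seq))
        for j in range(i + 1, len(seq))
        if first <= seq[i] and second <= seq[j]
    ]

    def contains_another(occ: Tuple[int, int]) -> bool:
        start, end = occ
        return any(
            other != occ and start <= other[0] and other[1] <= end
            for other in occurrences
        )

    return [occ for occ in occurrences if not contains_another(occ)]


result = minimal_occurrences(sequence, {"a", "c"}, {"g"})
print(result)       # prints [(0, 1), (1, 3)]
print(len(result))  # prints 2, the support under this definition
```

The interval from t0 to t3 is also an occurrence, but it is not minimal because it contains the occurrence from t0 to t1, so it is not counted.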
However, a drawback of episode mining algorithms is that most of them use a fixed window length. So if you want to find long episodes, you need to use a large window size. With a large window, the window constraint becomes less effective for reducing the search space, although the algorithm can still use the support to prune it.
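As a rough illustration of the window effect, the sketch below counts how many of the two minimal occurrences from the example fit within a given window. The parameter name max_span and the rule "end - start <= max_span" are assumptions made for this example only; the exact window definition differs between algorithms.

```python
from typing import List, Tuple

# The two minimal occurrences of (ac) followed by (g) from the example,
# written as (start, end) timestamps: t0..t1 and t1..t3.
example_occurrences: List[Tuple[int, int]] = [(0, 1), (1, 3)]


def support_within_window(
    occurrences: List[Tuple[int, int]], max_span: int
) -> int:
    """Count the occurrences whose duration (end - start) is at most max_span."""
    return sum(1 for start, end in occurrences if end - start <= max_span)


for max_span in (1, 2):
    print(max_span, support_within_window(example_occurrences, max_span))
# prints:
# 1 1
# 2 2
```

With the smaller window, the occurrence from t1 to t3 no longer counts, so the support drops from 2 to 1. This is why longer episodes require a larger window.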
If you want to try frequent episode mining on your data, the good news is that I will soon release the source code of these algorithms in SPMF:
- MINEPI for frequent episode mining
- MINEPI+ for frequent episode mining
- EMMA for frequent episode mining
- HUE-SPAN for high utility episode mining (not relevant for your data)
The code of these algorithms is ready. I just need to integrate it into SPMF, which may take one or two weeks because I have many things to do. But if you want early access to the code of those algorithms only, I can share it with you by e-mail.
That is my answer regarding episode mining.
I will think a little about it and maybe add another message below with some more comments on your problem.
Best,
Philippe