Using CM-SPAM for very long sequences
Date: September 10, 2020 02:58AM
I am new to SPMF and am exploring the use of sequential pattern mining for my dataset. I'm finding the tool very useful, but have run into some difficulties when using it on longer sequences.
My research uses unstructured data relating to tutor-student interactions. I have labelled the data using a framework, and now have a set of very long sequences, with each sequence relating to one tutoring session. The labels are positive integers, and I have prepared the input file following the instructions on the SPMF website (i.e. every event is followed by a -1, and there is a -2 at the end of every sequence). The sequences consist of a large number of single events rather than item-sets, so I have treated each event as its own item-set and separated the events with -1.
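To make the encoding described above concrete, here is a minimal sketch of it. The function name `to_spmf_line` and the sample labels are hypothetical and only for illustration. It assumes each session is a list of positive integer labels and that each event forms its own item-set.

```python
def to_spmf_line(events):
    """Encode one session as a single line: 'label -1 label -1 ... -2'.

    `events` is a list of positive integer labels (hypothetical sample data).
    Each event is treated as its own item-set, so each is followed by -1,
    and the whole sequence is closed with -2.
    """
    parts = []
    for label in events:
        if label <= 0:
            raise ValueError("labels must be positive integers")
        parts.append(str(label))
        parts.append("-1")
    parts.append("-2")
    return " ".join(parts)


# Hypothetical example: two short sessions, one output line per session.
sessions = [[3, 7, 3, 12], [7, 12]]
lines = [to_spmf_line(s) for s in sessions]
```

Written this way, every session ends up on exactly one line of the input file, however long that line is.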
When I run the CM-SPAM algorithm on a set of 8 sequences with a minsup of 50%, the output file shows a number of patterns with a support of either 3 or 2. I can tell manually that the support should be higher for many of these patterns, which leads me to think that CM-SPAM is not recognizing the bottom 5 sequences for some reason. I'm not sure why this is. The sequences are very long and tend to spill over onto multiple lines, so this may be affecting how the algorithm reads the file. Could this be a possibility? If so, how could I adjust the files?
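One way to test the line-break hypothesis is to rebuild the file so that each sequence sits on exactly one line, and to count how many sequences result. The sketch below assumes that the input reader treats each physical line as one sequence and that every sequence ends with -2. Whether the line breaks are actually the cause here is not confirmed. The function name `rejoin_sequences` is hypothetical.

```python
def rejoin_sequences(text):
    """Return one string per sequence, ignoring where line breaks fall.

    Tokens are split on any whitespace (including newlines) and regrouped
    so that each returned string ends at its -2 terminator.
    """
    sequences = []
    current = []
    for token in text.split():
        current.append(token)
        if token == "-2":
            sequences.append(" ".join(current))
            current = []
    if current:
        # Leftover tokens mean the last sequence has no -2 terminator.
        raise ValueError("input ends without a closing -2")
    return sequences


# Hypothetical wrapped input: the first sequence is broken across two lines.
wrapped = "1 -1 2 -1\n3 -1 -2\n4 -1 5 -1 -2\n"
fixed = rejoin_sequences(wrapped)
```

If the number of rejoined sequences matches the expected number of sessions (8 in this case) but the original file has more lines than that, hard line breaks inside sequences are a likely suspect. Writing `"\n".join(fixed)` to a new file gives an input with one sequence per line. If the long lines are only soft-wrapped by the text editor, the file contents are unchanged and the cause lies elsewhere.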
Thanks in advance for the help.