The Data Mining Forum
IMPORTANT: This is the old Data Mining forum.
I keep it online so that you can read the old messages.

Please post your new messages in the new forum: https://forum2.philippe-fournier-viger.com/index.php
 
Interestingness of patterns.
Posted by: Deniz
Date: June 11, 2021 06:12AM

Hi Prof,

A plethora of methods has been proposed to discover different kinds of patterns (frequent, sequential, closed, maximal patterns).

The question is whether the whole set of generated patterns is interesting or not.

We should care more about the interestingness of the generated patterns. Of course, ARM is used but it is not optimal since it is general.

It would be appreciated if you could elaborate on how we can assess whether these patterns are interesting. For example, in the health sector, a huge number of patterns is generated. How does a researcher assess the generated patterns to say that a pattern is well known (uninteresting), interesting (as it is unexpected, not known before), or just noise?

regards,

Re: Interestingness of patterns.
Date: June 11, 2021 07:44AM

Good evening,

Yes, there are indeed many pattern types (itemsets, sequential patterns, episodes, etc.), and also many different measures (support, utility, etc.) that can be used to select patterns.

How to know if a pattern is interesting?

Some criteria can be subjective (more like an opinion). For example, I discover that {bread,milk} is purchased by many customers, but I know this already, so for me it is not interesting because it is not novel.

Some other criteria are more objective. For example, in top-k high utility itemset mining, we want to find the k patterns that yield the most profit. Money is something that directly makes sense for a company.
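To make the idea of "the k patterns that make the most profit" concrete, here is a minimal brute-force sketch. The store data, the unit profits and the function names are all made up for illustration, and real top-k high utility itemset mining algorithms use pruning strategies rather than enumerating every itemset as done here.

```python
from itertools import combinations

def utility(itemset, transactions, unit_profit):
    """Total profit of `itemset`, summed over the transactions that contain all its items."""
    total = 0
    for t in transactions:  # t maps each purchased item to its quantity
        if all(item in t for item in itemset):
            total += sum(t[item] * unit_profit[item] for item in itemset)
    return total

def top_k_high_utility(transactions, unit_profit, k):
    """Brute force: score every possible itemset and keep the k most profitable."""
    items = sorted(unit_profit)
    scored = []
    for size in range(1, len(items) + 1):
        for itemset in combinations(items, size):
            u = utility(itemset, transactions, unit_profit)
            if u > 0:
                scored.append((u, itemset))
    # Highest utility first; ties broken alphabetically for a stable result.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:k]

# Hypothetical data: profit per unit sold, and four customer transactions.
unit_profit = {"bread": 1, "milk": 2, "computer": 300}
transactions = [
    {"bread": 2, "milk": 1},
    {"bread": 1, "computer": 1},
    {"milk": 3},
    {"bread": 1, "milk": 2, "computer": 1},
]

top2 = top_k_high_utility(transactions, unit_profit, k=2)
```

With this toy data, {bread, milk} appears as often as any itemset containing a computer, but the itemsets with a computer dominate the ranking because of the profit they generate.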

But for several pattern mining tasks or measures, whether a pattern is interesting or not is less clear. For example, in frequent itemset mining, we aim to find patterns that appear frequently in the data... But even if such patterns are frequent, they may still be just the result of chance. For example, if everyone buys bread in a store, I could find a pattern {bread,computer}, but that would be a spurious pattern: the items appear together many times just because everyone buys bread... not because of a special relationship between these items.
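A small sketch of the {bread,computer} situation, using made-up transactions. It compares the observed support of the pair with the support that would be expected if the two items were independent (this ratio is commonly called the lift). The data and function names are only assumptions for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def lift(transactions, a, b):
    """Observed support of {a, b} divided by the support expected under independence."""
    expected = support(transactions, {a}) * support(transactions, {b})
    return support(transactions, {a, b}) / expected

# Hypothetical toy store: everyone buys bread, 3 of 10 customers buy a computer.
transactions = (
    [{"bread", "computer"}] * 3
    + [{"bread", "milk"}] * 4
    + [{"bread"}] * 3
)

bread_computer_support = support(transactions, {"bread", "computer"})
bread_computer_lift = lift(transactions, "bread", "computer")
```

Here {bread,computer} appears in 30% of the transactions, which could pass a minimum support threshold, but the lift is 1: the pair occurs exactly as often as chance alone would predict.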

So to address this kind of problem, an interesting research direction is to find "correlated patterns" or "statistically significant patterns". For example, in correlated itemset mining, we look for patterns where the items are strongly correlated with each other according to some measure such as the bond.
And to find statistically significant patterns, we use statistical tests in pattern mining. So the goal is to find patterns that represent something statistically significant... This is interesting for domains such as medical data... You could find a pattern like {drink_water, cancer} which is highly frequent, but the statistical significance that it appears more than just by chance would likely be very low... so it would be discarded.
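As an illustration of the bond measure mentioned above, here is a minimal sketch. The bond of an itemset is the number of transactions containing all of its items divided by the number of transactions containing at least one of its items. The transactions below are made up.

```python
def bond(transactions, itemset):
    """Bond = (transactions with ALL items) / (transactions with AT LEAST ONE item)."""
    itemset = set(itemset)
    conjunctive = sum(itemset <= t for t in transactions)       # all items present
    disjunctive = sum(bool(itemset & t) for t in transactions)  # any item present
    return conjunctive / disjunctive if disjunctive else 0.0

# Hypothetical transactions: bread is bought by almost everyone,
# while a mouse is almost always bought together with a computer.
transactions = (
    [{"bread", "computer"}] * 1
    + [{"bread"}] * 5
    + [{"computer", "mouse"}] * 3
    + [{"bread", "computer", "mouse"}] * 1
)

bond_computer_mouse = bond(transactions, {"computer", "mouse"})
bond_bread_computer = bond(transactions, {"bread", "computer"})
```

In this toy data the bond of {computer, mouse} is 0.8 while the bond of {bread, computer} is only 0.2, so a minimum bond threshold would keep the first itemset and discard the second.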

In my opinion, using statistical tests is a very promising approach, but the problem is that it makes the algorithms much more complicated.

Besides all of that, if we want to check whether a pattern is interesting, we can ask a domain expert. For example, in a recent paper that I published about alarm correlation rules, we asked a telecommunication expert to validate the rules that we found in a computer network. The expert could validate the rules as good or not and also indicate if they were unknown or already known. This is an interesting approach to evaluate a new pattern type.

Besides that, another way to evaluate patterns is to try to use them for some task. For example, some papers present a new pattern type and show that it is useful for tasks such as sequence prediction, clustering, etc.

Hope that this helps

Best regards

Re: Interestingness of patterns.
Posted by: Deniz
Date: June 13, 2021 10:54PM

Thank you so much,

It is really helpful.

It would be appreciated if you could please elaborate on the following.
"Using statistical tests in pattern mining. So the goal is to find patterns that represent something statistically significant... This is interesting for domains such as medical data... You could find a pattern like {drink_water, cancer} which is highly frequent but the statistical significance that it appears more than just by chance would be likely very low... so it would be discarded."

What statistical tests can be used on medical data, and how can we use them?

Kind regards,
Deniz

Re: Interestingness of patterns.
Date: June 19, 2021 08:48AM

Dear Deniz,

About statistical testing in pattern mining, there are some papers on this. The main idea is the following. Let's say that we want to find frequent patterns in the data. A pattern could be: {drink_tea, liver_cancer, drink_alcool}

But this pattern may appear frequently just by chance, not because there is a strong correlation between these items.

As a solution, some papers, like the one by Webb about productive itemsets, use the Fisher exact test to verify whether the observed frequency of {drink_tea, liver_cancer, drink_alcool} is significantly different from what would be expected if its subsets such as
{drink_tea, liver_cancer}, {liver_cancer, drink_alcool} and {drink_alcool, drink_tea} were independent.

If there is no significant difference, then we can assume that {drink_tea, liver_cancer, drink_alcool} is not something special; maybe it is just happening by chance.

But if there is a significant difference, then it means that there is probably something special. This itemset appears more often than would be expected!
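To give a feeling for the test, here is a minimal sketch of a one-sided Fisher exact test written with the Python standard library only. It checks a single split of the itemset into two parts X and Y (for example X = {drink_tea} and Y = {liver_cancer, drink_alcool}); the counts are invented, the function name is hypothetical, and this is only an illustration of the idea, not the full procedure from the papers, which involves more than one such test.

```python
from math import comb

def fisher_exact_greater(both, only_x, only_y, neither):
    """One-sided Fisher exact test on a 2x2 table of record counts.

    Returns the probability of observing at least `both` records containing
    X and Y together if X and Y were independent (row/column totals fixed).
    """
    n = both + only_x + only_y + neither
    x_total = both + only_x          # records containing X
    y_total = both + only_y          # records containing Y
    max_both = min(x_total, y_total)
    # Hypergeometric tail: sum the probabilities of tables at least as extreme.
    favourable = sum(
        comb(x_total, k) * comb(n - x_total, y_total - k)
        for k in range(both, max_both + 1)
    )
    return favourable / comb(n, y_total)

# Hypothetical counts over 100 patient records, X in 50 records, Y in 40 records.
# Case 1: X and Y co-occur 20 times, exactly what independence predicts (50 * 40 / 100).
p_chance = fisher_exact_greater(both=20, only_x=30, only_y=20, neither=30)

# Case 2: X and Y co-occur 35 times, far more than the 20 expected by chance.
p_strong = fisher_exact_greater(both=35, only_x=15, only_y=5, neither=45)
```

In the first case the p-value is large, so the pattern would be treated as possibly due to chance; in the second case the p-value is far below usual significance levels such as 0.05, so the pattern would be kept.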

That is the main idea. To my knowledge, that approach was started by Geoff Webb. I have also written some papers on this topic:


Xiang Li, Jiaxuan Li, Philippe Fournier-Viger, M. Saqib Nawaz, Jie Yao, Jerry Chun-Wei Lin:
Mining Productive Itemsets in Dynamic Databases. IEEE Access 8: 140122-140144 (2020)

The algorithm of Webb to find self-sufficient itemsets is implemented in SPMF. It is called Opus-Miner.

In recent years, some other algorithms have also been proposed which may use different approaches for statistical tests, but I have not read them.

Best regards,



This forum is powered by Phorum and provided by P. Fournier-Viger (© 2012).