The Data Mining Forum
IMPORTANT: This is the old Data Mining forum.
I keep it online so that you can read the old messages.

Please post your new messages in the new forum: https://forum2.philippe-fournier-viger.com/index.php
 
How can I use context of a question to generate better question from forum corpus?
Posted by: Kunle Isiaka
Date: March 11, 2014 07:39AM

I am a new researcher in text mining, and I wish to mine knowledge from a forum corpus. I am facing the following challenge, and I would be very grateful if experienced text miners could help me out.
1. I have a post like this: "I wish to come to KL for one week with my kids and a dog. I will like to stay around Chow kit. Can anybody suggest a suitable hotel that we can use?"
2. The question detection algorithm that I used could only extract: "Can anybody suggest a suitable hotel that we can use?"
3. Without its context, this question conveys very little. A better form would be: "Can anybody suggest a pet friendly hotel around Chow kit in KL for me and my kids?"
4. In the literature, similar issues are handled using tools such as i) Conditional Random Fields, ii) Latent Dirichlet Allocation, and iii) Sequential Pattern Mining.
5. I am not familiar with these tools. Can anybody suggest where I can find a practically oriented tutorial, with test data, that illustrates their usage?

Re: How can I use context of a question to generate better question from forum corpus?
Date: March 11, 2014 10:31AM

Hi,

I'm not an expert on text mining, so I cannot say exactly what you should use. But I can tell you what sequential pattern mining is.

Sequential pattern mining consists of discovering subsequences that appear frequently in a set of sequences. A subsequence must keep the order of the items, but the items do not need to be next to each other.

For example, if you have four sequences:

A B A C G
A B C G
F A G D
E F A F

Then the sequence A C G has a support (frequency) of 2, because it appears in two sequences (the first and the second).

Similarly, B C G would also have a support of 2, while A G would have a support of 3.

When you do sequential pattern mining, the input is a set of sequences. In your case, the sequences could be sentences. You then need to specify a parameter called the "minimum support threshold". The algorithm will return all patterns that have a support greater than or equal to that threshold.

So for example, if you set the minimum support threshold to 3, you would not discover the pattern B C G because it has a support of 2, but you would discover the pattern A G because it has a support of 3. Note that this is not a full example. Other patterns would also be discovered such as A with a support of 4.
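To make the counting concrete, here is a small Python sketch of the example above. It is only an illustration: it is not SPMF code, it enumerates candidates by brute force, and a real sequential pattern mining algorithm is far more efficient. The function names and the `max_length` limit are hypothetical choices made for this sketch.

```python
from itertools import combinations

# The four example sequences from this post.
database = [
    "A B A C G".split(),
    "A B C G".split(),
    "F A G D".split(),
    "E F A F".split(),
]

def is_subsequence(pattern, sequence):
    """True if the items of pattern appear in sequence in the same order
    (not necessarily next to each other)."""
    it = iter(sequence)
    # 'item in it' consumes the iterator up to the first match,
    # so later items must be found after earlier ones.
    return all(item in it for item in pattern)

def support(pattern, database):
    """Number of sequences of the database that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database)

def frequent_patterns(database, min_support, max_length=3):
    """Brute force: collect every subsequence of up to max_length items,
    then keep those whose support is >= min_support."""
    candidates = set()
    for seq in database:
        for length in range(1, max_length + 1):
            for idx in combinations(range(len(seq)), length):
                candidates.add(tuple(seq[i] for i in idx))
    return {p: support(p, database)
            for p in candidates
            if support(p, database) >= min_support}

if __name__ == "__main__":
    print(support(["A", "C", "G"], database))
    print(support(["B", "C", "G"], database))
    print(support(["A", "G"], database))
    print(frequent_patterns(database, min_support=3))
```

With a minimum support threshold of 3, this sketch keeps only A (support 4), G (support 3) and A G (support 3) on this small database.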

Best,

Philippe



Edited 1 time(s). Last edit at 03/11/2014 10:32AM by webmasterphilfv.

Re: How can I use context of a question to generate better question from forum corpus?
Posted by: Kunle Isiaka
Date: March 12, 2014 04:28AM

Thanks Philippe. If I am to give the sentences to the algorithm, what format should they be in? I have gone through your blog on this subject, and I noticed that you used a World Cup dataset in a particular format. Could you please give a tutorial on how to prepare data in this format?
Once again, I'm grateful for your input.

Re: How can I use context of a question to generate better question from forum corpus?
Date: March 12, 2014 05:10AM

If you use SPMF, my open-source Java data mining library, you can go to the download page of the website to get the source code or the JAR file.

http://www.philippe-fournier-viger.com/spmf/index.php?link=download.php

That page has instructions on how to download, install, and run the examples (with some example files).

It also explains that the examples are described in the "documentation" page of the website. On this page, the input and output format of each example is explained.
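As a rough sketch of how sentences could be prepared, the Python snippet below assumes the text format used by the sequence database examples in the SPMF documentation: one sequence per line, items written as positive integers, "-1" marking the end of an itemset and "-2" marking the end of the sequence. Please check the documentation page of the algorithm you choose, since the exact format is an assumption here. The function name `sentences_to_spmf` and the naive whitespace tokenization are hypothetical and only for illustration.

```python
def sentences_to_spmf(sentences):
    """Convert sentences to lines in the assumed SPMF sequence format.

    Each word becomes one itemset holding a single integer ID.
    Returns the output lines and the word -> ID dictionary, which is
    needed later to translate the discovered patterns back to words.
    """
    word_ids = {}
    lines = []
    for sentence in sentences:
        tokens = sentence.lower().split()  # naive tokenization (assumption)
        if not tokens:
            continue  # skip empty sentences
        parts = []
        for tok in tokens:
            if tok not in word_ids:
                word_ids[tok] = len(word_ids) + 1  # IDs start at 1
            parts.append(f"{word_ids[tok]} -1")
        lines.append(" ".join(parts) + " -2")
    return lines, word_ids

if __name__ == "__main__":
    lines, word_ids = sentences_to_spmf([
        "can anybody suggest a hotel",
        "suggest a hotel",
    ])
    for line in lines:
        print(line)
    print(word_ids)
```

The lines can then be written to a text file and given as input to a sequential pattern mining algorithm. In practice you would probably also want to remove punctuation and stop words before the conversion.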



Edited 1 time(s). Last edit at 03/12/2014 05:10AM by webmasterphilfv.



This forum is powered by Phorum and provided by P. Fournier-Viger (© 2012).