The Data Mining Forum
IMPORTANT: This is the old Data Mining forum.
I keep it online so that you can read the old messages.

Please post your new messages in the new forum: https://forum2.philippe-fournier-viger.com/index.php
 
data format
Posted by: Ron
Date: October 20, 2012 06:41PM

Hi All,

The data format in the examples that have parens, like this:

S1 (1), (1 2 3), (1 3), (4), (3 6)
S2 (1 4), (3), (2 3), (1 5)
S3 (5 6), (1 2), (4 6), (3), (2)
S4 (5), (7), (1 6), (3), (2), (3)

doesn't appear to match the actual files that come with the source
code.

Why are they different? Where is the description of the file formats?

Regards,

Ron

Re: data format
Date: October 20, 2012 06:50PM

Hi Ron,

Welcome to the forum. I use a different syntax on the documentation page to make the examples more readable.

The actual file format is very similar. Instead of parentheses, it uses -1 to mark the end of each itemset and -2 to mark the end of each sequence.

Here is the full explanation of the file format for sequence databases:
- each line represents a sequence (a list of itemsets).
- an itemset is a list of items separated by spaces.
- an item is represented by an integer (1, 2, ...).
- each itemset is followed by -1.
- each sequence ends with -2.

So for example, the sequence (1), (1 2 3), (1 3), (4), (3 6) would be represented as follows in the file:

1 -1 1 2 3 -1 1 3 -1 4 -1 3 6 -1 -2

The meaning of the sequence (1), (1 2 3), (1 3), (4), (3 6) is that item 1 occurred, followed by items 1, 2 and 3 at the same time, followed by 1 and 3 at the same time, followed by 4, followed by 3 and 6 at the same time.
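
To make the rules above concrete, here is a minimal sketch of how one line in the -1 -2 format could be read back into a list of itemsets. It is only an illustration under stated assumptions: the class and method names (SpmfLineParser, parseLine) are made up for this example and are not part of SPMF, and the sketch assumes a well-formed line containing only integers.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of the SPMF source code.
public class SpmfLineParser {

    // Parses one line such as "1 -1 1 2 3 -1 -2" into a list of itemsets.
    public static List<List<Integer>> parseLine(String line) {
        List<List<Integer>> sequence = new ArrayList<>();
        List<Integer> itemset = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            int value = Integer.parseInt(token);
            if (value == -2) {
                break;                        // -2 ends the sequence
            } else if (value == -1) {
                sequence.add(itemset);        // -1 ends the current itemset
                itemset = new ArrayList<>();
            } else {
                itemset.add(value);           // any other integer is an item
            }
        }
        return sequence;
    }

    public static void main(String[] args) {
        // Prints [[1], [1, 2, 3], [1, 3], [4], [3, 6]]
        System.out.println(parseLine("1 -1 1 2 3 -1 1 3 -1 4 -1 3 6 -1 -2"));
    }
}
```

The printed nested list has the same structure as the parenthesis notation used on the documentation page.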

From your question, I see that you are interested in either sequential pattern mining or sequential rule mining. In the SPMF project, some algorithms take different input formats, but it is generally easy to see how each one works by comparing the documentation with the example files. In any case, if you have any further questions about this, please let me know!

Hope this helps,

Best,

Philippe



Edited 6 time(s). Last edit at 10/20/2012 06:58PM by webmasterphilfv.

Re: data format
Posted by: Ron
Date: October 21, 2012 06:53AM

Thanks Philippe,

It's very hard to read the -1 -2 format, and understand the fine documentation that you've created. If a person wants to make their own datasets, what's the best way to do it? How hard would it be to make a parser or translator to convert from parens to the -1 -2 format? If we could do that, it would be a big help, right?


Regards,

Ron

Re: data format
Date: October 22, 2012 05:13AM

Hi Ron,

Thanks for the suggestion.

I wrote some code for you. If you want to use a parenthesis format, you can replace the loadFile() method in SequenceDatabase.java, for the algorithm that you are using, with this:

// Requires java.io.BufferedReader, File, FileInputStream, InputStreamReader and IOException.
public void loadFileParenthesisFormat(String path) throws IOException {
		String thisLine;
		BufferedReader myInput = null;
		try {
			FileInputStream fin = new FileInputStream(new File(path));
			myInput = new BufferedReader(new InputStreamReader(fin));
			int seqID = 0;
			while ((thisLine = myInput.readLine()) != null) {
				thisLine = thisLine.trim();
				// skip empty lines and comment lines (starting with '#')
				if (thisLine.isEmpty() || thisLine.charAt(0) == '#') {
					continue;
				}
				Sequence sequence = new Sequence(seqID++);
				Itemset itemset = null;
				// splitting on whitespace turns "(1 2 3)" into the tokens "(1", "2", "3)"
				for (String itemString : thisLine.split("\\s+")) {
					int start = 0;
					int end = itemString.length();
					// a leading '(' opens a new itemset
					if (itemString.charAt(0) == '(') {
						itemset = new Itemset();
						start = 1;
					}
					// a trailing ')' closes the current itemset
					boolean closesItemset = itemString.charAt(end - 1) == ')';
					if (closesItemset) {
						end--;
					}
					Integer item = Integer.parseInt(itemString.substring(start, end));
					itemset.addItem(item);
					if (closesItemset) {
						sequence.addItemset(itemset);
					}
				}
				sequences.add(sequence);
			}
		} catch (Exception e) {
			e.printStackTrace();
		} finally {
			if (myInput != null) {
				myInput.close();
			}
		}
	}

This method will allow the algorithm to read files according to this format:


(1) (1 2 3) (1 3) (4) (3 6)
(1 4) (3) (2 3) (1 5)
(5 6) (1 2) (4 6) (3) (2)
(5) (7) (1 6) (3) (2) (3)
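
For anyone who would rather keep the algorithms unchanged and convert their own files instead, a small standalone translator is another option. The following is only a rough sketch under assumptions: the class and method names (ParenToSpmf, convertLine) are invented for this example and are not part of SPMF, commas between itemsets are simply dropped, and each input line is assumed to be non-empty and well formed, without a label such as "S1" in front.

```java
// Hypothetical converter for illustration only; not part of the SPMF source code.
public class ParenToSpmf {

    // Converts a line such as "(1) (1 2 3)" into "1 -1 1 2 3 -1 -2".
    public static String convertLine(String line) {
        StringBuilder out = new StringBuilder();
        // Drop commas so that "(1), (1 2 3)" is accepted as well.
        String cleaned = line.replace(",", "").trim();
        for (String token : cleaned.split("\\s+")) {
            boolean closesItemset = token.endsWith(")");
            // Strip the parentheses to keep only the item number.
            String item = token.replace("(", "").replace(")", "");
            out.append(item).append(' ');
            if (closesItemset) {
                out.append("-1 ");   // -1 marks the end of an itemset
            }
        }
        return out.append("-2").toString();   // -2 marks the end of the sequence
    }

    public static void main(String[] args) {
        // Prints 1 -1 1 2 3 -1 1 3 -1 4 -1 3 6 -1 -2
        System.out.println(convertLine("(1) (1 2 3) (1 3) (4) (3 6)"));
    }
}
```

Running convertLine on every line of a parenthesis-format file and writing out the results would give a file in the -1 -2 format.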

For now, I will keep the -1 -2 format as the default SPMF format, to ensure compatibility with previous versions and because some other algorithms also use this format. But I will consider changing it in future versions. Thanks for the feedback!

Best,

Philippe



Edited 4 time(s). Last edit at 10/22/2012 05:17AM by webmasterphilfv.

Re: data format
Posted by: Ron
Date: October 22, 2012 02:26PM

Thank you!



This forum is powered by Phorum and provided by P. Fournier-Viger (© 2012).