data format

The Data Mining Forum

data format

Posted by: Ron

Date: October 20, 2012 06:41PM

Hi All,

The data format in the examples that have parens, like this:

S1 (1), (1 2 3), (1 3), (4), (3 6)
S2 (1 4), (3), (2 3), (1 5)
S3 (5 6), (1 2), (4 6), (3), (2)
S4 (5), (7), (1 6), (3), (2), (3)

doesn't appear to match the actual files that come with the source code.

Why are they different? Where is the description of the file formats?

Regards,

Ron

Re: data format

Posted by: webmasterphilfv

Date: October 20, 2012 06:50PM

Hi Ron,

Welcome to the forum. The reason why I don't use the same syntax on the documentation page is to make it more readable.

The actual file format is very similar. The only difference is that it uses -1 and -2 as separators between itemsets and sequences instead of parentheses.

Here is the full explanation of the file format for sequence databases:
- each line represents a sequence (a list of itemsets)
- an itemset is a list of items separated by spaces.
- an item is represented by an integer (1, 2, ...)
- each itemset is followed by -1.
- each sequence ends with -2.

So for example, the sequence (1), (1 2 3), (1 3), (4), (3 6) would be represented as follows in the file:

1 -1 1 2 3 -1 1 3 -1 4 -1 3 6 -1 -2

The meaning of the sequence (1), (1 2 3), (1 3), (4), (3 6) is that item 1 occurred, followed by items 1, 2 and 3 at the same time, followed by 1 and 3 at the same time, followed by 4, followed by 3 and 6 at the same time.
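
To make the role of the separators concrete, here is a minimal sketch (not part of SPMF) of how one line in this format could be read into a list of itemsets. The class name SpmfLineParser and the method parseLine are made up for illustration, and the sketch assumes a well-formed line that contains only integers:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of SPMF: parses one line of the -1/-2 format.
final class SpmfLineParser {

    // Returns the sequence as a list of itemsets, each itemset being a list of items.
    static List<List<Integer>> parseLine(String line) {
        List<List<Integer>> sequence = new ArrayList<>();
        List<Integer> itemset = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            int value = Integer.parseInt(token);
            if (value == -2) {
                break;                        // -2 marks the end of the sequence
            } else if (value == -1) {
                sequence.add(itemset);        // -1 closes the current itemset
                itemset = new ArrayList<>();
            } else {
                itemset.add(value);           // any other integer is an item
            }
        }
        return sequence;
    }

    public static void main(String[] args) {
        // prints [[1], [1, 2, 3], [1, 3], [4], [3, 6]]
        System.out.println(parseLine("1 -1 1 2 3 -1 1 3 -1 4 -1 3 6 -1 -2"));
    }
}
```

Each -1 closes the itemset that precedes it, which is why the last itemset is also followed by -1 before the final -2.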

From your question, I see that you are interested in either sequential pattern mining or sequential rule mining. In the SPMF project, some algorithms take different input formats, but it is generally easy to see how each one works by comparing the documentation with the example files. In any case, if you have any further questions about this, please let me know!

Hope this helps,

Best,

Philippe

Edited 6 time(s). Last edit at 10/20/2012 06:58PM by webmasterphilfv.

Re: data format

Posted by: Ron

Date: October 21, 2012 06:53AM

Thanks Philippe,

It's very hard to read the -1 -2 format, and understand the fine documentation that you've created. If a person wants to make their own datasets, what's the best way to do it? How hard would it be to make a parser or translator to convert from parens to the -1 -2 format? If we could do that, it would be a big help, right?

Regards,

Ron

Re: data format

Posted by: webmasterphilfv

Date: October 22, 2012 05:13AM

Hi Ron,

Thanks for the suggestion.

I wrote some code for you. If you want to use a parenthesis format, you can replace the method loadFile() in SequenceDatabase.java, for the algorithm that you are using, with this:

// Requires imports from java.io: BufferedReader, File, FileInputStream,
// InputStreamReader and IOException.
public void loadFileParenthesisFormat(String path) throws IOException {
	String thisLine;
	BufferedReader myInput = null;
	try {
		FileInputStream fin = new FileInputStream(new File(path));
		myInput = new BufferedReader(new InputStreamReader(fin));
		int seqID = 0;
		while ((thisLine = myInput.readLine()) != null) {
			thisLine = thisLine.trim();
			// skip empty lines and comment lines
			if (thisLine.isEmpty() || thisLine.charAt(0) == '#') {
				continue;
			}
			Sequence sequence = new Sequence(seqID++);
			Itemset itemset = null;
			// each token is one item, possibly prefixed by '(' or suffixed by ')'
			for (String itemString : thisLine.split("\\s+")) {
				int start = 0;
				int end = itemString.length();
				if (itemString.charAt(0) == '(') {
					// an opening parenthesis starts a new itemset
					itemset = new Itemset();
					start = 1;
				}
				boolean closesItemset = itemString.charAt(end - 1) == ')';
				if (closesItemset) {
					end--;
				}
				if (itemset == null) {
					throw new IOException("Item outside of parentheses: " + itemString);
				}
				Integer item = Integer.parseInt(itemString.substring(start, end));
				itemset.addItem(item);
				if (closesItemset) {
					// a closing parenthesis ends the current itemset
					sequence.addItemset(itemset);
					itemset = null;
				}
			}
			sequences.add(sequence);
		}
	} catch (Exception e) {
		e.printStackTrace();
	} finally {
		if (myInput != null) {
			myInput.close();
		}
	}
}

This method will allow the algorithm to read files according to this format:

(1) (1 2 3) (1 3) (4) (3 6)
(1 4) (3) (2 3) (1 5)
(5 6) (1 2) (4 6) (3) (2)
(5) (7) (1 6) (3) (2) (3)

For now, I will keep the -1 -2 format as the default SPMF format, to ensure compatibility with previous versions and because some other algorithms also use this format. But I will consider changing it in future versions. Thanks for the feedback!
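
Since the -1 -2 format stays the default, another option is to leave loadFile() untouched and translate a parenthesis-format line into the -1 -2 format before giving the file to an algorithm. Here is a minimal sketch of such a translator. It is not part of SPMF: the class name ParenthesisToSpmf and the method convertLine are made up for illustration, and it assumes the same input as above, that is, items separated by spaces, with no commas and no spaces inside the parentheses other than between items:

```java
// Hypothetical helper, not part of SPMF: converts one line from the
// parenthesis format to the -1/-2 format.
final class ParenthesisToSpmf {

    static String convertLine(String line) {
        StringBuilder out = new StringBuilder();
        for (String token : line.trim().split("\\s+")) {
            // a token ending with ')' is the last item of its itemset
            boolean closesItemset = token.endsWith(")");
            // strip the parentheses so that only the integer remains
            String digits = token.replace("(", "").replace(")", "");
            // parseInt rejects anything that is not an integer
            out.append(Integer.parseInt(digits)).append(' ');
            if (closesItemset) {
                out.append("-1 ");            // itemset separator
            }
        }
        return out.append("-2").toString();   // end of the sequence
    }

    public static void main(String[] args) {
        // prints 1 -1 1 2 3 -1 1 3 -1 4 -1 3 6 -1 -2
        System.out.println(convertLine("(1) (1 2 3) (1 3) (4) (3 6)"));
    }
}
```

Applying convertLine to every line of a file and writing the results to a new file would give a dataset in the default format.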

Best,

Philippe

Edited 4 time(s). Last edit at 10/22/2012 05:17AM by webmasterphilfv.

Re: data format

Posted by: Ron

Date: October 22, 2012 02:26PM

Thank you!
