UCI repository datasets transformation
Posted by: user
Date: March 10, 2022 01:07AM

Dear Sir,

First, thanks for your informative website for the data mining community.
I would like to ask you about the best method to transform datasets from UCI to be usable for FIM methods.

Kind regards,

Re: UCI repository datasets transformation
Date: March 10, 2022 04:04PM


Glad the website is useful, and thanks for posting.

UCI hosts many datasets, and they come in many different formats, so the best way to convert a dataset depends on its format.

To convert a dataset for FIM, you need to decide what the transactions will be and what the items will be. For example, if the data in a dataset is numerical, you may have to discretize it to obtain items. It all depends on what you want to do.
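
To illustrate the discretization idea, here is a minimal sketch using equal-width bins. It is only one possible approach. The class name DiscretizeSketch, the choice of 3 bins, the item numbering scheme, and the sample values are all assumptions made for this example, not something taken from a specific UCI dataset.

```java
import java.util.ArrayList;
import java.util.List;

class DiscretizeSketch {

    // Maps a numeric value to a bin index in [0, bins - 1],
    // using equal-width bins over the range [min, max].
    static int binIndex(double value, double min, double max, int bins) {
        if (max <= min) {
            return 0; // degenerate range: everything falls in a single bin
        }
        int index = (int) ((value - min) / (max - min) * bins);
        // clamp so that value == max (or an out-of-range value) gets a valid bin
        return Math.max(0, Math.min(bins - 1, index));
    }

    // Turns one numeric record into a transaction with one item per attribute.
    // Assumed numbering: item = attributeIndex * bins + binIndex + 1,
    // so each (attribute, bin) pair gets its own positive integer.
    static List<Integer> toTransaction(double[] record, double[] mins,
            double[] maxs, int bins) {
        List<Integer> items = new ArrayList<Integer>();
        for (int a = 0; a < record.length; a++) {
            items.add(a * bins + binIndex(record[a], mins[a], maxs[a], bins) + 1);
        }
        return items;
    }

    public static void main(String[] args) {
        // hypothetical minimum and maximum of two numeric attributes
        double[] mins = {4.0, 2.0};
        double[] maxs = {8.0, 4.0};
        double[] record = {5.1, 3.5};
        System.out.println(toTransaction(record, mins, maxs, 3)); // prints [1, 6]
    }
}
```

With this numbering, the items of a transaction are already in ascending order because the attributes are processed from left to right. Other choices (equal-frequency bins, bins chosen by hand) may suit a given dataset better.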

As for converting the dataset, I think the best way is to write a small program that reads the input file and writes another file in the new format. This only requires simple programming skills (reading and writing files). When I convert a dataset, I write a program in Java, but you could use any language, such as Python.

Hope this helps


Re: UCI repository datasets transformation
Posted by: user
Date: March 23, 2022 05:57AM

I really appreciate your kind response.

You are right, it is easy to write and read. I would appreciate it if you would share your Java program as an example for converting UCI to FIM for a specific dataset. Your example will encourage many, including me, to convert more datasets.

Best regards,

Re: UCI repository datasets transformation
Date: March 23, 2022 10:01AM

Good evening,

Here is a piece of Java code that I use for converting a CSV file into another format.

The input is like this:


The output is like this:

1 2 3 4
5 6 7 8
5 6 7
1 2 3

The Java code:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class Example {

	public static void main(String[] args) throws IOException {
		BufferedReader myInput = null;
		try {
			String input = "test.csv";
			String output = "output.txt";
			// we create an object for writing the output file
			BufferedWriter writer = new BufferedWriter(new FileWriter(output));
			// Objects to read the file
			FileInputStream fin = new FileInputStream(new File(input));
			myInput = new BufferedReader(new InputStreamReader(fin));
			int count = 0; // to count the number of lines

			String thisLine; // variable to read a line
			// we read the file line by line until the end of the file
			while ((thisLine = myInput.readLine()) != null) {
				// if not the first line, we create a new line
				if (count != 0) {
					writer.newLine(); // create new line
				}
				// we split the line according to commas
				String[] split = thisLine.split(",");
				// we use a set to store the values to avoid duplicates
				// because they are not allowed in a transaction
				Set<Integer> values = new HashSet<Integer>();
				for (int i = 0; i < split.length; i++) {
					values.add(Integer.parseInt(split[i]));
				}
				// sort the items of the transaction in ascending order
				List<Integer> listValues = new ArrayList<Integer>(values);
				Collections.sort(listValues);
				// for each item, we will output it
				for (int i = 0; i < listValues.size(); i++) {
					if (i != listValues.size() - 1) {
						// if not the last item
						// write the item followed by an item separator
						writer.write(listValues.get(i) + " ");
					} else {
						// if the last item
						// write the item
						writer.write(listValues.get(i) + "");
					}
				}
				count++; // increase the number of lines
			}
			// close the output file
			writer.close();
		} catch (Exception e) {
			e.printStackTrace();
		} finally {
			// close the input file
			if (myInput != null) {
				myInput.close();
			}
		}
	}
}
Basically, I create two objects: one for reading a file and one for writing a file. Then I read the input file line by line and split each line into tokens according to ",". For each line, I remove duplicate values, sort the remaining values, and write them to the output file separated by spaces.

Depending on the input format, it could be more complicated than this. But here, it is very simple.
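
As an example of a slightly more complicated case, many UCI files contain categorical values (letters or words) rather than integers. The following is a minimal sketch, not a definitive implementation, that gives each (column, value) pair its own integer item. The class name CategoricalSketch, the sample lines, and the assumption that "?" marks a missing value are all hypothetical and should be adapted to the dataset at hand.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CategoricalSketch {

    // Dictionary from "columnIndex:value" to item id,
    // filled in order of first appearance
    private final Map<String, Integer> itemIds = new LinkedHashMap<String, Integer>();

    // Converts one comma-separated line of categorical values
    // into a space-separated transaction of sorted integer items
    String convertLine(String line) {
        String[] split = line.split(",");
        List<Integer> items = new ArrayList<Integer>();
        for (int i = 0; i < split.length; i++) {
            String value = split[i].trim();
            // assumption: an empty field or "?" denotes a missing value
            if (value.isEmpty() || value.equals("?")) {
                continue;
            }
            // the column index is part of the key, so the same value
            // in two different columns gives two different items
            String key = i + ":" + value;
            Integer id = itemIds.get(key);
            if (id == null) {
                id = itemIds.size() + 1; // next unused positive integer
                itemIds.put(key, id);
            }
            items.add(id);
        }
        // ids are assigned by first appearance, so sort before writing
        Collections.sort(items);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < items.size(); i++) {
            if (i > 0) {
                out.append(' ');
            }
            out.append(items.get(i));
        }
        return out.toString();
    }

    // Gives access to the dictionary, e.g. to save it next to the output file
    Map<String, Integer> dictionary() {
        return itemIds;
    }

    public static void main(String[] args) {
        CategoricalSketch converter = new CategoricalSketch();
        System.out.println(converter.convertLine("x,s,n")); // prints 1 2 3
        System.out.println(converter.convertLine("b,s,?")); // prints 2 4
    }
}
```

Keeping the dictionary is useful because it lets you translate the integer items in the discovered patterns back to the original attribute values.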
Hope that this helps

Best regards,


Edited 7 time(s). Last edit at 03/23/2022 10:13AM by webmasterphilfv.
