HUI-Miner MapReduce

Posted by: mwk

Date: June 20, 2014 08:15PM

Hello

I need to parallelize HUI-Miner using Hadoop MapReduce, but I am at a complete loss. What would be an efficient way to do this? How should the work be split among the nodes?

Thanks

Re: HUI-Miner MapReduce

Posted by: webmasterphilfv

Date: June 21, 2014 05:35AM

HUI-Miner is a depth-first search algorithm.

Each path of the depth-first search can be explored in parallel. So a simple way to split the work is to give different paths of the depth-first search to different nodes of the MapReduce cluster.

Actually, this could even be done without MapReduce, by using basic TCP/IP communication between the computers, with each computer running the same HUI-Miner on different paths of the search space. The results are then put together.
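To make the idea of splitting the search space concrete, here is a minimal sketch in plain Python (not Hadoop code, and not the actual HUI-Miner implementation). It assumes the single items have already been put in a total order (HUI-Miner uses ascending TWU order), and it treats each first-level branch of the depth-first search as one task. The function names `first_level_tasks`, `assign_round_robin` and `explore` are hypothetical, and `explore` only enumerates itemsets; the utility-list joins and pruning of the real algorithm are left out.

```python
def first_level_tasks(ordered_items):
    """Build one task per first-level branch of the search.

    Each task is (prefix, extensions): a one-item prefix and the items
    that come after it in the order, which are the only items allowed
    to extend that prefix.
    """
    return [((item,), ordered_items[i + 1:])
            for i, item in enumerate(ordered_items)]


def assign_round_robin(tasks, n_workers):
    """Naive assignment of tasks to workers (an assumption, not a
    recommendation: branches differ a lot in size)."""
    buckets = [[] for _ in range(n_workers)]
    for i, task in enumerate(tasks):
        buckets[i % n_workers].append(task)
    return buckets


def explore(prefix, extensions):
    """Depth-first enumeration of every itemset in one branch.

    A real worker would compute utilities and prune here; this sketch
    only lists the itemsets to show which part of the search space
    the branch covers.
    """
    found = [prefix]
    for i, item in enumerate(extensions):
        found.extend(explore(prefix + (item,), extensions[i + 1:]))
    return found
```

For example, with the items `["a", "b", "c", "d"]` and two workers, the four branches are shared between the workers, and the union of what the workers explore is every non-empty itemset exactly once. Note that the branches are not the same size (the branch of the first item is the largest), so round-robin assignment would likely leave the load unbalanced in practice.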

Edited 1 time(s). Last edit at 06/21/2014 05:35AM by webmasterphilfv.

Re: HUI-Miner MapReduce

Posted by: mwk

Date: June 21, 2014 11:29AM

This is also what I found when looking at some parallel Eclat algorithms.
But would it improve performance to also distribute the work of calculating the TWU and creating the utility lists? If so, how can I efficiently parallelize these two database scans?

The TCP/IP suggestion is nice, but I have to do it with MapReduce.

Thanks Phil

Re: HUI-Miner MapReduce

Posted by: webmasterphilfv

Date: June 21, 2014 02:37PM

The first two database scans are not costly compared to the mining phase that follows them. Each scan just needs to read each line once. So even if you did this step on a single node, I think that it would not be a problem.

But the TWU calculation can be parallelized. The TWU of an item is the sum of the utilities of the transactions in which the item appears.

So, you could split the database into n smaller databases, then calculate the TWU of each item in each sub-database. Finally, a node could receive the TWU values calculated for each sub-database and, for each item, add them up. This would give the TWU of each item for the whole database.
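As a rough illustration of the steps above, here is a small sketch in plain Python that imitates the map and reduce steps (it does not use the Hadoop API). It assumes each transaction is given as a dictionary from item to its utility in that transaction; this format and the names `partial_twu` and `merge_twu` are made up for the example.

```python
from collections import Counter


def partial_twu(sub_database):
    """'Map' step: TWU of each item within one sub-database.

    Each transaction is assumed to be a dict {item: utility}.
    """
    twu = Counter()
    for transaction in sub_database:
        # Transaction utility = sum of the utilities of its items.
        transaction_utility = sum(transaction.values())
        for item in transaction:
            twu[item] += transaction_utility
    return twu


def merge_twu(partial_results):
    """'Reduce' step: add up, item by item, the partial TWU values."""
    total = Counter()
    for partial in partial_results:
        total.update(partial)  # Counter.update adds the counts
    return total
```

For example, if the database is cut into two halves, `merge_twu([partial_twu(first_half), partial_twu(second_half)])` gives the same TWU values as `partial_twu` applied to the whole database, because the TWU is just a sum over transactions.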

Edited 4 time(s). Last edit at 06/21/2014 02:40PM by webmasterphilfv.
