The Data Mining Forum
HUI-Miner MapReduce
Posted by: mwk
Date: June 20, 2014 08:15PM

Hello

I need to parallelize HUI-Miner using Hadoop MapReduce, but I am at a complete loss. What would be an efficient way to go about this? How should the work be split among the nodes?

Thanks

Re: HUI-Miner MapReduce
Posted by: webmasterphilfv
Date: June 21, 2014 05:35AM

HUI-Miner is a depth-first search algorithm.

Each path of the depth-first search could be explored in parallel. So a simple way to divide the work is to assign different paths of the depth-first search to different nodes of the MapReduce cluster.

Actually, this could even be done without MapReduce, by using basic TCP/IP communication between the computers: each computer runs the same HUI-Miner but on different paths of the search space, and then all the results are merged.
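To make the splitting idea concrete, here is a minimal Java sketch of just the work-division step. It is not actual HUI-Miner code, and the class and method names are mine: the first level of the depth-first search has one branch per promising item, and those branches are assigned round-robin to the workers.

    import java.util.ArrayList;
    import java.util.List;

    public class SearchSpaceSplitter {

        // Round-robin assignment of first-level search branches to workers.
        // Each worker would then run HUI-Miner restricted to the branches
        // rooted at its assigned items, and the results would be unioned.
        public static List<List<String>> splitBranches(List<String> promisingItems,
                                                       int numWorkers) {
            List<List<String>> assignment = new ArrayList<>();
            for (int w = 0; w < numWorkers; w++) {
                assignment.add(new ArrayList<String>());
            }
            // Assumes the items are sorted by ascending TWU, as in HUI-Miner.
            // Round-robin keeps the expected load roughly balanced, because
            // consecutive items tend to root branches of similar size.
            for (int i = 0; i < promisingItems.size(); i++) {
                assignment.get(i % numWorkers).add(promisingItems.get(i));
            }
            return assignment;
        }
    }

Note that the branches are not all the same size, so in practice some dynamic load balancing (handing out branches from a shared queue as workers finish) would likely work better than a static split.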




Re: HUI-Miner MapReduce
Posted by: mwk
Date: June 21, 2014 11:29AM

This is what I found as well when looking at some parallel Eclat algorithms.
But would distributing the work while calculating the TWU, and also while creating the utility lists, bring a performance increase as well? If so, how can I efficiently parallelize these two database-scan steps?

The TCP/IP suggestion is nice, but I have to do it with MapReduce.


Thanks Phil

Re: HUI-Miner MapReduce
Posted by: webmasterphilfv
Date: June 21, 2014 02:37PM

The first two database scans are not costly compared to the mining phase that follows them: each scan only needs to read each line of the database once. So even if this step were done on a single node, I think it would not be a problem.

But the TWU computation can be parallelized. The TWU of an item is the sum of the utilities of the transactions where the item appears.

So, you could split the database into n smaller databases, then calculate the TWU of each item within each sub-database. Finally, one node could receive the partial TWUs computed from each sub-database and, for each item, sum them up. This gives the TWU of each item for the whole database.
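As an illustration, here is a minimal sketch of that computation as a single Hadoop MapReduce job in Java. I assume each input line has the format "items : transaction utility : item utilities" (e.g. "2 3 5:9:1 3 5"); adapt the parsing to your own file format. Hadoop's input splits play the role of the n smaller databases: the combiner computes the partial TWU of each item within a split, and the reducer sums the partial TWUs to obtain the TWU of each item for the whole database.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class TwuJob {

        // Map: for each transaction, emit (item, transaction utility)
        // for every item appearing in the transaction.
        public static class TwuMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                // Assumed line format: "item1 item2 ... itemN:TU:u1 u2 ... uN"
                String[] parts = value.toString().split(":");
                int transactionUtility = Integer.parseInt(parts[1].trim());
                for (String item : parts[0].trim().split("\\s+")) {
                    context.write(new Text(item), new IntWritable(transactionUtility));
                }
            }
        }

        // Reduce: sum the transaction utilities received for an item.
        // By definition, this sum is the TWU of that item.
        public static class TwuReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text item, Iterable<IntWritable> utilities,
                                  Context context)
                    throws IOException, InterruptedException {
                int twu = 0;
                for (IntWritable u : utilities) {
                    twu += u.get();
                }
                context.write(item, new IntWritable(twu));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "TWU computation");
            job.setJarByClass(TwuJob.class);
            job.setMapperClass(TwuMapper.class);
            job.setCombinerClass(TwuReducer.class); // partial TWU per input split
            job.setReducerClass(TwuReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The unpromising items can then be pruned by reading the job output and keeping only the items whose TWU reaches the minimum utility threshold, before the mining phase starts.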





