The Data Mining Forum
IMPORTANT: This is the old Data Mining forum.
I keep it online so that you can read the old messages.

Please post your new messages in the new forum: https://forum2.philippe-fournier-viger.com/index.php
 
SPAM vs PrefixSpan
Posted by: Dvijesh88
Date: April 28, 2012 10:08AM

I just want to know which algorithm is better. Please tell me all the points.
Where should these algorithms be used?

Re: SPAM vs PrefixSpan
Date: April 28, 2012 02:55PM

Hello Dvijesh,

Here are my quick observations about SPAM and PrefixSpan. Maybe I forgot some elements.

What is good about SPAM:
- it uses a bitmap representation, which is memory efficient. There are several optimizations that are possible with bitmaps.
- it uses a vertical representation of the database, so the database only needs to be scanned once to create the vertical representation.
- it is very fast to calculate the intersection of two sid sets (sets of sequence ids) by doing a kind of logical AND with two bitmaps.
- ...
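
As a rough illustration of the bitmap idea above, here is a minimal Python sketch. It is not code from any actual SPAM implementation: the toy `database`, `build_sid_bitmaps` and `support` are hypothetical names, and it keeps only one bit per sequence. The real SPAM bitmaps have one bit per itemset position inside each sequence so that the order of items can also be checked; this simplified version only shows how a logical AND intersects two sid sets.

```python
# Hypothetical toy database: sequence id -> sequence of items
# (assumption: one item per itemset, to keep the sketch short).
database = {
    0: ["a", "b", "c"],
    1: ["a", "c"],
    2: ["b", "c"],
    3: ["a", "b"],
}


def build_sid_bitmaps(db):
    """Scan the database once; bit i of bitmaps[item] is set
    if the sequence with id i contains the item."""
    bitmaps = {}
    for sid, sequence in db.items():
        for item in sequence:
            bitmaps[item] = bitmaps.get(item, 0) | (1 << sid)
    return bitmaps


def support(bitmap):
    """Number of set bits, i.e. number of sequences in the sid set."""
    return bin(bitmap).count("1")


bitmaps = build_sid_bitmaps(database)

# Intersection of the sid sets of "a" and "b" with a single logical AND.
both = bitmaps["a"] & bitmaps["b"]
print(support(both))  # number of sequences containing both "a" and "b"
```

Here a Python integer plays the role of the bit vector. Note that this intersection ignores the order of "a" and "b" inside each sequence.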

The weak points of SPAM:
- if the sequences are very long, the memory usage will go up because each bitmap takes more memory space. If there are a lot of frequent items, more bitmaps will need to be stored in memory.
- SPAM generates candidates and then computes the sid set of each candidate to calculate its support. It may generate many candidates that are not frequent, therefore wasting time. It can also generate candidates that do not appear in the database.
- it is more difficult to extend SPAM with additional constraints than PrefixSpan
- the database has to be stored in memory
- ...
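
To make the point about wasted candidates concrete, here is a small Python sketch under simplifying assumptions. It is not SPAM itself: it naively pairs every item with every item to form length-2 candidates and counts support with a subsequence test instead of bitmaps, and the actual algorithm prunes part of these extensions. The toy `db` and the helper names are hypothetical.

```python
from itertools import product

# Hypothetical toy database (assumption: one item per itemset).
db = [["a", "b", "c"], ["a", "c"], ["b", "c"], ["a", "b"]]


def is_subsequence(pattern, seq):
    """True if the items of pattern occur in seq in the same order."""
    it = iter(seq)
    return all(item in it for item in pattern)


def support(pattern, db):
    """Number of sequences that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in db)


items = sorted({item for seq in db for item in seq})

# Generate-and-test: every ordered pair of items is a candidate.
candidates = list(product(items, repeat=2))
supports = {cand: support(cand, db) for cand in candidates}

# Candidates that do not appear in any sequence of the database.
absent = [cand for cand, sup in supports.items() if sup == 0]
print(len(candidates), len(absent))
```

On this toy database most of the generated candidates have a support of zero, which is the kind of wasted work described above.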

What is good about PrefixSpan:
- it uses a pattern-growth approach. It will not generate candidates that do not appear in the database.
- it includes some optimizations like pseudo-projection, etc.
- it is easy to extend PrefixSpan with additional constraints, for mining closed patterns, etc.
- ...
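
The pattern-growth idea with pseudo-projection can be sketched in a few lines of Python. This is only a simplified illustration, assuming sequences with a single item per itemset; the function name `prefixspan` and the toy `db` are hypothetical and this is not the code of any published implementation. A projected database is represented as (sequence index, start position) pairs instead of copies of the suffixes, which is the idea behind pseudo-projection.

```python
from collections import defaultdict


def prefixspan(db, minsup):
    """Return {pattern (tuple of items): support} for a list of sequences."""
    results = {}

    def grow(prefix, projected):
        # projected: list of (sequence index, position where the suffix starts).
        # For each item, remember its first occurrence in each suffix.
        first_pos = defaultdict(dict)
        for sid, start in projected:
            seq = db[sid]
            for pos in range(start, len(seq)):
                first_pos[seq[pos]].setdefault(sid, pos)
        # Only items that really occur in the suffixes are considered.
        for item, occurrences in first_pos.items():
            if len(occurrences) >= minsup:
                pattern = prefix + (item,)
                results[pattern] = len(occurrences)
                # Project again: the new suffix starts right after the item.
                grow(pattern, [(sid, pos + 1) for sid, pos in occurrences.items()])

    grow((), [(sid, 0) for sid in range(len(db))])
    return results


db = [["a", "b", "c"], ["a", "c"], ["b", "c"], ["a", "b"]]
patterns = prefixspan(db, minsup=2)
for pattern, sup in sorted(patterns.items()):
    print(pattern, sup)
```

Because extensions are collected from the suffixes themselves, a pattern such as ("b", "a"), which never occurs in this toy database, is never even considered.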

What is weak about PrefixSpan:
- unlike SPAM, it does not use a vertical representation. So it may need to scan the database several times (even if it uses pseudo-projection).
- the database has to be stored in memory (could be stored on disk, but would be slower)
- ...



Edited 7 time(s). Last edit at 04/28/2012 04:27PM by webmasterphilfv.

Re: SPAM vs PrefixSpan
Posted by: Dvijesh88
Date: April 29, 2012 06:27AM

Hello sir,
You are right about PrefixSpan, as far as I know.

Yes, both problems have been solved by one algorithm; at least its authors claim that it overcomes both of the problems mentioned here. I sent you that paper.

So sir, can't we represent the dataset in a vertical format and apply PrefixSpan? Yes, we would have to make some changes to the algorithm...
This is just a question and an idea. I may be wrong.

Re: SPAM vs PrefixSpan
Date: May 01, 2012 09:15AM

Hi Dvijesh,

In my opinion, it would not make sense to use only a vertical database with PrefixSpan, because PrefixSpan needs to scan the sequences, which cannot be done efficiently with a vertical database. Vertical databases are better suited to candidate generation, as in ECLAT, SPAM or AprioriTID.

But perhaps it would be possible to do something like FPGrowth. FPGrowth uses a horizontal database stored in an FPTree. But it also has a header table that makes it possible to find all the transactions that contain an item. This header table could be considered a kind of vertical database.

Perhaps something like that could be done, or has already been done, for PrefixSpan. But I'm not sure how this information would be useful... or if it could be used to make the algorithm faster. I did not think a lot about this. :-P
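
To make the idea a bit more concrete, here is a hypothetical Python sketch of a horizontal database kept together with an item index, loosely in the spirit of a header table. It is only an assumption about how such an index could be used; `build_item_index` and the toy `db` are made-up names, and nothing here says whether this would actually make PrefixSpan faster.

```python
from collections import defaultdict

# Horizontal database: each entry is a sequence (assumption: one item per itemset).
db = [["a", "b", "c"], ["a", "c"], ["b", "c"], ["a", "b"]]


def build_item_index(db):
    """Map each item to the ids of the sequences that contain it."""
    index = defaultdict(list)
    for sid, seq in enumerate(db):
        for item in set(seq):
            index[item].append(sid)
    return index


index = build_item_index(db)

# To project the database on "a", only the sequences listed in index["a"]
# are visited; the sequences themselves are still read horizontally.
suffixes = [db[sid][db[sid].index("a") + 1:] for sid in index["a"]]
print(index["a"], suffixes)
```

The index answers the question of which sequences contain an item, while the horizontal sequences are still needed to read what comes after that item.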

Best,

Philippe

Re: SPAM vs PrefixSpan
Posted by: tisonet
Date: May 01, 2012 11:55AM

Dvijesh88 Wrote:
-------------------------------------------------------
... i sent you that paper.

Can I ask for that paper?

I think that a solution for avoiding database scans is to use the ITEM_IS_EXIST_TABLE, which was defined in the LAPIN algorithm.

Re: SPAM vs PrefixSpan
Posted by: Dvijesh88
Date: May 01, 2012 11:16PM

Yes sir,
you can ask.
Give me your email address and I will send it to you.

Re: SPAM vs PrefixSpan
Posted by: Dvijesh88
Date: June 29, 2012 07:48AM

Why does SPAM take more time to run compared to PrefixSpan?

Re: SPAM vs PrefixSpan
Date: June 29, 2012 11:09AM

I think that it may depend on the dataset.

But, in my opinion, PrefixSpan is a better algorithm because it uses a pattern-growth approach. It only generates sequential patterns that are in the database.

On the other hand, SPAM can generate a lot of candidates that do not appear in the database. But SPAM can still be fast because it uses bit vectors, which make it very efficient to calculate the support of a pattern. However, the advantage of using bit vectors is probably not enough to overcome the disadvantage of generating too many candidates.

That is my opinion.

Besides that, it is possible that I have not done all the optimizations in my implementation. But still, I don't think that it would change the performance of SPAM much compared to PrefixSpan, because SPAM generates candidates.

This is my opinion.

Best,

Philippe

Re: SPAM vs PrefixSpan
Posted by: arina
Date: June 26, 2013 12:10PM

My name is Arina, from Indonesia. I am doing research for my graduate program in computer science.

I am using library data, where the data is book and thesis data covering 2 years (2011-2012) at the IPB library. Which algorithm is suitable to use, PrefixSpan or SPAM?

please help

best regards
arina......

Re: SPAM vs PrefixSpan
Date: June 26, 2013 04:40PM

Hi,

I understand that your data is a set of text documents.

Sequential pattern mining algorithms like SPAM and PrefixSpan have the same input and produce the same output. You could use any of them and the result will be the same. The difference between these two algorithms is how they work internally.

Algorithms like SPAM and PrefixSpan will find sequential patterns occurring in a set of sequences. For text files, you could consider that each sentence is a sequence, for example. Then, if you apply these algorithms, you could find some sequences of words that appear in several sentences. If this is useful for you, then you could apply one of these algorithms.
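
As a small illustration of treating each sentence as a sequence, here is a hedged Python sketch. The sample `text`, the very simple sentence splitting on periods, and the helper `contains` are all assumptions made for the example; a real preprocessing step would need a proper tokenizer, and a mining algorithm would discover the patterns instead of checking one chosen by hand.

```python
import re

# Made-up sample text (assumption: sentences end with a period).
text = "Data mining is fun. Sequential pattern mining is useful. Mining text is fun."

# One sequence of lowercase words per sentence.
sequences = [re.findall(r"[a-z]+", s.lower()) for s in text.split(".") if s.strip()]


def contains(pattern, seq):
    """True if the words of pattern occur in seq in the same order,
    possibly with other words in between."""
    it = iter(seq)
    return all(word in it for word in pattern)


pattern = ["mining", "is"]
count = sum(contains(pattern, seq) for seq in sequences)
print(pattern, "appears in", count, "of", len(sequences), "sentences")
```

The list `sequences` is the kind of input that either SPAM or PrefixSpan would take, and `count` is the support of one word sequence.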

Otherwise, there are other algorithms that you could consider. But it depends on what you want to do. You told me that you have text files, but you did not tell me what you want to do... what your goal is.

Best,

Philippe

Re: SPAM vs PrefixSpan
Posted by: Arina
Date: June 26, 2013 07:30PM

(Translated from Indonesian.)

Thank you for replying, sir.

I forgot to tell you the goal of my research. I apologize.

I am using library data to find transaction patterns. It is similar to 'market basket analysis' on sales transactions, except that here the transactions are loans of books or theses.

I want to provide 'borrowing priorities' to library members. For example, if someone borrows a book about 'data mining', the system will suggest several books to borrow that are related to the book borrowed previously. These priorities are of course derived from the borrowing patterns of library members, which are recorded by the librarians every day.

In addition, these rules will make it easier for librarians to place books on the appropriate shelves.

That is roughly the goal, sir.

Please help, and thank you very much.

best regards
arina

Re: SPAM vs PrefixSpan
Date: June 27, 2013 04:25AM

I cannot understand what you wrote. I only speak English, French and a little bit of Chinese.

Philippe



This forum is powered by Phorum and provided by P. Fournier-Viger (© 2012).