Re: Any idea or suggestion for my data mining project?
Posted by:
Ju PENG
Date: April 22, 2013 06:37AM
Hi, Philippe
Thanks a lot for your answer!
I thought about using the K-Nearest Neighbor algorithm. It could be a solution, but there is a problem: the 2 million invoices involve more than 6000 providers. That is to say, I would have to build 6000 classes, and each class needs several instances (with 5 per class, that is 30000 instances in the training set), so the training set seems too large. For each new instance, should I compute the similarity with all the instances in the training set? I think that is too much. Also, some providers do not have the key number, so we cannot build a category for them; if we used KNN, they would be assigned to the wrong class.
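To make the question concrete, here is a minimal sketch of brute-force KNN as described above: one similarity computation per training instance for every new invoice. Everything in it is an assumption for illustration, not my real setup: the character-trigram Jaccard similarity, the function names, the sample provider names, and the `min_similarity` cutoff. The cutoff is one possible way to handle providers that have no category: when no neighbor is similar enough, the function returns `None` instead of forcing a wrong class.

```python
from collections import Counter

def trigrams(text):
    """Return the set of character trigrams of a lowercased, padded string."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def jaccard(a, b):
    """Jaccard similarity between two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def knn_classify(query, training, k=3, min_similarity=0.5):
    """Classify `query` by majority vote among its k most similar instances.

    `training` is a list of (text, provider_label) pairs (hypothetical format).
    Neighbors below `min_similarity` are ignored; if none remain, return None
    so that an unknown provider is rejected rather than misclassified.
    """
    q = trigrams(query)
    # Brute force: one similarity computation per training instance.
    scored = sorted(
        ((jaccard(q, trigrams(text)), label) for text, label in training),
        reverse=True,
    )
    votes = Counter(label for score, label in scored[:k] if score >= min_similarity)
    if not votes:
        return None
    return votes.most_common(1)[0][0]

# Invented sample data, only to show the call shape.
training = [
    ("ACME Supplies Ltd", "acme"),
    ("ACME Suplies Ltd", "acme"),
    ("Globex Corporation", "globex"),
    ("Globex Corp.", "globex"),
]
print(knn_classify("ACME Suppl1es Ltd", training))
print(knn_classify("Zzyzx Qwerty", training))
```

With 30000 training instances, the loop inside `knn_classify` runs 30000 times per invoice, which is exactly the cost I am worried about.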
You are right: because the data comes from OCR, some words are malformed, with letters missing or replaced. I ran into this problem when I compared the email and address fields.
I am really confused by this problem. If the words were written correctly, there would be fewer problems.
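As an illustration of the OCR problem, an exact comparison of two fields fails as soon as one letter is misread, while an approximate comparison can still match them. The sketch below uses `difflib.SequenceMatcher` from the Python standard library; the function names, the 0.85 threshold, and the sample email addresses are assumptions chosen for the example, not values I have validated on the invoices.

```python
from difflib import SequenceMatcher

def ocr_similarity(a, b):
    """Return a similarity ratio in [0, 1] between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def same_field(a, b, threshold=0.85):
    """Treat two field values as the same if they are similar enough.

    `threshold` is an assumed cutoff and would need tuning on real data.
    """
    return ocr_similarity(a, b) >= threshold

# Invented example: OCR read the "m" in "acme" as "rn".
clean = "contact@acme-supplies.com"
scanned = "contact@acrne-supplies.com"

print(clean == scanned)            # exact comparison fails
print(same_field(clean, scanned))  # approximate comparison tolerates the error
```

The same idea could apply to the address field, although a threshold that is too low would start merging different providers.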
Thanks for your help. I will think about the other algorithms you mentioned.
Ju