Reputation: 12785
I have a collection of consumer product manuals (about 100,000 .pdf files) scraped from the web. Now I want to categorize the files by manufacturer/brand and by the product category each one belongs to.
For example:
Samsung -> Monitors -> [ files list ]
Samsung -> Mobile Phones -> [ files list ]
etc ...
What I have done so far:
...
('3Com', 'CD')
('Corporation', 'NNP')
('reserves', 'NNS')
('the', 'DT')
('right', 'NN')
('to', 'TO')
('revise', 'VB')
('this', 'DT')
('documentation', 'NN')
('and', 'CC')
('to', 'TO')
('make', 'VB')
('changes', 'NNS')
('in', 'IN')
('content', 'NN')
('from', 'IN')
...
The problem I face now:
How can I match the tokens against my brand/category lists?
I have never had a chance to work with NLP before, and I am still trying to wrap my brain around this.
Upvotes: 2
Views: 1617
Reputation: 41
I would suggest a hybrid approach. Use a POS tagger to find the proper nouns (tagged NNP), then look them up in a dictionary of company names.
This saves you from looking up determiners and other unlikely words. It should also increase precision by reducing false positives, for example where someone uses a company name as a verb (xerox, google). On the downside, it might reduce recall by increasing false negatives: a company name that gets mistagged is never looked up in your dictionary.
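A minimal sketch of that lookup, assuming the tagger output is already available as (token, tag) pairs like the ones in the question. The function name, the sample tokens, and the brand list are made up for illustration. It also shows the recall problem: '3Com' was tagged CD in the question's output, so it is never looked up.

```python
def brands_from_tagged(tagged_tokens, known_brands):
    """Return the known brands that occur as proper-noun (NNP/NNPS) tokens."""
    # Case-insensitive lookup table: lowercased name -> canonical name
    lookup = {brand.lower(): brand for brand in known_brands}
    found = []
    for token, tag in tagged_tokens:
        if tag not in ("NNP", "NNPS"):
            continue  # skip determiners, verbs, numbers, ...
        brand = lookup.get(token.lower())
        if brand is not None and brand not in found:
            found.append(brand)
    return found


# Illustrative input only; the first three pairs mirror the question's output
tagged = [("3Com", "CD"), ("Corporation", "NNP"), ("reserves", "NNS"), ("Samsung", "NNP")]
print(brands_from_tagged(tagged, ["3Com", "Samsung"]))  # ['Samsung'], 3Com is missed
```

This only handles single-token brand names; multi-word names would need adjacent NNP tokens to be joined before the lookup.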
Upvotes: 0
Reputation: 3070
I am not sure this is an NLP issue. Here is how I would do it:
import itertools

brand_names = ['Samsung', 'Lenovo']  # ... plus the rest of your brands
category_names = ['Monitors', 'Mobile Phones']  # ... plus the rest of your categories
pdf_string = read_my_pdf('theproduct.pdf')  # read_my_pdf is your own PDF-to-text helper
pdf_string_lowered = pdf_string.lower()
# Keep the names that occur in the text; everything is lowercased to account for case differences
brand_names_in_pdf = [brand for brand in brand_names if brand.lower() in pdf_string_lowered]
category_names_in_pdf = [category for category in category_names if category.lower() in pdf_string_lowered]
# Get the (brand, category) tuples found in this PDF
tags = list(itertools.product(brand_names_in_pdf, category_names_in_pdf))
This may seem very simple, but I think it will work better than any NLP tool you would be using. How would a tool know that a specific model number is that of a mobile phone? And words related to mobile phones may well appear in a PDF about something else. I think an exhaustive search is more robust.
The only real drawback of this method is variation in the words you are looking for. I think a solution to this would be to use regular expressions instead of exact strings. For instance, you could accept 'Mobile Phones' or 'Mobile Phone', and file both under 'Mobile Phones'.
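A rough sketch of the regular-expression idea, using Python's re module. The patterns and the function name are only examples; you would write one pattern per category in your own list.

```python
import re

# One pattern per category; the optional "s" accepts singular and plural forms
CATEGORY_PATTERNS = {
    "Mobile Phones": re.compile(r"\bmobile\s+phones?\b", re.IGNORECASE),
    "Monitors": re.compile(r"\bmonitors?\b", re.IGNORECASE),
}


def categories_in_text(text):
    """Return the canonical category names whose pattern occurs in the text."""
    return [name for name, pattern in CATEGORY_PATTERNS.items() if pattern.search(text)]


print(categories_in_text("Please read this manual before using your mobile phone."))
# ['Mobile Phones']
```

The \b word boundaries keep a pattern from matching inside a longer word, and \s+ tolerates line breaks or extra spaces that PDF text extraction often introduces.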
Upvotes: 1