Reputation: 12785

Extract brand and product category from consumer product manuals

I have a collection of consumer product manuals (about 100,000 .pdf files) scraped from the web. Now I want to categorize the files by manufacturer/brand and by the product category each one belongs to.
For example:

Samsung -> Monitors -> [ files list ]
Samsung -> Mobile Phones -> [ files list ]
etc.

What I have done so far:

Built a list of brands/manufacturers and a list of categories.
Extracted all the text from the PDF files using pyPdf.
Tokenized and POS-tagged the text with NLTK.
- The output looks like this: ... ('3Com', 'CD') ('Corporation', 'NNP') ('reserves', 'NNS') ('the', 'DT') ('right', 'NN') ('to', 'TO') ('revise', 'VB') ('this', 'DT') ('documentation', 'NN') ('and', 'CC') ('to', 'TO') ('make', 'VB') ('changes', 'NNS') ('in', 'IN') ('content', 'NN') ('from', 'IN') ...

The problem I face now:

How can I match the tokens against my brand/category lists?
I have never had a chance to work with NLP before, and I am still trying to wrap my head around it.

Upvotes: 2

Answers (2)

dwatson

Reputation: 41

I would suggest a hybrid approach. Use a POS tagger to find NNP (proper noun) tokens, then look them up in a company name dictionary.

This saves you from looking up determiners and other unlikely words. It should increase precision by reducing false positives, for example where someone uses a company name as a verb (xerox, google). On the downside, it might reduce recall by increasing false negatives: a company name that gets mistagged is never looked up in your dictionary.
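
A minimal sketch of that lookup, assuming the text is already tagged as (word, tag) pairs like the output in the question. The names `KNOWN_BRANDS` and `find_brands` and the sample data are made up for illustration, and only single-word brand names are handled:

```python
from collections import Counter

# Hypothetical sample dictionary; replace with your full brand list (lowercased).
KNOWN_BRANDS = {"samsung", "lenovo", "3com"}

def find_brands(tagged_tokens, brands=KNOWN_BRANDS, tags=("NNP", "NNPS")):
    """Count known brands among tokens that carry a proper-noun tag."""
    counts = Counter()
    for word, tag in tagged_tokens:
        key = word.lower()
        # Only proper nouns are looked up; everything else is skipped.
        if tag in tags and key in brands:
            counts[key] += 1
    return counts

# Illustrative input in the same shape as the NLTK output in the question.
tagged = [("Samsung", "NNP"), ("reserves", "NNS"), ("the", "DT"),
          ("right", "NN"), ("3Com", "CD"), ("Samsung", "NNP")]
brand_counts = find_brands(tagged)
print(brand_counts.most_common(1))  # [('samsung', 2)]
```

Note that '3Com' is missed here because it was tagged CD rather than NNP, which is exactly the recall problem described above. Widening `tags` would trade some precision back for recall.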

Upvotes: 0

Flavian Hautbois

Reputation: 3070

I am not sure this is an NLP issue. Here is how I would do it:

import itertools

brand_names = ['Samsung', 'Lenovo']  # ... and the rest of your brand list
category_names = ['Monitors', 'Mobile Phones']  # ... and the rest of your categories

# read_my_pdf is a placeholder for your own text extraction (e.g. with pyPdf)
pdf_string = read_my_pdf('theproduct.pdf')
pdf_string_lowered = pdf_string.lower()  # Everything is lowered to account for case differences

# Keep the names that occur in the text (not just True/False flags)
brand_names_in_pdf = [brand for brand in brand_names if brand.lower() in pdf_string_lowered]
category_names_in_pdf = [category for category in category_names if category.lower() in pdf_string_lowered]

# Get the (brand, category) tuples
tags = list(itertools.product(brand_names_in_pdf, category_names_in_pdf))

This may seem very simple, but I think it will work better than any NLP tool you might use. (How would a tool know that a specific model number is that of a mobile phone? And words related to mobile phones may well appear in a PDF about something else.) I think an exhaustive search is more robust.

The only real drawback of this method is related to variations in the words you are looking for. I think a solution to this would be to use regular expressions instead of tokens. For instance, you could accept 'Mobile Phones' or 'Mobile Phone', and categorize both as 'Mobile Phones'.
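
As a rough sketch of the regular-expression idea, each canonical category can be mapped to one pattern that covers its variants. `CATEGORY_PATTERNS` and `find_categories` are hypothetical names, and the two patterns are only examples:

```python
import re

# Hypothetical mapping: canonical category -> pattern covering its variants.
CATEGORY_PATTERNS = {
    "Mobile Phones": re.compile(r"\bmobile\s+phones?\b", re.IGNORECASE),
    "Monitors": re.compile(r"\bmonitors?\b", re.IGNORECASE),
}

def find_categories(text, patterns=CATEGORY_PATTERNS):
    """Return the canonical categories whose pattern occurs in the text."""
    return [name for name, pattern in patterns.items() if pattern.search(text)]

sample = "This mobile phone ships with a charger."
print(find_categories(sample))  # ['Mobile Phones']
```

The word boundaries (`\b`) avoid matching inside longer words, and `\s+` tolerates line breaks or extra spaces left over from PDF text extraction.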

Upvotes: 1
