Alexander

Reputation: 12785

Extract brand and product category from consumer product manuals

I have a collection of consumer product manuals (about 100,000 .pdf files) scraped from the web. Now I want to categorize the files by manufacturer/brand and by the product category each one belongs to.
For example:

Samsung -> Monitors -> [ files list ]
Samsung -> Mobile Phones -> [ files list ]
etc.

What I have done so far:

The problem I face now:

How can I match the tokens against my brand/category lists?
I have never had a chance to work with NLP before, and I am still trying to wrap my head around it.

Upvotes: 2

Views: 1617

Answers (2)

dwatson

Reputation: 41

I would suggest a hybrid approach. Use a POS tagger to find proper nouns (tagged NNP), then look them up in a dictionary of company names.

This saves you from looking up determiners and other unlikely words. It should increase precision by reducing false positives in cases where someone uses a company name as a verb (xerox, google, for example). On the downside, it might reduce recall by increasing false negatives: a company name that gets mis-tagged is never looked up in your dictionary.
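To make the idea concrete, here is a minimal sketch. It does not use a real POS tagger: `candidate_proper_nouns` is a crude capitalisation-based stand-in that I made up for illustration, and `KNOWN_BRANDS` is a hypothetical dictionary. In practice you would replace the stand-in with the NNP output of a real tagger, for example NLTK's `pos_tag`.

```python
import re

# Hypothetical brand dictionary, stored lowercased for lookup.
KNOWN_BRANDS = {"samsung", "lenovo", "xerox"}


def candidate_proper_nouns(text):
    """Crude stand-in for a POS tagger: capitalised words not at sentence start."""
    candidates = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        words = re.findall(r"[A-Za-z][A-Za-z0-9]*", sentence)
        # Skip the first word, which is capitalised regardless of its tag.
        candidates.extend(w for w in words[1:] if w[0].isupper())
    return candidates


def find_brands(text):
    """Look up only the proper-noun candidates in the brand dictionary."""
    found = {w.lower() for w in candidate_proper_nouns(text)}
    return sorted(found & KNOWN_BRANDS)
```

Called on "Thank you for buying this Samsung monitor. Please xerox the warranty card.", this finds the brand but ignores the lowercase verb "xerox". It also shows the recall cost: a brand that opens a sentence is skipped by this stand-in, much as a mis-tagged name would be skipped by a real tagger.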

Upvotes: 0

Flavian Hautbois

Reputation: 3070

I am not sure this is an NLP issue. Here is how I would do it:

import itertools

brand_names = ['Samsung', 'Lenovo']  # ... and so on
category_names = ['Monitors', 'Mobile Phones']  # ... and so on

# read_my_pdf is a placeholder for whatever PDF-to-text routine you use
pdf_string = read_my_pdf('theproduct.pdf')
pdf_string_lowered = pdf_string.lower()

# Everything is lowered to account for case differences.
# Keep the names that occur in the text, not just True/False flags.
brand_names_in_pdf = [brand for brand in brand_names if brand.lower() in pdf_string_lowered]
category_names_in_pdf = [category for category in category_names if category.lower() in pdf_string_lowered]

# Get the (brand, category) tuples
tags = list(itertools.product(brand_names_in_pdf, category_names_in_pdf))

This may seem very simple, but I think it will work better than any NLP tool you might use. How would such a tool know that a specific model number belongs to a mobile phone? And words related to mobile phones may well appear in a PDF about something else. I think an exhaustive search is more robust.

The only real drawback of this method is its sensitivity to variations in the words you are looking for. I think one solution would be to use regular expressions instead of fixed tokens. For instance, you could accept 'Mobile Phones' or 'Mobile Phone' and file both under 'Mobile Phones'.
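A minimal sketch of that regular-expression variant, assuming one hand-written pattern per category. The `CATEGORY_PATTERNS` table and the `find_categories` helper are names I made up for illustration, and the patterns only cover simple singular/plural variation:

```python
import re

# Hypothetical variant patterns; each maps to one canonical category name.
CATEGORY_PATTERNS = {
    "Mobile Phones": re.compile(r"\bmobile\s+phones?\b", re.IGNORECASE),
    "Monitors": re.compile(r"\bmonitors?\b", re.IGNORECASE),
}


def find_categories(text):
    """Return the canonical categories whose pattern occurs in text."""
    return [name for name, pattern in CATEGORY_PATTERNS.items()
            if pattern.search(text)]


print(find_categories("This mobile phone ships with a charger."))
```

The word boundaries (`\b`) keep a pattern from matching inside a longer word, and `\s+` tolerates the odd line break or double space that PDF text extraction tends to produce between the two words.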

Upvotes: 1
