Reputation: 11
My task is to extract information from a various web-pages of a particular site. Now, the information to be extracted can be of the form as product name, product id, price, etc. The information is given in text using natural language. Also, I have been asked to extract that information using some Machine Learning algorithm. I thought of using NER (Named Entity Recognition) and training it on custom training data (which I can prepare using the scraped data and manually labeling the integers/data as required). I wanted to know if the model can even work this way?
Also, let me know if I can improve this question further.
Upvotes: 0
Views: 520
Reputation: 2554
You say a particular site. I am assuming that it means you have some fair idea of what the structures of webpages are, if the data is in table form or a free text form, how the website generally looks. In this case, a simple regex (prices, ids etc) supported by some POS tagger to extract product names and all is enough for you. A supervised approach is definitely an overkill and might underperform than the simpler regex.
Upvotes: 0