Reputation: 693
I need to extract common data from different websites. For example, I want to scrape 100 event websites and pull out the same fields from each: event name, price, location, etc. Every website has a different layout, so I'm writing scraping rules by hand. There are services like Diffbot that can extract this automatically, using some sort of AI/ML model. I was wondering whether this could be framed as a named entity recognition task, or whether an LSTM could be used.
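For context, here is roughly the kind of per-site rule I'm writing by hand (a sketch only; the CSS selectors are made up and have to be rewritten for every new layout):

```python
import requests
from bs4 import BeautifulSoup

# One hand-written rule set per site; the selectors below are
# hypothetical and differ for each website I scrape.
def scrape_site_a(url):
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    return {
        "name": soup.select_one("h1.event-title").get_text(strip=True),
        "price": soup.select_one("span.ticket-price").get_text(strip=True),
        "location": soup.select_one("div.venue-address").get_text(strip=True),
    }
```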
Upvotes: 0
Views: 656
Reputation: 455
To add to the previous response, don't forget to check whether the websites you scrape have an API, which could greatly reduce time spent coding and be more reliable if the websites change their layouts.
You have probably already checked, but it doesn't hurt to be reminded.
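For illustration, a minimal sketch of what consuming such an API might look like (the endpoint, parameters, and field names below are made up; each provider documents its own):

```python
import requests

# Hypothetical event API; substitute the real endpoint and field
# names from the provider's documentation.
resp = requests.get(
    "https://api.example-events.com/v1/events",
    params={"city": "Berlin"},
    timeout=10,
)
resp.raise_for_status()
for event in resp.json()["events"]:
    print(event["name"], event["price"], event["location"])
```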
Upvotes: 0
Reputation: 1688
Without more details on the structure/format of your targeted websites, it's difficult to go beyond a generic answer.
If the pages are mostly text based (i.e. natural text, not semi-structured with tables and the like), then this looks like a classic information extraction (IE) task over named entities. An LSTM is one architecture that could be used for it, like the models in spaCy. Many other classic NLP libraries such as StanfordNLP can also be of use (not always with deep learning).
How to make the choice? It will depend on the type of language in these pages. If it is mostly natural English, then DL models could do better. If it is domain jargon (a small dataset to learn from), you might need to investigate more grammar-based analysis.
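As a starting point, here is a minimal spaCy NER sketch (assuming the small English model is installed; the pretrained model only tags generic entity types, so a model trained on annotated event pages would be needed for production-quality extraction):

```python
import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = "Jazz Night at the Blue Note, New York, tickets from $25."
doc = nlp(text)

# The pretrained model tags generic OntoNotes entity types such as
# EVENT, GPE (locations), and MONEY; your target fields would map
# onto these labels.
for ent in doc.ents:
    print(ent.text, ent.label_)
```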
Upvotes: 1