towi_parallelism

Reputation: 1471

Python or JS-based REST API for Web scraping

I am trying to build a Python/JS Web Service through a REST API.

My scenario is as follows:

  1. User clicks on a button on my website
  2. My website sends an HTTP Request to the REST API
  3. Web scraping happens on the Server-side (using either Python or Node). The data on the third-party website is loaded dynamically.
  4. The results are sent back in JSON format to my website to be shown to the user
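A minimal sketch of steps 2–4, using only the Python standard library so it stays hosting-agnostic. The `scrape` helper here is a hypothetical placeholder for the real browser-driven scraping step; in a real deployment you would likely use a framework such as Flask or Express instead of the raw `http.server` module:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen


def scrape(url):
    # Hypothetical placeholder: the real implementation would drive a
    # headless browser against `url` and extract the dynamic content.
    return {"url": url, "items": ["example"]}


class ScrapeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Steps 2-4: receive the HTTP request, scrape, reply with JSON.
        body = json.dumps(scrape("https://example.com")).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


if __name__ == "__main__":
    # Bind to port 0 so the OS picks a free port for the demo.
    server = HTTPServer(("127.0.0.1", 0), ScrapeHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    port = server.server_address[1]
    with urlopen("http://127.0.0.1:%d/scrape" % port) as resp:
        print(json.loads(resp.read())["items"])  # prints ['example']
    server.shutdown()
```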

I checked a number of Python hosting services, but I cannot tell whether they support Selenium. The same goes for the JS libraries and Node.js hosting services.

Basically, I'm confused. What should I use for my project to scrape dynamic data? Python with Selenium? Node.js with PhantomJS and Cheerio?

Upvotes: 2

Views: 1058

Answers (1)

Ahmed Magdy

Reputation: 172

Neither Selenium (alone) nor Cheerio will give you the ability to load data dynamically from a third-party website.

The answer you're searching for is PhantomJS. PhantomJS lets you load the third-party website's data dynamically and interact with the page using JavaScript: you can do things such as scrolling down to request more data, and start scraping once the new content has been added.
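For illustration, the scroll-until-loaded pattern described above can be sketched in Python against any Selenium-style driver object (the function name and parameters are my own, not from a specific library; the same `execute_script` calls work with a PhantomJS driver or headless Chrome):

```python
import time


def scroll_until_stable(driver, pause=1.0, max_rounds=20):
    """Scroll to the bottom repeatedly until the page height stops growing.

    `driver` is any Selenium-style WebDriver exposing execute_script(),
    e.g. one created with webdriver.PhantomJS() or a headless browser.
    Returns the final page height; scraping can safely start afterwards.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        # Trigger the site's infinite-scroll loading.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to fetch more content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # no new content was loaded; we are done scrolling
        last_height = new_height
    return last_height
```

With a real driver this would be used as: `driver.get(url)`, then `scroll_until_stable(driver)`, then parse `driver.page_source`.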

I worked on a similar project myself. I scraped data from Facebook while interacting with the page through JavaScript, scraping the data only after a series of interactions had loaded everything I needed, then saved it all in XML files to be stored later in an OrientDB database. In that project I used Selenium along with the PhantomJS driver. PhantomJS is itself a Node.js framework, but I used Python because the project was expected to grow larger and include more data-science work.

In your case, if the scenario is simply to scrape the data and return it to a remote host/client, then I recommend Node + PhantomJS.

Upvotes: 1
