Reputation: 4429
Let's say I want to scrape this page: https://twitter.com/nfl
from bs4 import BeautifulSoup
import requests

page = 'https://twitter.com/nfl'
r = requests.get(page)
soup = BeautifulSoup(r.text, 'html.parser')  # specify a parser explicitly
print(soup)
The more I scroll down the page, the more results show up, but the request above only gives me the initial load. How do I get all the information on the page, as if I had manually scrolled down?
Upvotes: 0
Views: 3586
Reputation: 1
For dynamically generated content, the data is usually returned as JSON. Inspect the page (open the developer tools, go to the Network tab) and find the request that delivers the data on the fly. For example: the page https://techolution.app.param.ai/jobs/ is generated dynamically, and for it I found this link - https://techolution.app.param.ai/api/career/get_job/?query=&locations=&category=&job_types=
After that, the web scraping becomes a bit easier. I have done it in Python using Anaconda Navigator; here is the GitHub link - https://github.com/piperaprince01/Webscraping_python/blob/master/WebScraping.ipynb
If you can make any changes to improve it, feel free to do so. Thank you.
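The approach above can be sketched as follows. The `data` and `title` field names are assumptions chosen for illustration, not the actual schema of that API - check the real response in the Network tab before relying on them.

```python
import json

def parse_jobs(payload):
    """Pull job titles out of a JSON API response.

    The 'data' and 'title' keys are assumed field names for
    illustration; inspect the real endpoint to find the actual ones.
    """
    return [job["title"] for job in payload.get("data", [])]

def fetch_jobs(url):
    """Fetch the JSON endpoint discovered in the Network tab."""
    import requests  # imported lazily; parse_jobs itself needs no dependencies
    return parse_jobs(requests.get(url).json())

# Example with a response shaped like the assumed schema:
sample = json.loads('{"data": [{"title": "Data Engineer"}, {"title": "QA Analyst"}]}')
print(parse_jobs(sample))  # ['Data Engineer', 'QA Analyst']
```

Because the heavy lifting is in `parse_jobs`, you can test the parsing logic without making any network calls.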
Upvotes: 0
Reputation: 473763
A better solution is to use the Twitter API.
There are several Python Twitter API clients, for example:
Upvotes: 1
Reputation: 39355
First parse the data-max-id="451819302057164799" value from the HTML source.
Then, using the id 451819302057164799, construct a URL like this:
https://twitter.com/i/profiles/show/nfl/timeline?include_available_features=1&include_entities=1&max_id=451819302057164799
Now fetch that URL and parse the response using simplejson or any other JSON library.
Remember, the next page load (what you get when you scroll down) is available from the "max_id":"451369755908530175" value in that JSON.
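The steps above can be sketched as follows. The URL pattern and the `max_id` key come from the answer itself; the `items_html` field name and the exact response shape are assumptions for illustration, and this internal endpoint may no longer behave the same way.

```python
import json

TIMELINE = ("https://twitter.com/i/profiles/show/{user}/timeline"
            "?include_available_features=1&include_entities=1&max_id={max_id}")

def timeline_url(user, max_id):
    """Build the timeline URL for one 'page' of scroll results."""
    return TIMELINE.format(user=user, max_id=max_id)

def next_max_id(payload):
    """Read the max_id for the next page out of the JSON response."""
    return payload.get("max_id")

# One page of the response looks roughly like this (simplified; the
# 'items_html' key is an assumption):
page = json.loads('{"max_id": "451369755908530175", "items_html": "..."}')
next_page = timeline_url("nfl", next_max_id(page))

def fetch_all(user, first_max_id, pages=5):
    """Follow max_id from page to page, like repeated scrolling.

    Sketch only: needs network access to run for real.
    """
    import requests  # lazy import
    max_id, results = first_max_id, []
    for _ in range(pages):
        data = requests.get(timeline_url(user, max_id)).json()
        results.append(data.get("items_html", ""))
        max_id = next_max_id(data)
        if not max_id:  # no further pages
            break
    return results
```

Each iteration plays the role of one scroll: fetch a page, keep its HTML fragment, and carry the new max_id forward.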
Upvotes: 4
Reputation: 742
If the content is added dynamically with JavaScript, your best chance is to use Selenium to control a headless browser like PhantomJS: use the Selenium WebDriver to simulate scrolling down, wait for the new content to load, and only then extract the HTML and feed it to your BeautifulSoup parser.
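A minimal sketch of that scroll-wait-extract loop, assuming Selenium, BeautifulSoup, and a browser driver are installed (the driver choice and the two-second pause are assumptions, not requirements):

```python
import time

def should_keep_scrolling(prev_height, new_height):
    """Stop once a scroll no longer grows the page."""
    return new_height > prev_height

def scrape_timeline(url, pause=2.0):
    """Scroll a dynamically loading page to the bottom, then parse its HTML.

    Sketch only: requires selenium, bs4, and a browser driver on PATH.
    """
    from selenium import webdriver  # lazy import: heavy optional dependency
    from bs4 import BeautifulSoup

    driver = webdriver.Firefox()  # or another driver you have installed
    try:
        driver.get(url)
        height = driver.execute_script("return document.body.scrollHeight")
        while True:
            # Simulate one scroll to the bottom of the page.
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(pause)  # wait for the new content to load
            new_height = driver.execute_script("return document.body.scrollHeight")
            if not should_keep_scrolling(height, new_height):
                break
            height = new_height
        # Only now hand the fully loaded page to BeautifulSoup.
        return BeautifulSoup(driver.page_source, "html.parser")
    finally:
        driver.quit()
```

The loop keeps scrolling until the page height stops increasing, which is a simple proxy for "no more content loaded".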
Upvotes: 1