Reputation: 321
I am trying to extract data from a website, and the data are in a table:
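import requests
from bs4 import BeautifulSoup
# Download the page and parse it with an explicitly named parser
response = requests.get("xxxxx")
soup = BeautifulSoup(response.content, "html.parser")
# Take the first table on the page and collect its rows
table = soup.find_all("table")[0]
rows = table.find_all("tr")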
This code works, but only 42 rows are extracted, while the source table contains 220 rows. Can someone tell me how to fix this?
Upvotes: 0
Views: 71
Reputation: 1417
Welcome.
There are two possibilities: JavaScript or website security.
requests is JavaScript-agnostic: it doesn't execute any JavaScript code. If the missing rows are added by JavaScript, you'll want a headless browser solution (selenium is popular) that more closely mimics a browser, especially when it comes to JavaScript.
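As a rough sketch of the headless-browser route, assuming Selenium 4 with Firefox and geckodriver available on your machine: the helper names extract_rows and fetch_rendered_html are made up for illustration, and waiting for the first table to appear is only a guess at what this particular site needs (rows that load later would need a stricter wait condition).

```python
from bs4 import BeautifulSoup


def extract_rows(html):
    """Return the <tr> elements of the first <table> in an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    return table.find_all("tr") if table is not None else []


def fetch_rendered_html(url, timeout=10):
    """Load url in headless Firefox and return the HTML after scripts have run."""
    # Imported here so extract_rows() stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.FirefoxOptions()
    options.add_argument("-headless")
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        # Assumption: the data table is present once any <table> is in the DOM.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.TAG_NAME, "table"))
        )
        return driver.page_source
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()
```

You would then call extract_rows(fetch_rendered_html("xxxxx")) with your real URL and compare the row count against what requests gave you.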
Many websites don't want to be scraped and employ different methods to prevent it. The simplest are checking the User-Agent value of the client (your Python script) and rate throttling (20k refreshes a second isn't human). For example, if the User-Agent is anything other than a known value, the site will behave differently (returning little or no data). Other defenses are more complex, such as trying to play audio on your "browser" or polling your "browser"'s resolution. For those you'll need to investigate the site's behavior, which can take time. You can start with either the Network tab of your browser's developer tools (F12 in Firefox) or ZAP proxy for more refined control.
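One quick way to test the User-Agent theory is to request the page twice, once with the default requests header and once with a browser-like one, and compare the row counts. This is only a sketch: the User-Agent string below is an illustrative value (copy the real one your browser sends from the Network tab), and count_rows and compare_row_counts are hypothetical helper names.

```python
import requests
from bs4 import BeautifulSoup

# Illustrative value only; replace it with the string your own browser sends.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
}


def count_rows(html):
    """Count the <tr> elements in the first <table> of an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    return len(table.find_all("tr")) if table is not None else 0


def compare_row_counts(url):
    """Return (rows with default headers, rows with browser-like headers)."""
    default = requests.get(url, timeout=10)
    browser_like = requests.get(url, headers=BROWSER_HEADERS, timeout=10)
    return count_rows(default.text), count_rows(browser_like.text)
```

If the two counts differ, the site is probably filtering on the header. If both are still short of the full table, JavaScript rendering is the more likely cause.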
Upvotes: 1