cv14

Reputation: 77

Not able to scrape table data using BeautifulSoup from a website

I am following an online tutorial (https://www.analyticsvidhya.com/blog/2015/10/beginner-guide-web-scraping-beautiful-soup-python/) on scraping an HTML table. Following the tutorial, I was able to scrape its table data, but when I tried to scrape the data from this website (https://www.masslottery.com/games/lottery/search/results-history.html?game_id=15&mode=2&selected_date=2019-03-04&x=12&y=11), I was not able to do so.

I tried using Scrapy before but got the same results.

Here is the code I used.

import urllib.request

from bs4 import BeautifulSoup

wiki = "https://www.masslottery.com/games/lottery/search/results-history.html?game_id=15&mode=2&selected_date=2019-03-04&x=12&y=11"
page = urllib.request.urlopen(wiki)
soup = BeautifulSoup(page, "lxml")

# every <table> on the page
all_tables = soup.find_all('table')

# the table that should hold the results
right_table = soup.find('table', class_='zebra-body-only')
print(right_table)

This is what I get when I run this code in the terminal:

<table cellspacing="0" class="zebra-body-only">
<tbody id="target-area">
</tbody>
</table>

However, when I inspect the Mass Lottery website using Google Chrome, this is what I see:

<table cellspacing="0" class="zebra-body-only">
<tbody id="target-area">
<tr class="odd">
<th>Draw #</th>
<th>Draw Date</th>
<th>Winning Number</th>
<th>Bonus</th>
</tr>
<tr><td>2107238</td>
<td>03/04/2019</td>
<td>01-04-05-16-23-24-27-32-34-41-42-44-47-49-52-55-63-65-67-78</td><td>No Bonus</td>
</tr>
<tr class="odd">
<td>2107239</td>
<td>03/04/2019</td>
<td>04-05-11-15-19-20-23-24-25-28-41-45-52-63-64-68-71-72-73-76</td><td>4x</td>
</tr> 
....(And so on)

I want to be able to extract data from this table.

Upvotes: 1

Views: 1185

Answers (3)

Edo Edo

Reputation: 164

I would first save the response to a file to check whether the data you're looking for is really there:

with open('stuff.html', 'w') as f:
    f.write(response.text)

If you run into Unicode errors, try opening the file with an explicit encoding:

import codecs
with codecs.open('stuff.html', 'w', 'utf-8') as f:
    f.write(response.text)

If you don't see what you're looking for there, you will have to figure out the right URL to load. Check the Chrome developer tools; this is usually hard.

The easy route is to use Selenium. Make sure you wait until what you're looking for has appeared on the page, since that content is loaded dynamically.
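If you'd rather not open the saved file by hand, you can run the same check by counting the rows inside the target table. This is only a rough sketch: it uses the standard library's html.parser instead of BeautifulSoup to stay dependency-free, and RowCounter and count_rows are names made up for illustration.

```python
from html.parser import HTMLParser


class RowCounter(HTMLParser):
    """Count <tr> elements inside <table> tags carrying a given CSS class."""

    def __init__(self, table_class):
        super().__init__()
        self.table_class = table_class
        self.depth = 0  # > 0 while we are inside the target table
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            classes = (dict(attrs).get("class") or "").split()
            # Enter the target table, or track a table nested inside it.
            if self.depth or self.table_class in classes:
                self.depth += 1
        elif tag == "tr" and self.depth:
            self.rows += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.depth:
            self.depth -= 1


def count_rows(html, table_class="zebra-body-only"):
    """Return how many <tr> rows the table with table_class contains."""
    parser = RowCounter(table_class)
    parser.feed(html)
    return parser.rows


# The markup urllib returned in the question: the table shell without any rows.
raw = (
    '<table cellspacing="0" class="zebra-body-only">'
    '<tbody id="target-area"></tbody></table>'
)
print(count_rows(raw))
```

A count of zero on the raw response, while the browser shows rows, is a strong hint that the rows are filled in by a later request.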

Upvotes: 0

chitown88

Reputation: 28565

The page is dynamic, so the table is rendered after you make the initial request. You can either a) use the solution by JC1 and access the JSON response, or b) use Selenium to simulate opening the browser, rendering the page, and then grabbing the table:

from bs4 import BeautifulSoup
from selenium import webdriver


url = 'https://www.masslottery.com/games/lottery/search/results-history.html?game_id=15&mode=2&selected_date=2019-03-04&x=12&y=11'  

driver = webdriver.Chrome()
driver.get(url)
page = driver.page_source

soup = BeautifulSoup(page, "lxml")

all_tables=soup.find_all('table')


right_table=soup.find('table', class_='zebra-body-only')
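One caveat: reading page_source immediately after get() can still return the empty table if the rows arrive a moment later. Here is a sketch of an explicit wait, which I could not test against the site. The CSS selector and the 15-second timeout are my assumptions, and rendered_html is a hypothetical helper name.

```python
def rendered_html(url, css="table.zebra-body-only td", timeout=15):
    """Load url in Chrome and return the HTML once an element matching css exists."""
    # Imported here so that only callers of this function need Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Block until at least one data cell is present;
        # raises TimeoutException if none shows up within `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css))
        )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on timeout
```

The returned string can be passed to BeautifulSoup(page, "lxml") exactly as in the snippet above.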

Also as a side note: Usually if I see <table> tags, I'll let Pandas do the work for me (note, I'm blocked from accessing the site, so can't test these):

import pandas as pd
from selenium import webdriver


url = 'https://www.masslottery.com/games/lottery/search/results-history.html?game_id=15&mode=2&selected_date=2019-03-04&x=12&y=11'  

driver = webdriver.Chrome()
driver.get(url)
page = driver.page_source

# will return a list of dataframes
tables = pd.read_html(page)

# choose the dataframe you want from the list by its position
df = tables[0]

Upvotes: 0

JC1

Reputation: 879

This is happening because the website makes a second call to load the results. The initial link only loads the page, not the results. If you inspect the network requests with Chrome dev tools, you can find the request you need to replicate to get the results.

This means that to get the results, you can just call the request mentioned above and not have to call the web-page at all.

Fortunately, the endpoint you have to call is already in a nice JSON format.

GET https://www.masslottery.com/data/json/search/dailygames/history/15/201903.json?_=1555083561238

where I assume 1555083561238 is a timestamp in milliseconds.
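To replicate that request from Python, something like the following might work. It is a minimal sketch with standard-library calls only. I am assuming that 15 in the path is the game_id from the page URL, that 201903 is the year and month of the selected date, and that the `_` parameter can be left out. history_url and fetch_history are hypothetical names, and the structure of the returned JSON is not shown here, so inspect it before relying on any keys.

```python
import json
import urllib.request
from datetime import date

# Path prefix taken from the request seen in dev tools.
BASE = "https://www.masslottery.com/data/json/search/dailygames/history"


def history_url(game_id, day):
    """Build the endpoint URL (assumes the path is <game_id>/<YYYYMM>.json)."""
    return f"{BASE}/{game_id}/{day:%Y%m}.json"


def fetch_history(game_id, day, timeout=10):
    """Download and decode the JSON document for the month containing `day`."""
    with urllib.request.urlopen(history_url(game_id, day), timeout=timeout) as resp:
        return json.load(resp)


print(history_url(15, date(2019, 3, 4)))
# https://www.masslottery.com/data/json/search/dailygames/history/15/201903.json
```

Calling fetch_history(15, date(2019, 3, 4)) would then return the parsed JSON, provided the endpoint accepts the request without extra headers.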

Upvotes: 1
