Reputation: 77
I am following an online tutorial (https://www.analyticsvidhya.com/blog/2015/10/beginner-guide-web-scraping-beautiful-soup-python/) on scraping an HTML table. I was able to scrape the table data used in the tutorial, but when I tried to scrape the data from this website (https://www.masslottery.com/games/lottery/search/results-history.html?game_id=15&mode=2&selected_date=2019-03-04&x=12&y=11), I was not able to do so.
I tried using Scrapy before but got the same result.
Here is the code I used.
import urllib.request
wiki = "https://www.masslottery.com/games/lottery/search/results-history.html?game_id=15&mode=2&selected_date=2019-03-04&x=12&y=11"
page = urllib.request.urlopen(wiki)
from bs4 import BeautifulSoup
soup = BeautifulSoup(page, "lxml")
all_tables=soup.find_all('table')
right_table=soup.find('table', class_='zebra-body-only')
print(right_table)
This is what I get when I run this code in the terminal:
<table cellspacing="0" class="zebra-body-only">
<tbody id="target-area">
</tbody>
</table>
However, when I inspect the Mass Lottery website using Google Chrome, this is what I see:
<table cellspacing="0" class="zebra-body-only">
<tbody id="target-area">
<tr class="odd">
<th>Draw #</th>
<th>Draw Date</th>
<th>Winning Number</th>
<th>Bonus</th>
</tr>
<tr><td>2107238</td>
<td>03/04/2019</td>
<td>01-04-05-16-23-24-27-32-34-41-42-44-47-49-52-55-63-65-67-78</td><td>No Bonus</td>
</tr>
<tr class="odd">
<td>2107239</td>
<td>03/04/2019</td>
<td>04-05-11-15-19-20-23-24-25-28-41-45-52-63-64-68-71-72-73-76</td><td>4x</td>
</tr>
....(And so on)
I want to be able to extract data from this table.
Upvotes: 1
Views: 1185
Reputation: 164
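Yes, I would first save the data you get to a file to see whether what you're looking for is really there: with open('stuff.html', 'w') as f: f.write(response.text) (this assumes response is a requests-style response object; with urllib you would write the bytes returned by page.read() instead).
If you hit Unicode errors, try: import codecs, then with codecs.open(fp, 'w', 'utf-8') as f:
If you don't see what you're looking for in that file, you will have to figure out the right URL to load. Check the network requests in Chrome's developer tools; this is usually the hard part.
The easy route is to use Selenium. Make sure you wait until what you're looking for has appeared on the page, since the content is dynamic.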
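As a rough sketch of the "save it and check" step, the snippet below writes the HTML to a file and counts the <tr> tags in it. The helper names (RowCounter, dump_and_count_rows) are made up for illustration, and the sample string is the empty table the question reports getting back, not a live response:

```python
import tempfile
from html.parser import HTMLParser
from pathlib import Path


class RowCounter(HTMLParser):
    """Counts <tr> start tags so we can tell whether any rows were delivered."""

    def __init__(self):
        super().__init__()
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows += 1


def dump_and_count_rows(html, path):
    # Write with an explicit encoding so non-ASCII characters do not raise.
    Path(path).write_text(html, encoding="utf-8")
    counter = RowCounter()
    counter.feed(html)
    return counter.rows


# The markup the question gets back from urllib: a table with an empty <tbody>.
static_html = (
    '<table cellspacing="0" class="zebra-body-only">'
    '<tbody id="target-area"></tbody></table>'
)

with tempfile.TemporaryDirectory() as tmp:
    print(dump_and_count_rows(static_html, Path(tmp) / "stuff.html"))  # prints 0
```

A count of zero on the saved file means the rows never arrived in the initial response, so they must be loaded by a separate request.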
Upvotes: 0
Reputation: 28565
The page is dynamic, so the table is rendered after you make the initial request. You can either use the solution by JC1 and access the JSON response directly, or use Selenium to simulate opening the browser, render the page, and then grab the table:
from bs4 import BeautifulSoup
from selenium import webdriver
url = 'https://www.masslottery.com/games/lottery/search/results-history.html?game_id=15&mode=2&selected_date=2019-03-04&x=12&y=11'
driver = webdriver.Chrome()
driver.get(url)
page = driver.page_source
soup = BeautifulSoup(page, "lxml")
all_tables=soup.find_all('table')
right_table=soup.find('table', class_='zebra-body-only')
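Because the rows are filled in after the page loads, reading page_source immediately after driver.get may still return the empty table. Here is a minimal sketch with an explicit wait. It assumes the rows show up as <tr> elements inside the tbody with id "target-area", as in the markup shown in the question; the function name and the 10-second timeout are arbitrary choices, and I have not run this against the site:

```python
# Assumption: result rows are inserted under <tbody id="target-area">.
ROW_SELECTOR = "#target-area tr"


def fetch_rendered_html(url, timeout=10):
    """Open the page in Chrome and return its HTML once at least one row exists."""
    # Imported inside the function so this module loads even without Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Block until the selector matches; raises TimeoutException otherwise.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ROW_SELECTOR))
        )
        return driver.page_source
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()
```

The returned string can be passed to BeautifulSoup(page, "lxml") exactly like page_source above.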
Also, as a side note: usually if I see <table> tags, I'll let Pandas do the work for me (note: I'm blocked from accessing the site, so I can't test these):
import pandas as pd
from selenium import webdriver
url = 'https://www.masslottery.com/games/lottery/search/results-history.html?game_id=15&mode=2&selected_date=2019-03-04&x=12&y=11'
driver = webdriver.Chrome()
driver.get(url)
page = driver.page_source
# will return a list of dataframes
tables = pd.read_html(page)
# choose the dataframe you want from the list by its position
df = tables[0]
Upvotes: 0
Reputation: 879
This is happening because the website makes a second request to load the results. The initial link only loads the page shell, not the results. By inspecting the network requests in Chrome DevTools, you can find the request you need to replicate to get the results.
This means that to get the results, you can call that request directly and not load the web page at all.
Fortunately, the endpoint you have to call is already in a nice JSON format.
GET https://www.masslottery.com/data/json/search/dailygames/history/15/201903.json?_=1555083561238
where I assume 1555083561238 is the timestamp.
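A minimal sketch of calling that endpoint from Python with only the standard library. The function names are made up, and the URL layout is an assumption read off the example above: I am guessing that 15 is the game id, 201903 is the year and month, and the _ parameter is a millisecond timestamp used as a cache buster. The structure of the returned JSON is not shown here, so inspect it before indexing into it:

```python
import json
import time
import urllib.request

BASE = "https://www.masslottery.com/data/json/search/dailygames/history"


def history_url(game_id, year, month, cache_buster=None):
    # Assumption: path segments are the game id and the month as YYYYMM,
    # inferred from ".../history/15/201903.json" in the request above.
    if cache_buster is None:
        cache_buster = int(time.time() * 1000)  # current time in milliseconds
    return f"{BASE}/{game_id}/{year:04d}{month:02d}.json?_={cache_buster}"


def fetch_history(game_id, year, month):
    """Download and decode the JSON for one game and month (needs network access)."""
    with urllib.request.urlopen(history_url(game_id, year, month)) as resp:
        return json.load(resp)


# Rebuilds the exact request quoted above without touching the network.
print(history_url(15, 2019, 3, cache_buster=1555083561238))
```

Calling fetch_history(15, 2019, 3) would then return the parsed JSON; printing its type and top-level keys is a sensible first step before extracting the draws.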
Upvotes: 1