Reputation: 63
I am very new to programming, so this may be a silly question. I wanted to learn to scrape web pages, so I learned BeautifulSoup to do it. It worked for a few sites, but I got stuck on the following page:
from bs4 import BeautifulSoup
import requests
r = requests.get("http://www.dlb.today/result/en")
data = r.text
soup = BeautifulSoup(data, "lxml")
data = soup.find_all("tbody", {"id": "pageData1"})
data2 = soup.find_all("ul", {"class": "res_allnumber"})
print(data)
print(data2)
# no point going further if I can't get raw data, I think
This worked fine on a similar site I scraped:
r2 = requests.get("http://www.nlb.lk/results-more.php?id=1")
data2 = r2.text
soup2 = BeautifulSoup(data2, "lxml")
news2 = soup2.find_all("a", {"class": "lottery-numbers"})
# print(news2)  # get raw HTML for checking
for draw_number in news2:
    print(draw_number.contents[0])
I couldn't scrape the table I wanted, so I tried lxml instead. Still no luck:
# lxml
import requests
import lxml.html as LH

r = requests.get("http://www.dlb.today/result/en")
data = r.text
# print(data)

root = LH.fromstring(data)
for tag1 in root.xpath('//tbody[@class="pageData1"]//li'):
    print(tag1.text_content())
I don't know where my error is or what to do next. If anyone can point me in the right direction, I'd appreciate it!
Upvotes: 1
Views: 606
Reputation: 9430
The data displayed on this page is loaded by JavaScript. Fortunately, the JavaScript loads another HTML page from the URL
http://www.dlb.today/result/pagination_re
You can access this URL with a POST request directly like this:
import requests
from bs4 import BeautifulSoup
url = "http://www.dlb.today/result/pagination_re"
data = {"pageId": "0", "resultID": "1001", "lotteryID": "1", "lastsegment": "en"}
page = requests.post(url, data=data)
soup = BeautifulSoup(page.content, 'html.parser')
for data in soup.find_all("ul", {"class": "res_allnumber"}):
    print(data)
You may have to experiment with the "data" values to get exactly what you want!
The output is:
<ul class="res_allnumber"><li class="res_number">04</li><li class="res_number">30</li><li class="res_number">44</li><li class="res_number">56</li><li class="res_number" style="background-color: #971B7E; color: #fff;">29</li><li class="res_eng_letter">V</li></ul>
<ul class="res_allnumber"><li class="res_number">15</li><li class="res_number">41</li><li class="res_number">43</li><li class="res_number">47</li><li class="res_number" style="background-color: #016B21; color: #fff;">69</li><li class="res_eng_letter">Z</li></ul>
<ul class="res_allnumber"><li class="res_number">09</li><li class="res_number">13</li><li class="res_number">17</li><li class="res_number">48</li><li class="res_number" style="background-color: #267FFF; color: #fff;">73</li><li class="res_eng_letter">D</li></ul>
<ul class="res_allnumber"><li class="res_number">31</li><li class="res_number">41</li><li class="res_number">43</li><li class="res_number">55</li><li class="res_number" style="background-color: #971B7E; color: #fff;">52</li><li class="res_eng_letter">U</li></ul>
<ul class="res_allnumber"><li class="res_number">03</li><li class="res_number">09</li><li class="res_number">19</li><li class="res_number">73</li><li class="res_number" style="background-color: #016B21; color: #fff;">67</li><li class="res_eng_letter">E</li></ul>
<ul class="res_allnumber"><li class="res_number">17</li><li class="res_number">22</li><li class="res_number">35</li><li class="res_number">39</li><li class="res_number" style="background-color: #267FFF; color: #fff;">59</li><li class="res_eng_letter">Z</li></ul>
<ul class="res_allnumber"><li class="res_number">08</li><li class="res_number">15</li><li class="res_number">30</li><li class="res_number">55</li><li class="res_number" style="background-color: #971B7E; color: #fff;">71</li><li class="res_eng_letter">I</li></ul>
<ul class="res_allnumber"><li class="res_number">11</li><li class="res_number">16</li><li class="res_number">50</li><li class="res_number">57</li><li class="res_number" style="background-color: #016B21; color: #fff;">75</li><li class="res_eng_letter">Q</li></ul>
<ul class="res_allnumber"><li class="res_number">27</li><li class="res_number">30</li><li class="res_number">43</li><li class="res_number">71</li><li class="res_number" style="background-color: #267FFF; color: #fff;">63</li><li class="res_eng_letter">E</li></ul>
<ul class="res_allnumber"><li class="res_number">19</li><li class="res_number">20</li><li class="res_number">31</li><li class="res_number">43</li><li class="res_number" style="background-color: #971B7E; color: #fff;">61</li><li class="res_eng_letter">I</li></ul>
<ul class="res_allnumber"><li class="res_number">24</li><li class="res_number">41</li><li class="res_number">47</li><li class="res_number">72</li><li class="res_number" style="background-color: #016B21; color: #fff;">32</li><li class="res_eng_letter">K</li></ul>
<ul class="res_allnumber"><li class="res_number">13</li><li class="res_number">51</li><li class="res_number">61</li><li class="res_number">65</li><li class="res_number" style="background-color: #267FFF; color: #fff;">48</li><li class="res_eng_letter">E</li></ul>
Upvotes: 1
Reputation: 1515
I tried replicating your use case. It seems the data has not yet been loaded into the page at the point when the Python code makes its request. As a result, the "tbody" and its content are empty.
I confirmed this by downloading the HTML file:
with open('sample.html', 'w') as fh:
    fh.write(data)
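Re-parsing the saved file shows the problem directly. A quick check (assuming the same lxml parser as in your snippet):

from bs4 import BeautifulSoup

with open('sample.html') as fh:
    soup = BeautifulSoup(fh.read(), 'lxml')

tbody = soup.find("tbody", {"id": "pageData1"})
# The tbody is in the static HTML, but its rows are filled in by
# JavaScript after the page loads, so here it prints as empty.
print(tbody)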
There are a couple of solutions mentioned on the web to resolve this issue:
Using the Python library dryscrape. The details are mentioned in Web-scraping JavaScript page with Python.
Using selenium:
from selenium import webdriver
import time

driver = webdriver.Firefox(executable_path='geckodriver.exe')
driver.get("http://www.dlb.today/result/en")
time.sleep(5)
htmlSource = driver.page_source
Download geckodriver from here. You can then use htmlSource as input to BeautifulSoup.
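For example, a minimal sketch of that last step (assuming the driver setup above succeeded):

from bs4 import BeautifulSoup

# Parse the fully rendered page source captured by Selenium
soup = BeautifulSoup(htmlSource, 'lxml')

# After the JavaScript has run, the result lists are present
for ul in soup.find_all("ul", {"class": "res_allnumber"}):
    print(ul.get_text(" ", strip=True))

driver.quit()  # close the browser when done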
Upvotes: 1