Niranga Sithara

Reputation: 63

Can't scrape a web page with BeautifulSoup or lxml

I am very new to programming, so this may be a silly question. I wanted to learn to scrape web pages, so I learned BeautifulSoup. It worked for a few sites, but I got stuck on the following page:

from bs4 import BeautifulSoup
import requests

r  = requests.get("http://www.dlb.today/result/en")
data = r.text
soup = BeautifulSoup(data, "lxml")

data = soup.find_all("tbody", {"id": "pageData1"})
data2 = soup.find_all("ul", {"class": "res_allnumber"})
print data
print data2
# no point going further if I can't get the raw data, I think

This worked fine (a similar site I scraped):

r2  = requests.get("http://www.nlb.lk/results-more.php?id=1")
data2 = r2.text
soup2 = BeautifulSoup(data2, "lxml")
news2 = soup2.find_all("a", {"class": "lottery-numbers"})
#print news2 #(get raw Html for checking)
for draw_number in news2:
   print draw_number.contents[0]

I couldn't scrape the table I wanted, so I tried lxml instead. Still no luck.

#lxml
import requests

r  = requests.get("http://www.dlb.today/result/en")
data = r.text

#print data

import lxml.html as LH

content = data
root = LH.fromstring(content)
for tag1 in root.xpath('//tbody[@class="pageData1"]//li'):  
    print tag1.text_content()

I don't know where my error is or what to do next. If anyone can point me in the right direction, I would appreciate it!

Upvotes: 1

Views: 606

Answers (2)

Dan-Dev

Reputation: 9430

JavaScript is involved in loading the data this page displays. Fortunately, the JavaScript loads another HTML page from the URL

http://www.dlb.today/result/pagination_re

You can access this URL with a POST request directly like this:

import requests
from bs4 import BeautifulSoup

url = "http://www.dlb.today/result/pagination_re"
data = {"pageId": "0", "resultID": "1001", "lotteryID": "1", "lastsegment": "en"}
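page = requests.post(url, data=data)  # send the values as form data
soup = BeautifulSoup(page.content, 'html.parser')
# use a separate loop variable so the "data" payload is not overwritten
for result in soup.find_all("ul", {"class": "res_allnumber"}):
    print(result)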

You may have to experiment with the values in "data" to get exactly what you want.

The output is:

<ul class="res_allnumber"><li class="res_number">04</li><li class="res_number">30</li><li class="res_number">44</li><li class="res_number">56</li><li class="res_number" style="background-color: #971B7E; color: #fff;">29</li><li class="res_eng_letter">V</li></ul>
<ul class="res_allnumber"><li class="res_number">15</li><li class="res_number">41</li><li class="res_number">43</li><li class="res_number">47</li><li class="res_number" style="background-color: #016B21; color: #fff;">69</li><li class="res_eng_letter">Z</li></ul>
<ul class="res_allnumber"><li class="res_number">09</li><li class="res_number">13</li><li class="res_number">17</li><li class="res_number">48</li><li class="res_number" style="background-color: #267FFF; color: #fff;">73</li><li class="res_eng_letter">D</li></ul>
<ul class="res_allnumber"><li class="res_number">31</li><li class="res_number">41</li><li class="res_number">43</li><li class="res_number">55</li><li class="res_number" style="background-color: #971B7E; color: #fff;">52</li><li class="res_eng_letter">U</li></ul>
<ul class="res_allnumber"><li class="res_number">03</li><li class="res_number">09</li><li class="res_number">19</li><li class="res_number">73</li><li class="res_number" style="background-color: #016B21; color: #fff;">67</li><li class="res_eng_letter">E</li></ul>
<ul class="res_allnumber"><li class="res_number">17</li><li class="res_number">22</li><li class="res_number">35</li><li class="res_number">39</li><li class="res_number" style="background-color: #267FFF; color: #fff;">59</li><li class="res_eng_letter">Z</li></ul>
<ul class="res_allnumber"><li class="res_number">08</li><li class="res_number">15</li><li class="res_number">30</li><li class="res_number">55</li><li class="res_number" style="background-color: #971B7E; color: #fff;">71</li><li class="res_eng_letter">I</li></ul>
<ul class="res_allnumber"><li class="res_number">11</li><li class="res_number">16</li><li class="res_number">50</li><li class="res_number">57</li><li class="res_number" style="background-color: #016B21; color: #fff;">75</li><li class="res_eng_letter">Q</li></ul>
<ul class="res_allnumber"><li class="res_number">27</li><li class="res_number">30</li><li class="res_number">43</li><li class="res_number">71</li><li class="res_number" style="background-color: #267FFF; color: #fff;">63</li><li class="res_eng_letter">E</li></ul>
<ul class="res_allnumber"><li class="res_number">19</li><li class="res_number">20</li><li class="res_number">31</li><li class="res_number">43</li><li class="res_number" style="background-color: #971B7E; color: #fff;">61</li><li class="res_eng_letter">I</li></ul>
<ul class="res_allnumber"><li class="res_number">24</li><li class="res_number">41</li><li class="res_number">47</li><li class="res_number">72</li><li class="res_number" style="background-color: #016B21; color: #fff;">32</li><li class="res_eng_letter">K</li></ul>
<ul class="res_allnumber"><li class="res_number">13</li><li class="res_number">51</li><li class="res_number">61</li><li class="res_number">65</li><li class="res_number" style="background-color: #267FFF; color: #fff;">48</li><li class="res_eng_letter">E</li></ul>
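One way to experiment with the "data" values is to generate a payload per page and send each one in turn. This is only a sketch: build_payloads and fetch_pages are hypothetical helper names, and treating "pageId" as a page index that can be incremented is an assumption, not something the site documents.

```python
def build_payloads(page_ids, result_id="1001", lottery_id="1", lang="en"):
    """Return one form payload per page id, using the same keys as the request above."""
    return [
        {
            "pageId": str(page_id),  # assumption: pageId selects the page of results
            "resultID": result_id,
            "lotteryID": lottery_id,
            "lastsegment": lang,
        }
        for page_id in page_ids
    ]


def fetch_pages(post, url, payloads):
    """POST each payload and collect the response bodies.

    `post` should behave like requests.post: post(url, data=payload)
    must return an object with a .text attribute.
    """
    return [post(url, data=payload).text for payload in payloads]


# Usage (needs network access and the requests library):
#   import requests
#   url = "http://www.dlb.today/result/pagination_re"
#   pages = fetch_pages(requests.post, url, build_payloads(range(3)))
```

Passing the POST function in as an argument keeps the helpers easy to try out with a stand-in before hitting the real site.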

Upvotes: 1

user1211

Reputation: 1515

I tried replicating your use case. It seems the data has not yet been loaded into the page when the Python code makes its request. As a result, the "tbody" and its contents are empty.

I confirmed this by writing the downloaded HTML to a file:

# data is r.text from the code in the question
with open('sample.html', 'w') as fh:
    fh.write(data)
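The same check can be done in code rather than by opening the saved file. The sketch below is a minimal example; has_rows is a hypothetical helper, and the two HTML strings are simplified stand-ins for the static and rendered page, not markup copied from the site.

```python
from bs4 import BeautifulSoup


def has_rows(html, element_id):
    """Return True if the element with this id contains at least one child tag."""
    soup = BeautifulSoup(html, "html.parser")
    element = soup.find(id=element_id)
    if element is None:
        return False
    # find(True) returns the first descendant tag, or None if there is none
    return element.find(True) is not None


# Stand-in for what requests.get() sees: the tbody exists but is empty
static_html = '<table><tbody id="pageData1"></tbody></table>'
# Stand-in for the page after JavaScript has filled in the rows
rendered_html = '<table><tbody id="pageData1"><tr><td>04</td></tr></tbody></table>'

print(has_rows(static_html, "pageData1"))
print(has_rows(rendered_html, "pageData1"))
```

If this returns False for the text fetched with requests, the selector is not the problem; the content simply is not in the static HTML.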

There are a couple of solutions mentioned on the web to resolve this issue:

  1. Using the Python library dryscrape. The details are covered in "Web-scraping JavaScript page with Python".

  2. Using selenium:

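from selenium import webdriver
import time

# executable_path is deprecated in Selenium 4 in favour of a Service object
driver = webdriver.Firefox(executable_path = 'geckodriver.exe')
driver.get("http://www.dlb.today/result/en")
time.sleep(5)  # give the JavaScript time to load the results
htmlSource = driver.page_source
driver.quit()  # close the browser once the source has been captured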

Download geckodriver from here. You can then pass htmlSource to BeautifulSoup as its input.
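As a rough sketch of that last step, the function below turns the page source into plain Python values. parse_results is a hypothetical helper, and sample_source is a stand-in for htmlSource that assumes the rendered page uses the "res_allnumber", "res_number" and "res_eng_letter" classes seen in the other answer's output.

```python
from bs4 import BeautifulSoup


def parse_results(html_source):
    """Return a list of (numbers, letter) tuples, one per ul.res_allnumber."""
    soup = BeautifulSoup(html_source, "html.parser")
    results = []
    for ul in soup.find_all("ul", {"class": "res_allnumber"}):
        numbers = [
            li.get_text(strip=True)
            for li in ul.find_all("li", {"class": "res_number"})
        ]
        letter_tag = ul.find("li", {"class": "res_eng_letter"})
        letter = letter_tag.get_text(strip=True) if letter_tag else None
        results.append((numbers, letter))
    return results


# Stand-in for htmlSource (assumed markup, taken from the other answer's output)
sample_source = (
    '<ul class="res_allnumber"><li class="res_number">04</li>'
    '<li class="res_number">30</li><li class="res_number">44</li>'
    '<li class="res_number">56</li>'
    '<li class="res_number" style="background-color: #971B7E; color: #fff;">29</li>'
    '<li class="res_eng_letter">V</li></ul>'
)

print(parse_results(sample_source))
```

With Selenium, the call would be parse_results(htmlSource). The numbers are kept as strings so that leading zeros such as "04" are preserved.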

Upvotes: 1
