Reputation: 984
Need some help here. I plan to extract all the statistical data from this site: https://lotostats.ro/toate-rezultatele-win-for-life-10-20
My issue is that I am not able to read the table; I can't do it even for the first page.
Can someone please help?
import requests
import lxml.html as lh
import pandas as pd

url = 'https://lotostats.ro/toate-rezultatele-win-for-life-10-20'
# Create a handle, page, for the contents of the website
page = requests.get(url)
# Store the contents of the website under doc
doc = lh.fromstring(page.content)
# Parse data that are stored between <tr>..</tr> of HTML
tr_elements = doc.xpath('//tr')
# Create empty list
col = []
i = 0
# For each header cell in the first row, store its name and an empty list
for t in tr_elements[0]:
    i += 1
    name = t.text_content()
    print('%d:"%s"' % (i, name))
    col.append((name, []))
# Since our first row is the header, data is stored from the second row onwards
for j in range(1, len(tr_elements)):
    # T is our j'th row
    T = tr_elements[j]
    # If row is not of size 10, the //tr data is not from our table
    # if len(T) != 10:
    #     break
    # i is the index of our column
    i = 0
    # Iterate through each element of the row
    for t in T.iterchildren():
        data = t.text_content()
        # For columns after the first, convert any numerical value to an integer
        if i > 0:
            try:
                data = int(data)
            except ValueError:
                pass
        # Append the data to the empty list of the i'th column
        col[i][1].append(data)
        # Increment i for the next column
        i += 1

Dict = {title: column for (title, column) in col}
df = pd.DataFrame(Dict)
print(df.head())
Upvotes: 1
Views: 178
Reputation: 84475
The data is added dynamically, so the table is empty in the initial HTML. You can find the source, which returns JSON, in the browser's network tab:
import requests
r = requests.get('https://lotostats.ro/all-rez/win_for_life_10_20?draw=1&columns%5B0%5D%5Bdata%5D=0&columns%5B0%5D%5Bname%5D=&columns%5B0%5D%5Bsearchable%5D=true&columns%5B0%5D%5Borderable%5D=false&columns%5B0%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B0%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B1%5D%5Bdata%5D=1&columns%5B1%5D%5Bname%5D=&columns%5B1%5D%5Bsearchable%5D=true&columns%5B1%5D%5Borderable%5D=false&columns%5B1%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B1%5D%5Bsearch%5D%5Bregex%5D=false&start=0&length=20&search%5Bvalue%5D=&search%5Bregex%5D=false&_=1564996040879').json()
You can decode that and likely (worth investigating) remove the timestamp part, or simply replace it with an arbitrary number.
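A quick sketch of the decoding step, using the standard library's urllib.parse.unquote:

from urllib.parse import unquote

# e.g. the first percent-encoded parameter from the URL above
print(unquote('columns%5B0%5D%5Bdata%5D=0'))  # -> columns[0][data]=0

Applied to the full query string, that gives the decoded request below: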
import requests
r = requests.get('https://lotostats.ro/all-rez/win_for_life_10_20?draw=1&columns[0][data]=0&columns[0][name]=&columns[0][searchable]=true&columns[0][orderable]=false&columns[0][search][value]=&columns[0][search][regex]=false&columns[1][data]=1&columns[1][name]=&columns[1][searchable]=true&columns[1][orderable]=false&columns[1][search][value]=&columns[1][search][regex]=false&start=0&length=20&search[value]=&search[regex]=false&_=1').json()
To see the lottery lines:
print(r['data'])
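If you want the rows in a DataFrame rather than printed, here is a minimal sketch. It assumes each entry in r['data'] is a list of cell values, one per requested column (the usual layout for a DataTables-style response; verify against the actual JSON):

import requests
import pandas as pd

r = requests.get('https://lotostats.ro/all-rez/win_for_life_10_20?draw=1&columns[0][data]=0&columns[0][name]=&columns[0][searchable]=true&columns[0][orderable]=false&columns[0][search][value]=&columns[0][search][regex]=false&columns[1][data]=1&columns[1][name]=&columns[1][searchable]=true&columns[1][orderable]=false&columns[1][search][value]=&columns[1][search][regex]=false&start=0&length=20&search[value]=&search[regex]=false&_=1').json()

# Assumption: each row is a list of cell values, one per requested column
df = pd.DataFrame(r['data'])
print(df.head())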
The draw parameter seems to be related to the page of draws, e.g. the 2nd page (note that start also advances to 20 in this URL):
https://lotostats.ro/all-rez/win_for_life_10_20?draw=2&columns[0][data]=0&columns[0][name]=&columns[0][searchable]=true&columns[0][orderable]=false&columns[0][search][value]=&columns[0][search][regex]=false&columns[1][data]=1&columns[1][name]=&columns[1][searchable]=true&columns[1][orderable]=false&columns[1][search][value]=&columns[1][search][regex]=false&start=20&length=20&search[value]=&search[regex]=false&_=1564996040880
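For readability, the same request can be built with a params dict instead of a hand-assembled query string. A sketch, with a hypothetical datatables_params helper whose keys are copied from the URLs above:

import requests

# Hypothetical helper: rebuilds the decoded query as a params dict
def datatables_params(draw, start, length):
    params = {'draw': draw, 'start': start, 'length': length,
              'search[value]': '', 'search[regex]': 'false', '_': 1}
    for i in (0, 1):
        params['columns[%d][data]' % i] = i
        params['columns[%d][name]' % i] = ''
        params['columns[%d][searchable]' % i] = 'true'
        params['columns[%d][orderable]' % i] = 'false'
        params['columns[%d][search][value]' % i] = ''
        params['columns[%d][search][regex]' % i] = 'false'
    return params

# e.g. the 2nd page, as in the URL above
r = requests.get('https://lotostats.ro/all-rez/win_for_life_10_20',
                 params=datatables_params(draw=2, start=20, length=20)).json()
print(r['data'])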
You can alter the length parameter to retrieve more results. For example, I can deliberately oversize it to get all results:
import requests
r = requests.get('https://lotostats.ro/all-rez/win_for_life_10_20?draw=1&columns[0][data]=0&columns[0][name]=&columns[0][searchable]=true&columns[0][orderable]=false&columns[0][search][value]=&columns[0][search][regex]=false&columns[1][data]=1&columns[1][name]=&columns[1][searchable]=true&columns[1][orderable]=false&columns[1][search][value]=&columns[1][search][regex]=false&start=0&length=100000&search[value]=&search[regex]=false&_=1').json()
print(len(r['data']))
Otherwise, you can set the length param to a fixed number, do an initial request, and calculate the number of pages from the total records count (r['recordsFiltered']) divided by the results per page:
import math
total_results = r['recordsFiltered']
results_per_page = 20
num_pages = math.ceil(total_results/results_per_page)
Then loop to get all results, remembering to alter the draw param on each request; see the sketch below. Obviously, the fewer requests the better.
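A sketch of that loop, reusing the hypothetical datatables_params helper from the earlier snippet; it assumes the recordsFiltered and data keys seen in the responses:

import math
import requests

url = 'https://lotostats.ro/all-rez/win_for_life_10_20'
results_per_page = 20

# Initial request: grab page 1 and learn the total record count
first = requests.get(url, params=datatables_params(draw=1, start=0,
                                                   length=results_per_page)).json()
all_rows = list(first['data'])
num_pages = math.ceil(first['recordsFiltered'] / results_per_page)

# Fetch the remaining pages, bumping draw and start each time
for page in range(2, num_pages + 1):
    r = requests.get(url, params=datatables_params(draw=page,
                                                   start=(page - 1) * results_per_page,
                                                   length=results_per_page)).json()
    all_rows.extend(r['data'])

print(len(all_rows))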
Upvotes: 2