I am trying to get data from this lottery website: https://www.lotterycorner.com/tx/lotto-texas/2019
The data I would like to scrape is the dates and the winning numbers for 2017 to 2019. Then I would like to convert the data into a list and save it to a CSV or Excel file.
I apologize if my question isn't clear; I am new to Python. Here is the code I tried, but I don't know what to do after this:
import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.lotterycorner.com/tx/lotto-texas/2017')
soup = BeautifulSoup(page.content, 'html.parser')
week = soup.find(class_='win-number-table row no-brd-reduis')
dates = week.find_all(class_='win-nbr-date col-sm-3 col-xs-4')
wn = week.find_all(class_='nbr-grp')
I would like my result to be something like this:
Upvotes: 2
Views: 783
Reputation: 28630
Don't use BeautifulSoup directly if there are table tags. It's much easier to let pandas do the work for you (read_html uses lxml or BeautifulSoup under the hood to parse tables).
import pandas as pd

years = [2017, 2018, 2019]
frames = []
for year in years:
    url = 'https://www.lotterycorner.com/tx/lotto-texas/%s' % year
    # Take the first table on the page and skip its first row (the header row)
    table = pd.read_html(url)[0][1:]
    # Split the space-separated numbers into one column per number
    win_nums = table.loc[:, 1].str.split(" ", expand=True).reset_index(drop=True)
    dates = pd.DataFrame(list(table.loc[:, 0]), columns=['date'])
    frames.append(dates.merge(win_nums, left_index=True, right_index=True))

# DataFrame.append was removed in pandas 2.0; concatenate the yearly frames instead
df = pd.concat(frames, ignore_index=True)
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values('date').reset_index(drop=True)
df.to_csv('file.csv', index=False, header=False)
Output:
print(df)
date 0 1 2 3 4 5
0 2017-01-04 5 7 36 39 40 44
1 2017-01-07 2 5 14 18 26 27
2 2017-01-11 4 13 16 19 43 51
3 2017-01-14 7 8 10 18 47 48
4 2017-01-18 6 11 17 37 40 49
5 2017-01-21 2 13 17 39 41 46
6 2017-01-25 1 14 19 32 37 46
7 2017-01-28 5 7 30 48 51 52
8 2017-02-01 12 19 26 29 37 54
9 2017-02-04 8 13 19 25 26 29
10 2017-02-08 10 15 47 49 51 52
11 2017-02-11 24 25 26 29 41 53
12 2017-02-15 1 4 5 43 53 54
13 2017-02-18 5 11 14 21 38 44
14 2017-02-22 4 8 21 27 52 53
15 2017-02-25 16 37 42 46 49 54
16 2017-03-01 3 24 33 34 45 51
17 2017-03-04 2 4 5 17 48 50
18 2017-03-08 15 19 24 33 34 47
19 2017-03-11 5 6 24 28 29 37
20 2017-03-15 4 11 19 27 32 46
21 2017-03-18 12 15 16 23 38 43
22 2017-03-22 3 5 15 27 36 52
23 2017-03-25 21 25 27 30 36 48
24 2017-03-29 7 9 11 18 23 43
25 2017-04-01 3 21 28 33 38 52
26 2017-04-05 8 20 21 26 51 52
27 2017-04-08 10 11 12 47 48 52
28 2017-04-12 5 26 30 31 46 54
29 2017-04-15 2 11 36 40 42 53
.. ... .. .. .. .. .. ..
265 2019-07-20 3 35 38 45 50 51
266 2019-07-24 2 9 16 22 46 49
267 2019-07-27 1 2 6 8 20 53
268 2019-07-31 20 24 34 36 41 44
269 2019-08-03 6 17 18 20 26 34
270 2019-08-07 1 3 16 22 31 35
271 2019-08-10 18 19 27 36 48 52
272 2019-08-14 22 23 29 36 39 49
273 2019-08-17 14 18 21 23 40 44
274 2019-08-21 18 28 29 36 48 52
275 2019-08-24 11 31 42 48 50 52
276 2019-08-28 9 21 40 42 49 53
277 2019-08-31 5 7 30 41 44 54
278 2019-09-04 4 26 36 37 45 50
279 2019-09-07 22 23 31 33 40 42
280 2019-09-11 8 11 12 30 31 49
281 2019-09-14 1 3 24 28 31 41
282 2019-09-18 3 24 26 29 45 50
283 2019-09-21 2 20 31 43 45 54
284 2019-09-25 5 9 26 38 41 44
285 2019-09-28 16 18 39 45 49 54
286 2019-10-02 9 26 39 42 47 49
287 2019-10-05 6 10 18 24 32 37
288 2019-10-09 14 18 19 27 33 41
289 2019-10-12 3 11 15 29 44 49
290 2019-10-16 12 15 25 39 46 49
291 2019-10-19 19 29 41 46 50 51
292 2019-10-23 4 5 11 35 44 50
293 2019-10-26 1 2 26 41 42 54
294 2019-10-30 10 11 28 31 40 53
[295 rows x 7 columns]
Upvotes: 1
Reputation: 84465
Here is a concise way with bs4 4.7.1+ that uses the :not pseudo-class to exclude the header cell and zip to combine the columns for output. Results appear in the same order as on the page. A Session is used so the TCP connection is reused across requests.
import csv
import re

import requests
from bs4 import BeautifulSoup as bs

dates = []
winning_numbers = []

with requests.Session() as s:
    for year in range(2017, 2020):
        r = s.get(f'https://www.lotterycorner.com/tx/lotto-texas/{year}')
        # Name the parser explicitly to avoid bs4's "no parser specified" warning
        soup = bs(r.content, 'html.parser')
        # :not(.blue-bg) skips the header cell
        dates.extend(i.text for i in soup.select('.win-nbr-date:not(.blue-bg)'))
        # Raw string so that \s reaches the regex engine unchanged
        winning_numbers.extend(re.sub(r'\s+', '-', i.text.strip()) for i in soup.select('.nbr-list'))

with open("lottery.csv", "w", encoding="utf-8-sig", newline='') as csv_file:
    w = csv.writer(csv_file, delimiter=",", quoting=csv.QUOTE_MINIMAL)
    w.writerow(['date', 'numbers'])
    for row in zip(dates, winning_numbers):
        w.writerow(row)
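The whitespace clean-up and the zip pairing used above can be tried without touching the network. This is only a minimal sketch: raw_dates and raw_numbers are made-up stand-ins for the scraped cell text (the real page's date format and spacing may differ), and the CSV is written to an in-memory buffer instead of lottery.csv.

```python
import csv
import io
import re

# Hypothetical stand-ins for the scraped cell text; the real page may differ.
raw_dates = ["01/04/2017", "01/07/2017"]
raw_numbers = ["\n5\n7\n36\n39\n40\n44\n", "  2 5 14 18 26 27  "]

# Collapse every run of whitespace into a single hyphen.
winning_numbers = [re.sub(r"\s+", "-", text.strip()) for text in raw_numbers]

# zip pairs the i-th date with the i-th number string, one tuple per CSV row.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["date", "numbers"])
for row in zip(raw_dates, winning_numbers):
    writer.writerow(row)

print(buffer.getvalue())
```

Because zip stops at the shorter input, the two selectors must return the same number of elements, which is why the header cell has to be excluded from the dates.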
Upvotes: 1
Reputation: 3503
This should export the data you need to a CSV file:
from csv import writer

import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.lotterycorner.com/tx/lotto-texas/2019')
soup = BeautifulSoup(page.content, 'html.parser')

# CSV column title -> HTML class of the cells holding that column
header = {
    'date': 'win-nbr-date col-sm-3 col-xs-4',
    'winning numbers': 'nbr-grp',
    'jackpot': 'win-nbr-jackpot col-sm-3 col-xs-3',
}

table = []
for header_key, header_value in header.items():
    items = soup.find_all(class_=header_value)
    column = [','.join(item.get_text().split()) if header_key == 'winning numbers'
              else ''.join(item.get_text().split()) if header_key == 'jackpot'
              else item.get_text() for item in items]
    table.append(column)

# Transpose the list of columns into a list of rows
rows = list(zip(*table))

# newline='' stops the csv module from writing blank lines between rows on Windows
with open("winning numbers.csv", "w", newline='') as f:
    csv_writer = writer(f)
    csv_writer.writerow(header)
    for row in rows:
        csv_writer.writerow(row)
header is a dictionary mapping what will be your CSV headers to their HTML class values.
In the for loop we build up the data one column at a time. "winning numbers" and "jackpot" need special handling: whitespace and hidden characters are replaced with a comma for the former and removed entirely for the latter.
Each column is appended to a list called table. The csv writer emits one row at a time, so the columns are first transposed into rows with zip (rows = list(zip(*table))).
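To see what the zip(*table) step does, here is a small stand-alone sketch. The values in table are illustrative only and were not scraped from the site.

```python
# Each inner list is one column, in the same order as the header keys.
# The values are illustrative, not scraped.
table = [
    ["01/02/2019", "01/05/2019"],          # date column
    ["1,2,3,4,5,6", "7,8,9,10,11,12"],     # winning numbers column
    ["$5Million", "$6Million"],            # jackpot column
]

# Unpacking with * passes each column as a separate argument to zip,
# which then yields one tuple per row.
rows = list(zip(*table))

for row in rows:
    print(row)
```

Note that zip stops at the shortest column, so if one class matches an extra element (a header cell, for example) the rows can end up misaligned or truncated without any error being raised.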
Upvotes: 1
Reputation: 12255
The code below creates one CSV file per year, each with all headers and values. In this example that gives three files: data_2017.csv, data_2018.csv and data_2019.csv.
Add another year to years = ['2017', '2018', '2019'] if needed.
Winning Numbers are formatted as 1-2-3-4-5.
from bs4 import BeautifulSoup
import requests
import pandas as pd

base_url = 'https://www.lotterycorner.com/tx/lotto-texas/'
years = ['2017', '2018', '2019']

with requests.Session() as s:
    for year in years:
        data = []
        # Go through the session (s) so the connection is reused across years
        page = s.get(f'{base_url}{year}')
        soup = BeautifulSoup(page.content, 'html.parser')

        rows = soup.select(".win-number-table tr")
        headers = [td.text.strip() for td in rows[0].find_all("td")]
        # remove header line
        del rows[0]
        for row in rows:
            cells = [td.text.strip() for td in row.select("td")]
            # replace whitespace in Winning Numbers with -
            cells[headers.index("Winning Numbers")] = '-'.join(cells[headers.index("Winning Numbers")].split())
            data.append(cells)

        df = pd.DataFrame(data, columns=headers)
        # Write one file per year, with the .csv extension
        df.to_csv(f'data_{year}.csv')
To save only Winning Numbers, replace df.to_csv(f'data_{year}.csv') with:
df.to_csv(f'data_{year}.csv', columns=["Winning Numbers"], index=False, header=False)
Example output for 2017, only Winning Numbers, no header:
9-14-16-27-45-51
2-4-15-38-48-53
8-22-23-29-34-36
6-10-11-22-30-45
5-10-16-22-26-46
12-14-19-34-39-47
4-5-10-21-34-40
1-25-35-42-48-51
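The column lookup by header name can be illustrated on its own. In this rough sketch the headers and rows lists are hypothetical stand-ins for the scraped table (the real header titles other than "Winning Numbers" are assumptions):

```python
# Hypothetical header row and data rows standing in for the scraped table.
headers = ["Date", "Winning Numbers", "Jackpot"]
rows = [
    ["01/04/2017", "5 7 36\n39 40 44", "$5 Million"],
    ["01/07/2017", "2  5 14 18 26 27", "$5.25 Million"],
]

# Look the column position up by its title instead of hard-coding an index.
col = headers.index("Winning Numbers")

for row in rows:
    # split() with no argument breaks on any run of whitespace,
    # so spaces, tabs and newlines all collapse into single hyphens.
    row[col] = "-".join(row[col].split())

print([row[col] for row in rows])
```

Looking the position up by name keeps the clean-up working if the site reorders its columns; headers.index raises ValueError if the title is missing.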
Upvotes: 1
Reputation: 43593
This one works:
import re

import requests
from bs4 import BeautifulSoup


def remove_html_tags(text):
    clean = re.compile('<.*?>')
    return re.sub(clean, '', text)


def main():
    page = requests.get('https://www.lotterycorner.com/tx/lotto-texas/2018')
    soup = BeautifulSoup(page.content, 'html.parser')
    week = soup.find(class_='win-number-table row no-brd-reduis')
    wn = week.find_all(class_='nbr-grp')
    # The with block closes the file even if a write fails
    with open("vit.txt", "w+") as file:
        for winning_number in wn:
            line = remove_html_tags(str(winning_number.contents).strip('[]'))
            line = line.replace(" ", "")
            file.write(line + "\n")


if __name__ == "__main__":
    main()
This part of the code loops through the wn variable and writes every line to the "vit.txt" file:
for winning_number in wn:
    line = remove_html_tags(str(winning_number.contents).strip('[]'))
    line = line.replace(" ", "")
    file.write(line + "\n")
The stripping of the <li> tags could probably be done better; e.g. there should be an elegant way to save each winning_number to a list and print the list in one line.
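One possible way to get each draw as a list, sketched with the standard library only: strip the tags with the same regex as remove_html_tags, then collect the digit runs with re.findall. This swaps in a regex-based extraction rather than a BeautifulSoup call, and the fragment string is hypothetical markup for a single draw, so the real page may need adjustments.

```python
import re

# Hypothetical markup for one draw; the real page's tags may differ.
fragment = (
    "<ul class='nbr-grp'>"
    "<li>5</li> <li>7</li> <li>36</li> <li>39</li> <li>40</li> <li>44</li>"
    "</ul>"
)

# Replace each tag with a space first, so digits inside attributes are ignored
# and neighbouring numbers cannot run together.
text = re.sub(r"<.*?>", " ", fragment)

# Every remaining run of digits is one winning number.
numbers = re.findall(r"\d+", text)

# Print the whole draw on one line.
print(",".join(numbers))
```

With BeautifulSoup objects already in hand, iterating over the li elements and reading their text would be the more direct route; the regex version is shown only because it runs without any third-party package.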
Upvotes: 0