user12269799

Scraping HTML data from a website with <li> tags

I am trying to get data from this lottery website: https://www.lotterycorner.com/tx/lotto-texas/2019

The data I would like to scrape is the dates and the winning numbers for 2017 to 2019. I would then like to convert the data into a list and save it to a csv or Excel file.

I apologize if my question isn't clear; I am new to Python. Here is the code I tried, but I don't know what to do after this:

import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.lotterycorner.com/tx/lotto-texas/2017')
soup = BeautifulSoup(page.content, 'html.parser')
week = soup.find(class_='win-number-table row no-brd-reduis')
dates = week.find_all(class_='win-nbr-date col-sm-3 col-xs-4')
wn = week.find_all(class_='nbr-grp')

I would like my result to be something like this:

[screenshot of the desired output: a table of draw dates and winning numbers]

Upvotes: 2

Views: 783

Answers (5)

chitown88

Reputation: 28630

Don't use BeautifulSoup directly when the page has table tags. It's much easier to let pandas do the work for you (pd.read_html can use BeautifulSoup to parse the tables under the hood).

import pandas as pd

years = [2017, 2018, 2019]

frames = []
for year in years:
    url = f'https://www.lotterycorner.com/tx/lotto-texas/{year}'
    # first table on the page; skip its header row
    table = pd.read_html(url)[0][1:]
    # split the space-separated winning numbers into one column per number
    win_nums = table.loc[:, 1].str.split(' ', expand=True).reset_index(drop=True)
    dates = pd.DataFrame(list(table.loc[:, 0]), columns=['date'])
    frames.append(dates.merge(win_nums, left_index=True, right_index=True))

# combine all years and sort chronologically
df = pd.concat(frames, ignore_index=True)
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values('date').reset_index(drop=True)

df.to_csv('file.csv', index=False, header=False)
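
If an Excel file is preferred, as mentioned in the question, the last line can be swapped for pandas' Excel writer (a minimal sketch; it assumes the openpyxl package is installed):

df.to_excel('file.xlsx', index=False, header=False)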

Output:

print (df)
          date   0   1   2   3   4   5
0   2017-01-04   5   7  36  39  40  44
1   2017-01-07   2   5  14  18  26  27
2   2017-01-11   4  13  16  19  43  51
3   2017-01-14   7   8  10  18  47  48
4   2017-01-18   6  11  17  37  40  49
5   2017-01-21   2  13  17  39  41  46
6   2017-01-25   1  14  19  32  37  46
7   2017-01-28   5   7  30  48  51  52
8   2017-02-01  12  19  26  29  37  54
9   2017-02-04   8  13  19  25  26  29
10  2017-02-08  10  15  47  49  51  52
11  2017-02-11  24  25  26  29  41  53
12  2017-02-15   1   4   5  43  53  54
13  2017-02-18   5  11  14  21  38  44
14  2017-02-22   4   8  21  27  52  53
15  2017-02-25  16  37  42  46  49  54
16  2017-03-01   3  24  33  34  45  51
17  2017-03-04   2   4   5  17  48  50
18  2017-03-08  15  19  24  33  34  47
19  2017-03-11   5   6  24  28  29  37
20  2017-03-15   4  11  19  27  32  46
21  2017-03-18  12  15  16  23  38  43
22  2017-03-22   3   5  15  27  36  52
23  2017-03-25  21  25  27  30  36  48
24  2017-03-29   7   9  11  18  23  43
25  2017-04-01   3  21  28  33  38  52
26  2017-04-05   8  20  21  26  51  52
27  2017-04-08  10  11  12  47  48  52
28  2017-04-12   5  26  30  31  46  54
29  2017-04-15   2  11  36  40  42  53
..         ...  ..  ..  ..  ..  ..  ..
265 2019-07-20   3  35  38  45  50  51
266 2019-07-24   2   9  16  22  46  49
267 2019-07-27   1   2   6   8  20  53
268 2019-07-31  20  24  34  36  41  44
269 2019-08-03   6  17  18  20  26  34
270 2019-08-07   1   3  16  22  31  35
271 2019-08-10  18  19  27  36  48  52
272 2019-08-14  22  23  29  36  39  49
273 2019-08-17  14  18  21  23  40  44
274 2019-08-21  18  28  29  36  48  52
275 2019-08-24  11  31  42  48  50  52
276 2019-08-28   9  21  40  42  49  53
277 2019-08-31   5   7  30  41  44  54
278 2019-09-04   4  26  36  37  45  50
279 2019-09-07  22  23  31  33  40  42
280 2019-09-11   8  11  12  30  31  49
281 2019-09-14   1   3  24  28  31  41
282 2019-09-18   3  24  26  29  45  50
283 2019-09-21   2  20  31  43  45  54
284 2019-09-25   5   9  26  38  41  44
285 2019-09-28  16  18  39  45  49  54
286 2019-10-02   9  26  39  42  47  49
287 2019-10-05   6  10  18  24  32  37
288 2019-10-09  14  18  19  27  33  41
289 2019-10-12   3  11  15  29  44  49
290 2019-10-16  12  15  25  39  46  49
291 2019-10-19  19  29  41  46  50  51
292 2019-10-23   4   5  11  35  44  50
293 2019-10-26   1   2  26  41  42  54
294 2019-10-30  10  11  28  31  40  53

[295 rows x 7 columns]

Upvotes: 1

QHarr

Reputation: 84465

Here is a concise way for bs4 4.7.1+ that uses the :not pseudo-class to exclude the header cells and zip to combine the columns for output. The results appear in the same order as on the page. A Session is used for efficiency, re-using the TCP connection across requests.

import requests, re, csv
from bs4 import BeautifulSoup as bs

dates = []; winning_numbers = []

with requests.Session() as s:
    for year in range(2017, 2020):
        r = s.get(f'https://www.lotterycorner.com/tx/lotto-texas/{year}')
        soup = bs(r.content, 'html.parser')
        # date cells, excluding the blue header cell
        dates.extend([i.text for i in soup.select('.win-nbr-date:not(.blue-bg)')])
        # collapse the whitespace between numbers into hyphens
        winning_numbers.extend([re.sub(r'\s+', '-', i.text.strip()) for i in soup.select('.nbr-list')])

with open("lottery.csv", "w", encoding="utf-8-sig", newline='') as csv_file:
    w = csv.writer(csv_file, delimiter=",", quoting=csv.QUOTE_MINIMAL)
    w.writerow(['date', 'numbers'])
    for row in zip(dates, winning_numbers):
        w.writerow(row)
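
For readers new to the :not pseudo-class, here is a small self-contained illustration (the HTML is made up for demonstration; only the class names mirror the real page):

from bs4 import BeautifulSoup as bs

html = '''
<table><tr>
  <td class="win-nbr-date blue-bg">Date</td>
  <td class="win-nbr-date">11/20/2019</td>
  <td class="win-nbr-date">11/23/2019</td>
</tr></table>
'''
soup = bs(html, 'html.parser')
# the header cell carries blue-bg, so :not(.blue-bg) filters it out
print([td.text for td in soup.select('.win-nbr-date:not(.blue-bg)')])
# ['11/20/2019', '11/23/2019']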

Upvotes: 1

0buz

Reputation: 3503

This should export the data you need to a csv file:

from bs4 import BeautifulSoup
from csv import writer
import requests


page = requests.get('https://www.lotterycorner.com/tx/lotto-texas/2019')
soup = BeautifulSoup(page.content, 'html.parser')

# maps each csv header to the html class of the matching cells
header = {
    'date': 'win-nbr-date col-sm-3 col-xs-4',
    'winning numbers': 'nbr-grp',
    'jackpot': 'win-nbr-jackpot col-sm-3 col-xs-3',
}

table = []

for header_key, header_value in header.items():
    items = soup.find_all(class_=header_value)
    if header_key == 'winning numbers':
        # join the individual numbers with commas
        column = [','.join(item.get_text().split()) for item in items]
    elif header_key == 'jackpot':
        # strip whitespace/hidden characters inside the amount
        column = [''.join(item.get_text().split()) for item in items]
    else:
        column = [item.get_text() for item in items]
    table.append(column)

# transpose the columns into rows
rows = list(zip(*table))

with open("winning numbers.csv", "w", newline='') as f:
    csv_writer = writer(f)
    csv_writer.writerow(header)  # iterating the dict yields the header names
    for row in rows:
        csv_writer.writerow(row)

header is a dictionary mapping what will become your csv headers to the corresponding html class values.

In the for loop we build up the data column by column. "winning numbers" and "jackpot" need special handling: any whitespace or hidden characters inside them are replaced with a comma or stripped out entirely.

Each column is appended to a list called table. We then write everything to a csv file, but since the csv writer emits one row at a time, we first transpose the columns into rows with the zip function (rows = list(zip(*table))).
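
As a quick illustration of that transpose step (toy values borrowed from the 2019 results shown above):

table = [
    ['10/23/2019', '10/26/2019'],            # dates column
    ['4,5,11,35,44,50', '1,2,26,41,42,54'],  # winning-numbers column
]
rows = list(zip(*table))
# [('10/23/2019', '4,5,11,35,44,50'), ('10/26/2019', '1,2,26,41,42,54')]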

Upvotes: 1

Sers

Reputation: 12255

The code below creates one csv file per year, with all headers and values; the example produces three files: data_2017.csv, data_2018.csv and data_2019.csv.
You can add another year to years = ['2017', '2018', '2019'] if needed.
The winning numbers are formatted as 1-2-3-4-5.

from bs4 import BeautifulSoup
import requests
import pandas as pd

base_url = 'https://www.lotterycorner.com/tx/lotto-texas/'
years = ['2017', '2018', '2019']

with requests.Session() as s:
    for year in years:
        data = []

        # re-use the session's connection for each request
        page = s.get(f'{base_url}{year}')
        soup = BeautifulSoup(page.content, 'html.parser')
        rows = soup.select(".win-number-table tr")

        headers = [td.text.strip() for td in rows[0].find_all("td")]
        # remove the header line
        del rows[0]
        for row in rows:
            td = [td.text.strip() for td in row.select("td")]
            # replace whitespace in Winning Numbers with -
            td[headers.index("Winning Numbers")] = '-'.join(td[headers.index("Winning Numbers")].split())
            data.append(td)

        df = pd.DataFrame(data, columns=headers)
        df.to_csv(f'data_{year}.csv')

To save only the Winning Numbers, replace df.to_csv(f'data_{year}.csv') with:

df.to_csv(f'data_{year}.csv', columns=["Winning Numbers"], index=False, header=False)

Example output for 2017, only Winning Numbers, no header:

9-14-16-27-45-51
2-4-15-38-48-53
8-22-23-29-34-36
6-10-11-22-30-45
5-10-16-22-26-46
12-14-19-34-39-47
4-5-10-21-34-40
1-25-35-42-48-51

Upvotes: 1

Vityata

Reputation: 43593

This one works:

import requests
from bs4 import BeautifulSoup
import re

def main():
    page = requests.get('https://www.lotterycorner.com/tx/lotto-texas/2018')
    soup = BeautifulSoup(page.content, 'html.parser')
    week = soup.find(class_='win-number-table row no-brd-reduis')
    wn = week.find_all(class_='nbr-grp')
    # write one draw per line, numbers separated by commas
    with open("vit.txt", "w") as file:
        for winning_number in wn:
            line = remove_html_tags(str(winning_number.contents).strip('[]'))
            line = line.replace(" ", "")
            file.write(line + "\n")

def remove_html_tags(text):
    clean = re.compile('<.*?>')
    return re.sub(clean, '', text)

if __name__ == '__main__':
    main()

This part of the code loops through the wn variable and writes every line to the "vit.txt" file:

for winning_number in wn:
    line = remove_html_tags(str(winning_number.contents).strip('[]'))
    line = line.replace(" ", "")
    file.write(line + "\n")

The "stripping" of the <li> tags could be probably done better, e.g. there should be an elegant way to save the winning_number to a list and print the list with 1 line.

Upvotes: 0
