John_Muir

Reputation: 119

Scraping locations: selecting 'li' tags from a table and exporting to Excel

I am trying to scrape store locations for a research project that aims to show the effect COVID had on different retailers. The retailer I am currently having trouble with is "The Source", a Canadian retailer with a large number of stores across Canada; its stores are generally small compared to Best Buy's. The store locator page is: https://www.thesource.ca/en-ca/store-finder

The goal of this code is to produce an Excel file with columns for address, postal code and phone number (I assume I would use pandas for this). Those three fields are the data I want to scrape. I think the code I have written so far is on the right track, since most of the information is inside a table. However, I am struggling to get to the 'li' tags and to loop through the rows of the table. If anyone has an idea of how to grab the 'li' tags for each piece of data I want, that would be great!

import requests
from bs4 import BeautifulSoup

url = 'https://www.thesource.ca/en-ca/store-finder'

r = requests.get(url)

soup = BeautifulSoup(r.text, 'html.parser')

Locations_table = soup.find('table', class_='storeResultList store-result-list desktop-only')

for locations in Locations_table.find_all('tbody'):
    rows = locations.find_all('tr', class_='storeItem store-result-row')
    for row in rows:
        address = row.find('td', class_='address')
        # trying to get address
        # postal
        # phone number, which I think is not under this table

print(Locations_table)

Just showing how the 'li' tags are

Upvotes: 1

Views: 123

Answers (3)

Ajax1234

Reputation: 71471

You are close: each row returned by select('tr.storeItem.store-result-row') can itself be queried with further selectors to get the li values. In the solution below, a function takes each row and extracts the results:

import requests, pandas as pd
from bs4 import BeautifulSoup as soup
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0'}
d = soup(requests.get('https://www.thesource.ca/en-ca/store-finder', headers=headers).text, 'html.parser')
def store_info(row):
   return {'store':row.select_one('td.address .itemName').get_text(strip=True),
           'address':', '.join((j:=list(filter(None, [i.text for i in row.select('td.address ul li')])))[:-1]),
           'postal_code':j[-1],
           'phone':row.select_one('td.address .tel-link').get_text(strip=True)}

results = [store_info(row) for row in d.select('table:nth-of-type(1) tr.storeItem.store-result-row')]
df = pd.DataFrame(results)

Output:

                   store                                           address postal_code         phone
0        Optimist Square  4725 Dorchester Rd, Unit #B10, NIAGARA FALLS, ON     L2E 0A8  905-356-0772
1            SEAWAY MALL          800 NIAGARA ST N, UNIT #K12, WELLAND, ON      L3C5Z4  905-735-2136
2             PEN CENTRE               221 GLENDALE AVE, ST CATHARINES, ON      L2T2K9  905-684-1456
3      GRIMSBY SQUARE SC      44 Livingston Ave., Unit #1006A, GRIMSBY, ON      L3M1L1  905-945-9415
4      J & R  SPORTS LTD                       151 QUEEN ST, DUNNVILLE, ON      N1A1H6  905-774-8872
..                   ...                                               ...         ...           ...
95    KINGSVILLE MAIN ST         410 MAIN  ST E, UNIT #3/4, KINGSVILLE, ON     N9Y 1A7  519-733-4138
96  ST. CLAIR SHORES S/C         25 AMY CROFT DRIVE, UNIT #15, WINDSOR, ON      N9K1C7  519-735-5364
97         TECUMSEH MALL                D2-7650 TECUMSEH RD E, WINDSOR, ON      N8T1E9  519-974-1421
98       DEVONSHIRE MALL           3100 HOWARD AVE, UNIT #SS5, WINDSOR, ON      N8X3Y8  519-969-2099
99           PLAYIT STAR               105 HENRY STREET WEST, PRESCOTT, ON      K0E1T0  613-925-0776

[100 rows x 4 columns]
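
The address and postal-code handling in store_info is packed into a single expression with the walrus operator (:=), which requires Python 3.8 or later. A minimal sketch of the same split on its own, using only plain lists, might look like the following. The helper name split_address and the sample li texts are hypothetical; the real page's list items may differ.

```python
def split_address(li_texts):
    """Split the text of a row's <li> items into (address, postal_code).

    Assumes, as store_info does, that the last non-empty item is the postal
    code and that every earlier non-empty item belongs to the address.
    """
    items = [text for text in li_texts if text]  # drop empty strings
    return ", ".join(items[:-1]), items[-1]


# Hypothetical li texts for one row, including an empty item.
sample = ["4725 Dorchester Rd", "Unit #B10", "NIAGARA FALLS, ON", "", "L2E 0A8"]
address, postal_code = split_address(sample)
print(address)      # 4725 Dorchester Rd, Unit #B10, NIAGARA FALLS, ON
print(postal_code)  # L2E 0A8
```

Separating the split from the selector calls makes it easy to check the assumption that the postal code is always the last item before running against the full page.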

Upvotes: 0

We code according to logic: once you have identified a pattern, you can parse toward it.

The logic here is that almost all of the address entries have 6 fields, while the irregular ones have only 5 (the unit is missing), so we can pad those to line everything up.

import requests
from bs4 import BeautifulSoup
import pandas as pd
from more_itertools import collapse

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0'
}


def main(url):
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'lxml')
    goal = [list(x.stripped_strings)
            for x in soup.select_one('.storeResultList').select('.address')[1:]]

    allin = []
    for x in goal:
        if len(x) == 5:
            x.insert(2, 'N/A')
        x[3] = x[3].rsplit(",", 1)
        allin.append(list(collapse(x)))

    df = pd.DataFrame(
        allin, columns=["Name", "Address", "Unit", "City", "State", "Zip", "Phone"])
    df.to_csv('data.csv', index=False)


main('https://www.thesource.ca/en-ca/store-finder')

Output:

                    Name                Address         Unit           City State      Zip         Phone
0        Optimist Square     4725 Dorchester Rd    Unit #B10  NIAGARA FALLS    ON  L2E 0A8  905-356-0772        
1            SEAWAY MALL       800 NIAGARA ST N    UNIT #K12        WELLAND    ON   L3C5Z4  905-735-2136        
2             PEN CENTRE       221 GLENDALE AVE          N/A  ST CATHARINES    ON   L2T2K9  905-684-1456        
3      GRIMSBY SQUARE SC     44 Livingston Ave.  Unit #1006A        GRIMSBY    ON   L3M1L1  905-945-9415        
4      J & R  SPORTS LTD           151 QUEEN ST          N/A      DUNNVILLE    ON   N1A1H6  905-774-8872        
..                   ...                    ...          ...            ...   ...      ...           ...        
95    KINGSVILLE MAIN ST         410 MAIN  ST E    UNIT #3/4     KINGSVILLE    ON  N9Y 1A7  519-733-4138        
96  ST. CLAIR SHORES S/C     25 AMY CROFT DRIVE     UNIT #15        WINDSOR    ON   N9K1C7  519-735-5364        
97         TECUMSEH MALL  D2-7650 TECUMSEH RD E          N/A        WINDSOR    ON   N8T1E9  519-974-1421        
98       DEVONSHIRE MALL        3100 HOWARD AVE    UNIT #SS5        WINDSOR    ON   N8X3Y8  519-969-2099        
99           PLAYIT STAR  105 HENRY STREET WEST          N/A       PRESCOTT    ON   K0E1T0  613-925-0776        

[100 rows x 7 columns]
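
The length check above can be tried without fetching the page. The sketch below applies the same padding and city/province split to two hand-written rows using only the standard library. normalize_row is a hypothetical helper, and the sample lists are assumed to match what stripped_strings yields for each address cell. Unlike the code above, it also strips the space left after the comma, and it assumes the city line always contains a comma.

```python
def normalize_row(fields):
    """Return a 7-item row: name, address, unit, city, province, postal, phone."""
    fields = list(fields)  # work on a copy so the caller's list is untouched
    if len(fields) == 5:   # unit line missing: pad so the positions line up
        fields.insert(2, "N/A")
    # Split "CITY, PROVINCE" on the last comma only.
    city, province = fields[3].rsplit(",", 1)
    return fields[:3] + [city.strip(), province.strip()] + fields[4:]


# Hypothetical rows, one with a unit line and one without.
with_unit = ["Optimist Square", "4725 Dorchester Rd", "Unit #B10",
             "NIAGARA FALLS, ON", "L2E 0A8", "905-356-0772"]
without_unit = ["PEN CENTRE", "221 GLENDALE AVE",
                "ST CATHARINES, ON", "L2T2K9", "905-684-1456"]

for row in (with_unit, without_unit):
    print(normalize_row(row))
```

Each returned list has seven items, so it can be passed to pd.DataFrame with the same seven column names.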

Upvotes: 1

MendelG

Reputation: 20098

To select a specific li, you can use the :nth-of-type(n) CSS selector.

CSS selectors are used with the select() and select_one() methods rather than .find().

Note:

  • I added a user-agent header because the page was stuck loading without it.

In your example:

import requests
import pandas as pd
from bs4 import BeautifulSoup


url = "https://www.thesource.ca/en-ca/store-finder"

headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")

out = {"Address": [], "Postal": [], "Phone": []}

for tag in soup.select(".details"):
    out["Address"].append(tag.select_one("li:nth-of-type(1)").get_text(strip=True))
    out["Postal"].append(
        tag.select_one("li:last-of-type").get_text(strip=True)
    )
    out["Phone"].append(tag.select_one("a.tel-link").get_text(strip=True))


df = pd.DataFrame(out)
print(df.to_string())

Output (truncated):

                      Address   Postal         Phone
0           4725 Dorchester Rd  L2E 0A8  905-356-0772
1             800 NIAGARA ST N   L3C5Z4  905-735-2136
2             221 GLENDALE AVE   L2T2K9  905-684-1456
3           44 Livingston Ave.   L3M1L1  905-945-9415
4                 151 QUEEN ST   N1A1H6  905-774-8872
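
To see how the positional selectors behave, the following sketch runs them against a small hand-written fragment rather than the live page. The markup is an assumption modelled on the selectors used above (.details, li, a.tel-link), not a copy of the site's HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like one store entry.
html = """
<div class="details">
  <ul>
    <li>4725 Dorchester Rd</li>
    <li>Unit #B10</li>
    <li>NIAGARA FALLS, ON</li>
    <li>L2E 0A8</li>
  </ul>
  <a class="tel-link" href="tel:905-356-0772">905-356-0772</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
tag = soup.select_one(".details")

# nth-of-type counts <li> siblings starting from 1.
first = tag.select_one("li:nth-of-type(1)").get_text(strip=True)
third = tag.select_one("li:nth-of-type(3)").get_text(strip=True)
# last-of-type picks the final <li>, however many there are.
last = tag.select_one("li:last-of-type").get_text(strip=True)
phone = tag.select_one("a.tel-link").get_text(strip=True)

print(first, third, last, phone, sep=" | ")
```

In this fragment li:nth-of-type(1) returns only the street line, so the unit and city lines would need their own selectors (or a join over all but the last li) if the full address is wanted.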

Upvotes: 0
