Reputation: 119
I am trying to scrape store locations for a research project that aims to show the effect COVID had on different retailers. The retailer I am currently having trouble with is "The Source", a Canadian retailer with a large number of stores across Canada, most of them small compared to Best Buy. The store locator page is: https://www.thesource.ca/en-ca/store-finder
The goal is an Excel file with columns for address, postal code, and phone number (I assume I would use pandas for this); those three fields are the data I want to scrape. I think the code I have written so far is on the right track, since most of the information sits in a table. However, I am struggling to reach the 'li' tags and to loop through the rows of the table. If anyone has an idea how to grab the 'li' tags for each piece of data I want, that would be great!
import requests
from bs4 import BeautifulSoup

url = 'https://www.thesource.ca/en-ca/store-finder'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
Locations_table = soup.find('table', class_='storeResultList store-result-list desktop-only')
for locations in Locations_table.find_all('tbody'):
    rows = locations.find_all('tr', class_='storeItem store-result-row')
    for row in rows:
        address = row.find('td', class_='address')
        # trying to get address
        # postal
        # phone number which I think is not under this table
print(Locations_table)
Upvotes: 1
Views: 123
Reputation: 71471
You are close: each row object produced by iterating over BeautifulSoup.select('tr.storeItem.store-result-row') can be further selected from to get the li values. In the solution below, a function takes in each row and extracts the results:
import requests, pandas as pd
from bs4 import BeautifulSoup as soup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0'}
d = soup(requests.get('https://www.thesource.ca/en-ca/store-finder', headers=headers).text, 'html.parser')

def store_info(row):
    # The walrus operator (:=) requires Python 3.8+; j holds the non-empty li texts,
    # with the postal code as the last item
    return {'store': row.select_one('td.address .itemName').get_text(strip=True),
            'address': ', '.join((j := list(filter(None, [i.text for i in row.select('td.address ul li')])))[:-1]),
            'postal_code': j[-1],
            'phone': row.select_one('td.address .tel-link').get_text(strip=True)}

results = [store_info(row) for row in d.select('table:nth-of-type(1) tr.storeItem.store-result-row')]
df = pd.DataFrame(results)
Output:
    store                 address                                           postal_code  phone
0   Optimist Square       4725 Dorchester Rd, Unit #B10, NIAGARA FALLS, ON  L2E 0A8      905-356-0772
1   SEAWAY MALL           800 NIAGARA ST N, UNIT #K12, WELLAND, ON          L3C5Z4       905-735-2136
2   PEN CENTRE            221 GLENDALE AVE, ST CATHARINES, ON               L2T2K9       905-684-1456
3   GRIMSBY SQUARE SC     44 Livingston Ave., Unit #1006A, GRIMSBY, ON      L3M1L1       905-945-9415
4   J & R SPORTS LTD      151 QUEEN ST, DUNNVILLE, ON                       N1A1H6       905-774-8872
..  ...                   ...                                               ...          ...
95  KINGSVILLE MAIN ST    410 MAIN ST E, UNIT #3/4, KINGSVILLE, ON          N9Y 1A7      519-733-4138
96  ST. CLAIR SHORES S/C  25 AMY CROFT DRIVE, UNIT #15, WINDSOR, ON         N9K1C7       519-735-5364
97  TECUMSEH MALL         D2-7650 TECUMSEH RD E, WINDSOR, ON                N8T1E9       519-974-1421
98  DEVONSHIRE MALL       3100 HOWARD AVE, UNIT #SS5, WINDSOR, ON           N8X3Y8       519-969-2099
99  PLAYIT STAR           105 HENRY STREET WEST, PRESCOTT, ON               K0E1T0       613-925-0776
[100 rows x 4 columns]
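To try the row-level selection without hitting the site, here is a minimal sketch that runs against an inline HTML snippet. The markup is hypothetical: its class names mirror the selectors used above, but the real page may be structured differently, and li_texts is an illustrative helper, not part of the answer's code.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: class names mirror the selectors used in the answer,
# but the real page's HTML may differ.
HTML = """
<table>
  <tr class="storeItem store-result-row">
    <td class="address">
      <span class="itemName">PEN CENTRE</span>
      <ul class="details">
        <li>221 GLENDALE AVE</li>
        <li>ST CATHARINES, ON</li>
        <li>L2T2K9</li>
      </ul>
      <a class="tel-link" href="tel:905-684-1456">905-684-1456</a>
    </td>
  </tr>
</table>
"""

def li_texts(row):
    """Return the non-empty, stripped text of every <li> in the row's address cell."""
    texts = (li.get_text(strip=True) for li in row.select("td.address ul li"))
    return [t for t in texts if t]

soup = BeautifulSoup(HTML, "html.parser")
rows = soup.select("tr.storeItem.store-result-row")

# Everything except the last <li> is treated as the address; the last is the postal code.
parts = li_texts(rows[0])
address, postal_code = ", ".join(parts[:-1]), parts[-1]
print(address, "|", postal_code)
```

Calling select() on a row searches only inside that row, which is why the same selector can be reused for every store.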
Upvotes: 0
Reputation: 11525
Work out the pattern in the data first, then parse toward it.
The pattern here is that almost every address splits into 6 strings, while the irregular ones split into only 5 because they have no unit line, so we can pad those with 'N/A' to make every record the same shape.
import requests
from bs4 import BeautifulSoup
import pandas as pd
from more_itertools import collapse

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0'
}


def main(url):
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'lxml')
    # One list of stripped strings per store; [1:] skips the first '.address' match
    goal = [list(x.stripped_strings)
            for x in soup.select_one('.storeResultList').select('.address')[1:]]
    allin = []
    for x in goal:
        if len(x) == 5:
            # No unit line: pad so every record has 6 fields
            x.insert(2, 'N/A')
        # Split "City, State" into a nested two-item list
        x[3] = x[3].rsplit(",", 1)
        # Flatten the nested list back into a single 7-field record
        allin.append(list(collapse(x)))
    df = pd.DataFrame(
        allin, columns=["Name", "Address", "Unit", "City", "State", "Zip", "Phone"])
    df.to_csv('data.csv', index=False)


main('https://www.thesource.ca/en-ca/store-finder')
Output:
    Name                  Address                Unit         City           State  Zip      Phone
0   Optimist Square       4725 Dorchester Rd     Unit #B10    NIAGARA FALLS  ON     L2E 0A8  905-356-0772
1   SEAWAY MALL           800 NIAGARA ST N       UNIT #K12    WELLAND        ON     L3C5Z4   905-735-2136
2   PEN CENTRE            221 GLENDALE AVE       N/A          ST CATHARINES  ON     L2T2K9   905-684-1456
3   GRIMSBY SQUARE SC     44 Livingston Ave.     Unit #1006A  GRIMSBY        ON     L3M1L1   905-945-9415
4   J & R SPORTS LTD      151 QUEEN ST           N/A          DUNNVILLE      ON     N1A1H6   905-774-8872
..  ...                   ...                    ...          ...            ...    ...      ...
95  KINGSVILLE MAIN ST    410 MAIN ST E          UNIT #3/4    KINGSVILLE     ON     N9Y 1A7  519-733-4138
96  ST. CLAIR SHORES S/C  25 AMY CROFT DRIVE     UNIT #15     WINDSOR        ON     N9K1C7   519-735-5364
97  TECUMSEH MALL         D2-7650 TECUMSEH RD E  N/A          WINDSOR        ON     N8T1E9   519-974-1421
98  DEVONSHIRE MALL       3100 HOWARD AVE        UNIT #SS5    WINDSOR        ON     N8X3Y8   519-969-2099
99  PLAYIT STAR           105 HENRY STREET WEST  N/A          PRESCOTT       ON     K0E1T0   613-925-0776
[100 rows x 7 columns]
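The padding step can be checked in isolation, without any scraping. The sketch below is a standard-library-only variation, not the code above: normalize is a hypothetical helper, the input lists are assumed to look like what stripped_strings would yield (with city and province sharing one string), and it strips the space left after the comma, which the original rsplit does not.

```python
def normalize(parts):
    """Pad a 5-item record with 'N/A' for the missing unit, then split 'City, Province'."""
    parts = list(parts)  # copy so the caller's list is left untouched
    if len(parts) == 5:
        parts.insert(2, "N/A")
    # Assumes the fourth field always contains a comma; unpacking fails otherwise.
    city, province = parts[3].rsplit(",", 1)
    return parts[:3] + [city.strip(), province.strip()] + parts[4:]

# Assumed input shapes, built from two rows of the output above.
with_unit = ["SEAWAY MALL", "800 NIAGARA ST N", "UNIT #K12", "WELLAND, ON", "L3C5Z4", "905-735-2136"]
without_unit = ["PEN CENTRE", "221 GLENDALE AVE", "ST CATHARINES, ON", "L2T2K9", "905-684-1456"]

for record in (with_unit, without_unit):
    print(normalize(record))
```

Both records come out with 7 fields, so they line up with the column list passed to pd.DataFrame.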
Upvotes: 1
Reputation: 20098
To select different li's, you can use the :nth-of-type(n) CSS selector.
To use a CSS selector, call the select_one() method instead of .find().
Note: I added a user-agent header since the page was stuck on loading. In your example:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.thesource.ca/en-ca/store-finder"
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}

soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
out = {"Address": [], "Postal": [], "Phone": []}
for tag in soup.select(".details"):
    out["Address"].append(tag.select_one("li:nth-of-type(1)").get_text(strip=True))
    out["Postal"].append(
        tag.select_one("li:last-of-type").get_text(strip=True)
    )
    out["Phone"].append(tag.select_one("a.tel-link").get_text(strip=True))

df = pd.DataFrame(out)
print(df.to_string())
Output (truncated):
   Address             Postal   Phone
0  4725 Dorchester Rd  L2E 0A8  905-356-0772
1  800 NIAGARA ST N    L3C5Z4   905-735-2136
2  221 GLENDALE AVE    L2T2K9   905-684-1456
3  44 Livingston Ave.  L3M1L1   905-945-9415
4  151 QUEEN ST        N1A1H6   905-774-8872
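To see how the positional selectors behave, here is a small sketch on an inline snippet. The markup is hypothetical (the real .details element may differ); it only illustrates that li:nth-of-type(1) returns the first line alone, that later lines need their own index, and that select_one() returns None when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; the real page's structure may differ.
HTML = """
<ul class="details">
  <li>800 NIAGARA ST N</li>
  <li>UNIT #K12</li>
  <li>WELLAND, ON</li>
  <li>L3C5Z4</li>
</ul>
"""

details = BeautifulSoup(HTML, "html.parser").select_one(".details")

street = details.select_one("li:nth-of-type(1)").get_text(strip=True)
second = details.select_one("li:nth-of-type(2)").get_text(strip=True)
postal = details.select_one("li:last-of-type").get_text(strip=True)
missing = details.select_one("li:nth-of-type(9)")  # there is no ninth <li>

print(street, second, postal, missing)
```

Since a store without a unit line would have the city in the second position, it is safer to check for None and to inspect the number of li elements before relying on a fixed index.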
Upvotes: 0