Reputation: 137
I am trying to scrape data from https://www.canadapharmacy.com/
Below are a few of the pages that I need to scrape:
https://www.canadapharmacy.com/products/abilify-tablet,
https://www.canadapharmacy.com/products/accolate,
https://www.canadapharmacy.com/products/abilify-mt
I need all the information from each page. I wrote the code below.
Using BeautifulSoup:
base_url = 'https://www.canadapharmacy.com'
data = []
for i in tqdm(range(len(test))):
    r = requests.get(base_url+test[i])
    soup = BeautifulSoup(r.text,'lxml')
    # Scraping medicine Name
    try:
        main_name = (soup.find('h1',{"class":"mn"}).text.lstrip()).rstrip()
    except:
        main_name = None
    try:
        sec_name = ((soup.find('div',{"class":"product-name"}).find('h3').text.lstrip()).rstrip()).replace('\n','')
    except:
        sec_name = None
    try:
        generic_name = (soup.find('div',{"class":"card product generic strength equal"}).find('div').find('h3').text.lstrip()).rstrip()
    except:
        generic_name = None
    # Description
    card = ''.join([x.get_text(' ',strip=True) for x in soup.select('div.answer.expanded')])
    try:
        des = card.split('Directions')[0].replace('Description','')
    except:
        des = None
    try:
        drc = card.split('Directions')[1].split('Ingredients')[0]
    except:
        drc = None
    try:
        ingre= card.split('Directions')[1].split('Ingredients')[1].split('Cautions')[0]
    except:
        ingre = None
    try:
        cau=card.split('Directions')[1].split('Ingredients')[1].split('Cautions')[1].split('Side Effects')[0]
    except:
        cau = None
    try:
        se= card.split('Directions')[1].split('Ingredients')[1].split('Cautions')[1].split('Side Effects')[1]
    except:
        se = None
    for j in soup.find('div',{"class":"answer expanded"}).find_all('h4'):
        if 'Product Code' in j.text:
            prod_code = j.text
    #prod_code = soup.find('div',{"class":"answer expanded"}).find_all('h4')[5].text #//div[@class='answer expanded']//h4
    pharma = {"primary_name":main_name,
              "secondary_name":sec_name,
              "Generic_Name":generic_name,
              'Description':des,
              'Directions':drc,
              'Ingredients':ingre,
              'Cautions':cau,
              'Side Effects':se,
              "Product_Code":prod_code}
    data.append(pharma)
Using Selenium:
main_name = []
sec_name = []
generic_name = []
strength = []
quantity = []
desc = []
direc = []
ingre = []
cau = []
side_effect = []
prod_code = []
for i in tqdm(range(len(test_url))):
    card = []
    driver.get(base_url+test_url[i])
    time.sleep(1)
    try:
        main_name.append(driver.find_element(By.XPATH,"//div[@class='card product brand strength equal']//h3").text)
    except:
        main_name.append(None)
    try:
        sec_name.append(driver.find_element(By.XPATH,"//div[@class='card product generic strength equal']//h3").text)
    except:
        sec_name.append(None)
    try:
        generic_name.append(driver.find_element(By.XPATH,"//div[@class='card product generic strength equal']//h3").text)
    except:
        generic_name.append(None)
    try:
        for i in driver.find_elements(By.XPATH,"//div[@class='product-content']//div[@class='product-select']//form"):
            strength.append(i.text)
    except:
        strength.append(None)
    # try:
    #     for i in driver.find_elements(By.XPATH,"//div[@class='product-select']//form//div[@class='product-select-options'][2]"):
    #         quantity.append(i.text)
    # except:
    #     quantity.append(None)
    card.append(driver.find_element(By.XPATH,"//div[@class='answer expanded']").text)
    try:
        desc.append(card[0].split('Directions')[0].replace('Description',''))
    except:
        desc.append(None)
    try:
        direc.append(card[0].split('Directions')[1].split('Ingredients')[0])
    except:
        direc.append(None)
    try:
        ingre.append(card[0].split('Directions')[1].split('Ingredients')[1].split('Cautions')[0])
    except:
        ingre.append(None)
    try:
        cau.append(card[0].split('Directions')[1].split('Ingredients')[1].split('Cautions')[1].split('Side Effects')[0])
    except:
        cau.append(None)
    try:
        #side_effect.append(card.split('Directions')[1].split('Ingredients')[1].split('Cautions')[1].split('Side Effects')[1])
        side_effect.append(card[0].split('Cautions')[1].split('Side Effects')[1])
    except:
        side_effect.append(None)
    for j in driver.find_elements(By.XPATH,"//div[@class='answer expanded']//h4"):
        if 'Product Code' in j.text:
            prod_code.append(j.text)
I am able to scrape the data from the pages, but I am having trouble scraping the Strength and Quantity boxes. I want to scrape the data for every medicine separately and convert it into a data frame with columns like 2mg, 5mg, 10mg, 30 tablets, 90 tablets, showing the prices.
I tried this code:
medicine_name1 = []
medicine_name2 = []
strength = []
quantity = []
for i in tqdm(range(len(test_url))):
    driver.get(base_url+test_url[i])
    time.sleep(1)
    try:
        name1 = driver.find_element(By.XPATH,"//div[@class='card product brand strength equal']//h3").text
    except:
        name1 = None
    try:
        name2 = driver.find_element(By.XPATH,"//div[@class='card product generic strength equal']//h3").text
    except:
        name2 = None
    try:
        for i in driver.find_elements(By.XPATH,"//div[@class='product-select']//form//div[@class='product-select-options'][1]"):
            strength.append(i.text)
            medicine_name1.append(name1)
            medicine_name2.append(name2)
    except:
        strength.append(None)
    try:
        for i in driver.find_elements(By.XPATH,"//div[@class='product-select']//form//div[@class='product-select-options'][2]"):
            quantity.append(i.text)
    except:
        quantity.append(None)
It runs without errors, but I am getting repeated values for each medicine. Could anyone please check?
Upvotes: 0
Views: 59
Reputation: 4710
Note: it's usually more reliable to build a list of dictionaries (rather than separate lists, as you do in the Selenium version).
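To illustrate that note with a minimal sketch (the `pages` data below is made up for the demonstration, not scraped): when names are appended once per page but strengths once per option, the separate lists stop lining up, whereas one dictionary per row keeps each value attached to its product.

```python
# Hypothetical parsed pages: (product name, strengths found on that page).
pages = [("Product A", ["2mg", "5mg", "10mg"]), ("Product B", ["20mg"])]

# Separate lists: one name per page but one strength per option,
# so index i in one list no longer matches index i in the other.
names, strengths = [], []
for name, page_strengths in pages:
    names.append(name)
    strengths.extend(page_strengths)
print(len(names), len(strengths))  # 2 4

# List of dictionaries: every strength stays attached to its product.
rows = [
    {"product_name": name, "strength": s}
    for name, page_strengths in pages
    for s in page_strengths
]
print(rows[3])  # {'product_name': 'Product B', 'strength': '20mg'}
```

A list like `rows` can be passed straight to `pandas.DataFrame`, which turns the dictionary keys into columns.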
Without a sample or mockup of your desired output, I can't be sure this is the exact format you want, but I'd suggest something like this requests+bs4 solution (run on the 3 links you included as examples):
import requests
from bs4 import BeautifulSoup

rootUrl = 'https://www.canadapharmacy.com'
prodList = ['abilify-tablet', 'accolate', 'abilify-mt']
priceList = []
for prod in prodList:
    prodUrl = f'{rootUrl}/products/{prod}'
    print(f'Scraping {prodUrl} ', end='')
    resp = requests.get(prodUrl)
    if resp.status_code != 200:
        # report the status and skip this product instead of raising
        print(f'{resp.status_code} {resp.reason} - failed to get {prodUrl}')
        continue
    # name the parser explicitly ('html.parser' also works if lxml is not installed)
    pSoup = BeautifulSoup(resp.content, 'lxml')

    # each brand/generic card on the page becomes one dictionary
    pNameSel = 'div.product-name > h3'
    for pv in pSoup.select(f'div > div.card.product:has({pNameSel})'):
        pName = pv.select_one(pNameSel).get_text('\n').strip().split('\n')[0]
        pDet = {'product_endpt': prod, 'product_name': pName.strip()}

        brgen = pv.select_one('div.badge-container > div.badge')
        if brgen:
            pDet['brand_or_generic'] = brgen.get_text(' ').strip()
        rxReq = pv.select_one(f'{pNameSel} p.mn')
        if rxReq:
            pDet['rx_requirement'] = rxReq.get_text(' ').strip()

        # one row per strength; the non-empty <option>s hold quantity and price
        mgSel = 'div.product-select-options'
        opSel = 'option[value]:not([value=""])'
        opSel = f'{mgSel} + {mgSel} select[name="productsizeId"] {opSel}'
        for pvRow in pv.select(f'div.product-select-options-row:has({opSel})'):
            pvrStrength = pvRow.select_one(mgSel).get_text(' ').strip()
            pDet[pvrStrength] = ', '.join([
                pvOp.get_text(' ').strip() for pvOp in pvRow.select(opSel)
            ])

        pDet['source_url'] = prodUrl
        priceList.append(pDet)
    print(f' [total {len(priceList)} product prices]')
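For orientation, here is a rough sketch of the shape of the data that loop builds, with plain Python structures standing in for the parsed HTML. The `cards` mock-up is hypothetical: its values are copied from the Abilify columns of the output table further down, and the option text format ("30 tablets - $219.99") is inferred from how the code splits it, not taken from the live page.

```python
# Hypothetical stand-in for one parsed page: one entry per product card,
# each holding (strength, [option texts]) rows.
cards = [
    {
        "name": "Abilify (Aripiprazole)",
        "badge": "Brand",
        "rows": [
            ("2mg", ["30 tablets - $219.99", "90 tablets - $526.99"]),
            ("5mg", ["28 tablets - $160.99", "84 tablets - $459.99"]),
        ],
    },
    {
        "name": "Generic Equivalent - Abilify (Aripiprazole)",
        "badge": "Generic",
        "rows": [("2mg", ["100 tablets - $98.99"])],
    },
]

price_list = []
for card in cards:
    # One dictionary per card, so each product name is stored exactly once.
    details = {"product_name": card["name"], "brand_or_generic": card["badge"]}
    for strength, options in card["rows"]:
        # The strength becomes the key; its options are joined into one string.
        details[strength] = ", ".join(options)
    price_list.append(details)

for details in price_list:
    print(details)
```

Because the strengths are collected inside the per-card loop, a name is never appended once per strength, which is what produced the repeated values in the question's last snippet.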
Then, to display priceList as a table:
import pandas

pricesDf = pandas.DataFrame(priceList).set_index('product_name')
# False sorts before True, so only 'source_url' moves (to the end)
colOrder = sorted(pricesDf.columns, key=lambda c: c == 'source_url')
pricesDf = pricesDf[colOrder]
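The `sorted` call relies on two standard Python behaviours: `False` sorts before `True`, and `sorted` is stable, so every other column keeps its relative position. A small stand-alone sketch with a made-up column list:

```python
# Made-up column names; 'source_url' sits in the middle, as it can when
# later products add new strength columns after it.
columns = ["product_endpt", "source_url", "brand_or_generic", "2mg"]

# The key is False for every column except 'source_url', so only that
# one is moved to the end; the rest stay in their original order.
col_order = sorted(columns, key=lambda c: c == "source_url")
print(col_order)
# ['product_endpt', 'brand_or_generic', '2mg', 'source_url']
```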
You could also get separate columns for each tablet-count-option, if you remove
            pDet[pvrStrength] = ', '.join([
                pvOp.get_text(' ').strip() for pvOp in pvRow.select(opSel)
            ])
and replace it with this loop:
            for pvoi, pvOp in enumerate(pvRow.select(opSel)):
                pvoTxt = pvOp.get_text(' ').strip()
                tabletCt = pvoTxt.split(' - ')[0]
                pvoPrice = pvoTxt.split(' - ')[-1]
                if not tabletCt.endswith(' tablets'):
                    tabletCt = f'[option {pvoi + 1}]'
                    pvoPrice = pvoTxt
                pDet[f'{pvrStrength} - {tabletCt}'] = pvoPrice
index | Abilify (Aripiprazole) | Generic Equivalent - Abilify (Aripiprazole) | Generic Equivalent - Accolate (Zafirlukast) | Abilify ODT (Aripiprazole) | Generic Equivalent - Abilify ODT (Aripiprazole) |
---|---|---|---|---|---|
product_endpt | abilify-tablet | abilify-tablet | accolate | abilify-mt | abilify-mt |
brand_or_generic | Brand | Generic | Generic | Brand | Generic |
rx_requirement | Prescription Required | NaN | NaN | Prescription Required | NaN |
2mg - 30 tablets | $219.99 | NaN | NaN | NaN | NaN |
2mg - 90 tablets | $526.99 | NaN | NaN | NaN | NaN |
5mg - 28 tablets | $160.99 | NaN | NaN | NaN | NaN |
5mg - 84 tablets | $459.99 | NaN | NaN | NaN | NaN |
10mg - 28 tablets | $116.99 | NaN | NaN | NaN | NaN |
10mg - 84 tablets | $162.99 | NaN | NaN | NaN | NaN |
15mg - 28 tablets | $159.99 | NaN | NaN | NaN | NaN |
15mg - 84 tablets | $198.99 | NaN | NaN | NaN | NaN |
20mg - 90 tablets | $745.99 | $67.99 | NaN | NaN | NaN |
30mg - 28 tablets | $104.99 | NaN | NaN | NaN | NaN |
30mg - 84 tablets | $289.99 | $75.99 | NaN | NaN | NaN |
1mg/ml Solution - [option 1] | 150 ml - $239.99 | NaN | NaN | NaN | NaN |
2mg - 100 tablets | NaN | $98.99 | NaN | NaN | NaN |
5mg - 100 tablets | NaN | $43.99 | NaN | NaN | NaN |
10mg - 90 tablets | NaN | $38.59 | NaN | NaN | NaN |
15mg - 90 tablets | NaN | $56.59 | NaN | NaN | NaN |
10mg - 60 tablets | NaN | NaN | $109.00 | NaN | NaN |
20mg - 60 tablets | NaN | NaN | $109.00 | NaN | NaN |
10mg ODT - 84 tablets | NaN | NaN | NaN | $499.99 | NaN |
15mg ODT - 84 tablets | NaN | NaN | NaN | $499.99 | NaN |
5mg ODT - 90 tablets | NaN | NaN | NaN | NaN | $59.00 |
20mg ODT - 90 tablets | NaN | NaN | NaN | NaN | $89.00 |
30mg ODT - 150 tablets | NaN | NaN | NaN | NaN | $129.99 |
source_url | https://www.canadapharmacy.com/products/abilify-tablet | https://www.canadapharmacy.com/products/abilify-tablet | https://www.canadapharmacy.com/products/accolate | https://www.canadapharmacy.com/products/abilify-mt | https://www.canadapharmacy.com/products/abilify-mt |
(I transposed the table since there were so many columns and so few rows. The table markdown can be copied from the output of print(pricesDf.T.to_markdown()), which needs the tabulate package to be installed.)
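The `[option 1]` row in the table comes from the fallback in the tablet-count loop. As a stand-alone sketch of that splitting logic (`split_option` is a hypothetical helper name, and the two sample strings are reconstructed from rows of the table above):

```python
def split_option(option_text, index):
    """Return (column label, cell value) for one option's text."""
    count = option_text.split(" - ")[0]
    price = option_text.split(" - ")[-1]
    if not count.endswith(" tablets"):
        # Not a tablet count (e.g. a liquid volume): use a positional
        # label and keep the full option text as the cell value.
        count = f"[option {index + 1}]"
        price = option_text
    return count, price

print(split_option("30 tablets - $219.99", 0))  # ('30 tablets', '$219.99')
print(split_option("150 ml - $239.99", 0))      # ('[option 1]', '150 ml - $239.99')
```

Keeping the whole text for non-tablet options means the volume is not lost, at the cost of a less uniform column.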
Upvotes: 1