Reputation: 83
I need to scrape a website that has a table-like paragraph and load it into a pandas DataFrame in Python. This is the website link: 'Website Link'
I need to get the name, price, and description from each product page and put them all into a DataFrame. The problem is that I can scrape each of them individually, but I can't combine them into a proper DataFrame.
Here is what I have done so far:
I get the product links first because I need to scrape multiple pages:
import requests
from bs4 import BeautifulSoup as soup

baseURL = 'https://www.civivi.com'
product_links = []
for x in range(1, 3):
    # HEADER is a dict of request headers defined elsewhere; pass it by keyword,
    # otherwise requests treats the second positional argument as query params
    HTML = requests.get(f'https://www.civivi.com/collections/all-products/price-range_-70?page={x}', headers=HEADER)
    # HTML.status_code
    Booti = soup(HTML.content, "lxml")
    knife_items = Booti.find_all('div', class_="product-list product-list--collection product-list--with-sidebar")
    for items in knife_items:
        for links in items.find_all('a', attrs={'class': 'product-item__image-wrapper product-item__image-wrapper--with-secondary'}, href=True):
            product_links.append(baseURL + links['href'])
And then I scrape the individual web pages here:
Name = []
Price = []
Specific = []
for links in product_links:
    # testlinks = "https://www.civivi.com/collections/all-products/products/civivi-rustic-gent-lockback-knife-c914-g10-d2"
    HTML2 = requests.get(links, headers=HEADER)
    Booti2 = soup(HTML2.content, "html.parser")
    try:
        for N in Booti2.find_all('h1', {'class': "product-meta__title heading h1"}):
            Name.append(N.text.replace('\n', '').strip())
        for P in Booti2.find_all('span', {'class': "price"}):
            Price.append(P.text.replace('\n', '').strip())
        Contents = Booti2.find('div', class_="rte text--pull")
        for S in Contents.find_all('span'):
            Specific.append(S.text)
    except AttributeError:
        # Contents is None when a page has no description div
        continue
So I need to get all the information in this format:
| Name           | Price | Model Number | Model Name | Overall Length |
|----------------|-------|--------------|------------|----------------|
| Product Name 1 | $$    | XXXX         | ABC        | XX"/XXcm       |
| Product Name 2 | $$    | XXXX         | ABC        | XX"/XXcm       |
| Product Name 3 | $$    | XXXX         | ABC        | XX"/XXcm       |
| Product Name 4 | $$    | XXXX         | ABC        | XX"/XXcm       |
...and so on with the rest of the columns from the web pages. Any help would be appreciated! Thank you so much!
Reputation: 3989
One option is to call find('p') on the div with class "rte text--pull" and then call get_text with a separator as argument ('\n'). Then use the following regular expressions (or split the text variable, find the keyword, and remove it from the string) to get only the desired information. With the list rows in place, you can create the DataFrame with pd.DataFrame(rows).
import re  # regular expressions to extract the knife model and length

rows = []  # list to hold the DataFrame rows
for links in product_links:
    HTML2 = requests.get(links)
    Booti2 = soup(HTML2.content, "html.parser")
    try:
        name = Booti2.find('h1', {'class': "product-meta__title heading h1"}).get_text()
        price = Booti2.find('span', {'class': "price"}).get_text()
        Contents = Booti2.find('div', class_="rte text--pull")
        text = Contents.find('p').get_text(separator='\n')
        model_num = re.search('Model Number: (.+?)\n', text).group(1)
        model_name = re.search('Model Name: (.+?)\n', text).group(1)
        overall_len = re.search('Overall Length: (.+?)\n', text).group(1)
        rows.append([name, price, model_num, model_name, overall_len])
    except AttributeError:
        # find() or re.search() returned None: skip pages missing a field
        continue
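The split-based alternative mentioned above could look roughly like this. It is only a minimal sketch, assuming every specification line has the form "Key: Value"; the helper name parse_specs and the sample text are made up for illustration and are not taken from the site.

```python
def parse_specs(text):
    """Turn 'Key: Value' lines into a dict; lines without a value are skipped."""
    specs = {}
    for line in text.split('\n'):
        # partition splits on the first colon only
        key, sep, value = line.partition(':')
        if sep and value.strip():
            specs[key.strip()] = value.strip()
    return specs


# Hypothetical sample, shaped like the result of get_text(separator='\n')
sample_text = 'Model Number: C20076-1\nModel Name: Altus\nOverall Length: 7.12" / 180.8mm'
specs = parse_specs(sample_text)
print(specs['Model Number'], specs['Model Name'], specs['Overall Length'])
```

Unlike the patterns ending in '\n', this also picks up a field that sits on the last line of the text, and it collects every "Key: Value" pair rather than only three named ones.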
In case you haven't already, import pandas as pd.
import pandas as pd
df = pd.DataFrame(rows, columns=['name', 'price', 'model_num', 'model_name', 'overall_len'])
print(df)
    name                           price    model_num   model_name           overall_len
0   CIVIVI Altus Button Lock a...  $85      C20076-1    Altus                7.12" / 180.8mm
1   CIVIVI Altus Button Lock a...  $90      C20076-3    Altus                7.12" / 180.8mm
2   CIVIVI Altus Button Lock a...  $107     C20076-DS1  Altus                7.12" / 180.8mm
3   CIVIVI Teton Tickler Fixed...  $258.50  C20072-1    Teton Tickler        10.16" / 258mm
4   CIVIVI Nox Flipper Knife G...  $76.50   C2110C      NOx                  6.80" / 172.7mm
...
40  CIVIVI Ortis Flipper Knife...  $105     C2013DS-1   Ortis                7.48" / 190mm
41  CIVIVI Dogma Flipper Knife...  $79.50   C2014A      Dogma                7.7" / 195.7mm
42  CIVIVI Dogma Flipper Knife...  $79.50   C2014B      Dogma                7.7" / 195.7mm
43  CIVIVI Appalachian Drifter...  $98      C2015A      Appalachian Drifter  6.8" / 172.7mm
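To pick up the rest of the columns without writing one regular expression per field, one possibility is to build each row as a dict and let pandas line up the keys. The sketch below assumes each product page yields a dict of field names to values; the records are placeholders in the style of the question's table, not scraped data.

```python
import pandas as pd

# Placeholder records; in the scraping loop, one dict would be built per product page
records = [
    {'name': 'Product Name 1', 'price': '$$', 'Model Number': 'XXXX', 'Model Name': 'ABC'},
    {'name': 'Product Name 2', 'price': '$$', 'Model Number': 'YYYY', 'Overall Length': 'XX"/XXcm'},
]

# Dict keys become columns; a key missing from a record is a missing value in that row
df = pd.DataFrame(records)
print(df)
```

With this layout a page that lacks one field still produces a row, and the gap shows up as a missing value instead of the whole product being skipped.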