Hamza Ahmed

Reputation: 83

Web scraping non-table HTML content into a pandas DataFrame

I need to scrape a website that has a table-like paragraph, and I want to put its contents into a pandas DataFrame in Python. This is the website link: Website Link

I need to get the name, price and description from each product page and put them all into a DataFrame. The problem is that I can scrape each of them individually, but I can't combine them into a proper DataFrame.

Here is what I have done so far:

I get the product links first because I need to scrape multiple pages:
import requests
from bs4 import BeautifulSoup as soup  # assumed alias, matching the soup(...) calls below

baseURL = 'https://www.civivi.com'
product_links = []
# HEADER is assumed to be a dict of request headers defined elsewhere
for x in range(1, 3):
    # pass the headers by keyword: the second positional argument of requests.get is params
    HTML = requests.get(f'https://www.civivi.com/collections/all-products/price-range_-70?page={x}', headers=HEADER)
    #HTML.status_code
    Booti = soup(HTML.content, "lxml")
    knife_items = Booti.find_all('div', class_="product-list product-list--collection product-list--with-sidebar")

    for items in knife_items:
        for links in items.find_all('a', attrs={'class': 'product-item__image-wrapper product-item__image-wrapper--with-secondary'}, href=True):
            product_links.append(baseURL + links['href'])

And then I scrape the individual web pages here:

Name = []
Price = []
Specific = []
for links in product_links:
    #testlinks = "https://www.civivi.com/collections/all-products/products/civivi-rustic-gent-lockback-knife-c914-g10-d2"
    # pass the headers by keyword: the second positional argument of requests.get is params
    HTML2 = requests.get(links, headers=HEADER)
    Booti2 = soup(HTML2.content, "html.parser")
    try:
        for N in Booti2.find_all('h1', {'class': "product-meta__title heading h1"}):
            Name.append(N.text.replace('\n', '').strip())
        for P in Booti2.find_all('span', {'class': "price"}):
            Price.append(P.text.replace('\n', '').strip())
        Contents = Booti2.find('div', class_="rte text--pull")
        for S in Contents.find_all('span'):
            Specific.append(S.text)

    except AttributeError:
        # Contents is None when the page has no description block; skip that page
        continue

So I need to get all the information in this format:

| Name           | Price | Model Number | Model Name | Overall Length |
|----------------|-------|--------------|------------|----------------|
| Product Name 1 | $$    | XXXX         | ABC        | XX"/XXcm       |
| Product Name 2 | $$    | XXXX         | ABC        | XX"/XXcm       |
| Product Name 3 | $$    | XXXX         | ABC        | XX"/XXcm       |
| Product Name 4 | $$    | XXXX         | ABC        | XX"/XXcm       |

...and so on with the rest of the columns from the web pages. Any help would be appreciated! Thank you so much!

Upvotes: 1

Views: 490

Answers (1)

n1colas.m

Reputation: 3989

One option is to call find('p') on the div with class "rte text--pull" and then call get_text with a newline ('\n') as the separator argument. You can then use the regular expressions below (or split the text variable, find each keyword and remove it from the string) to keep only the desired information. Each product is appended to the list rows as one list of values, so once the loop finishes you can create the DataFrame with pd.DataFrame(rows).

import re  # regular expressions to extract the model number, model name and length
import requests
from bs4 import BeautifulSoup as soup  # assumed alias, matching the question's soup(...) calls

rows = []  # one list per product; each becomes a DataFrame row

for links in product_links:
    HTML2 = requests.get(links)
    Booti2 = soup(HTML2.content, "html.parser")
    try:
        name = Booti2.find('h1', {'class': "product-meta__title heading h1"}).get_text()
        price = Booti2.find('span', {'class': "price"}).get_text()
        Contents = Booti2.find('div', class_="rte text--pull")
        # join the paragraph's text nodes with newlines so each spec ends up on its own line
        text = Contents.find('p').get_text(separator='\n')
        model_num = re.search('Model Number: (.+?)\n', text).group(1)
        model_name = re.search('Model Name: (.+?)\n', text).group(1)
        overall_len = re.search('Overall Length: (.+?)\n', text).group(1)
        rows.append([name, price, model_num, model_name, overall_len])
    except AttributeError:
        # find() or re.search() returned None because an expected field is missing; skip the page
        continue

If you haven't done so already, import pandas as pd.

import pandas as pd
df = pd.DataFrame(rows, columns=['name', 'price', 'model_num', 'model_name', 'overall_len'])
print(df)
                            name    price    model_num              model_name      overall_len
0   CIVIVI Altus Button Lock a...      $85     C20076-1                   Altus  7.12" / 180.8mm
1   CIVIVI Altus Button Lock a...      $90     C20076-3                   Altus  7.12" / 180.8mm
2   CIVIVI Altus Button Lock a...     $107   C20076-DS1                   Altus  7.12" / 180.8mm
3   CIVIVI Teton Tickler Fixed...  $258.50     C20072-1           Teton Tickler   10.16" / 258mm
4   CIVIVI Nox Flipper Knife G...   $76.50       C2110C                     NOx  6.80" / 172.7mm
...
...
40  CIVIVI Ortis Flipper Knife...     $105    C2013DS-1                   Ortis    7.48" / 190mm
41  CIVIVI Dogma Flipper Knife...   $79.50       C2014A                   Dogma   7.7" / 195.7mm
42  CIVIVI Dogma Flipper Knife...   $79.50       C2014B                   Dogma   7.7" / 195.7mm
43  CIVIVI Appalachian Drifter...      $98       C2015A     Appalachian Drifter   6.8" / 172.7mm
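
To pick up the rest of the columns without writing one regular expression per field, a possible variation is to parse every "Label: value" line of the text into a dict. This is only a sketch: parse_specs and SPEC_LINE are hypothetical names, the sample string is made up to resemble what get_text(separator='\n') might return, and it assumes each spec sits on its own line in "Label: value" form.

```python
import re

# Matches one "Label: value" line; the label is everything before the first colon.
SPEC_LINE = re.compile(r'^[ \t]*([^:\n]+?)[ \t]*:[ \t]*(.+?)[ \t]*$', re.MULTILINE)

def parse_specs(text):
    """Return a dict mapping each label in text to its value (hypothetical helper)."""
    return {label: value for label, value in SPEC_LINE.findall(text)}

# Made-up sample standing in for Contents.find('p').get_text(separator='\n')
sample = 'Model Number: C20076-1\nModel Name: Altus\nOverall Length: 7.12" / 180.8mm'
specs = parse_specs(sample)

# One dict per product: fixed fields first, then every spec found on the page
row = {'name': 'CIVIVI Altus (sample)', 'price': '$85', **specs}
print(row)
```

If rows holds one such dict per product, pd.DataFrame(rows) uses the dict keys as column names and leaves NaN where a page lacks a given spec.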

Upvotes: 1
