Parse HTML table data with BeautifulSoup into a dict

Question

I am trying to use BeautifulSoup to parse the information stored in an HTML table and store it into a dict. I've been able to get to the table, and iterate through the values, but there is still a lot of junk in the table that I'm not sure how to take care of.

import requests
from bs4 import BeautifulSoup

# load the HTML file
r = requests.get("http://www.ebay.com/itm/222378225962")
soup = BeautifulSoup(r.content, "html.parser")

# navigate to the item attributes table
table = soup.find('div', 'itemAttr')

# iterate through the attribute information
attr = []
for i in table.find_all("tr"):
    attr.append(i.text.strip().replace('\t', ''))

With this method, this is what the data looks like. As you can see, there is a lot of junk in there, and some lines contain multiple items, such as Year and VIN.

[u'Condition:
Used',
 u'Seller Notes:
\u201cExcellent Condition\u201d',
 u'Year: 

2015

 VIN (Vehicle Identification Number): 

2G1FJ1EW2F9192023',
 u'Mileage: 

29,000

 Transmission: 

Manual',
 u'Make: 

Chevrolet

 Body Type: 

Coupe',
 u'Model: 

Camaro

 Warranty: 

Vehicle has an existing warranty',
 u'Trim: 

SS Coupe 2-Door

 Vehicle Title: 

Clear',
 u'Engine: 

6.2L 6162CC 376Cu. In. V8 GAS OHV Naturally Aspirated

 Options: 

Leather Seats',
 u'Drive Type: 

RWD

 Safety Features: 

Anti-Lock Brakes, Driver Airbag, Passenger Airbag, Side Airbags',
 u'Power Options: 

Air Conditioning, Cruise Control, Power Locks, Power Windows, Power Seats

 Sub Model: 

1LE',
 u'Fuel Type: 

Gasoline

 Color: 

White',
 u'For Sale By: 

Private Seller

 Interior Color: 

Black',
 u'Disability Equipped: 

No

 Number of Cylinders: 

8',
 u'']

Ultimately, I want the data to be stored in a dictionary like below. I know how to create a dictionary, but don't know how to clean up the data that needs to go into the dictionary without brute force find-and-replace.

{'Condition' : 'Used',
 'Seller Notes' : 'Excellent Condition',
 'Year': '2015',
 'VIN (Vehicle Identification Number)': '2G1FJ1EW2F9192023',
 'Mileage': '29,000', 
 'Transmission': 'Manual',
 'Make': 'Chevrolet', 
 'Body Type': 'Coupe',
 'Model': 'Camaro', 
 'Warranty': 'Vehicle has an existing warranty',
 'Trim': 'SS Coupe 2-Door',
 'Vehicle Title' : 'Clear',
 'Engine': '6.2L 6162CC 376Cu. In. V8 GAS OHV Naturally Aspirated', 
 'Options': 'Leather Seats',
 'Drive Type': 'RWD', 
 'Safety Features' : 'Anti-Lock Brakes, Driver Airbag, Passenger Airbag, Side Airbags',
 'Power Options' : 'Air Conditioning, Cruise Control, Power Locks, Power Windows, Power Seats',
 'Sub Model' : '1LE',
 'Fuel Type' : 'Gasoline', 
 'Exterior Color' : 'White',
 'For Sale By' : 'Private Seller', 
 'Interior Color' : 'Black',
 'Disability Equipped' : 'No', 
 'Number of Cylinders': '8'}

Josh Crozier · Accepted Answer

Rather than trying to parse out the data from the tr elements, a better approach would be to iterate over the td.attrLabels data elements. You can use these labels as the key, and then use the adjacent sibling elements as the value.

In the example below, the CSS selector div.itemAttr td.attrLabels selects every td element with the class attrLabels that is a descendant of div.itemAttr. From there, the method .find_next_sibling() finds the element that immediately follows each label, which holds the value.

import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.ebay.com/itm/222378225962")
soup = BeautifulSoup(r.content, 'lxml')  # the 'lxml' parser requires the lxml package

data = []
for label in soup.select('div.itemAttr td.attrLabels'):
    data.append({ label.text.strip(): label.find_next_sibling().text.strip() })

Output:

> [{'Year:': '2015'}, {'VIN (Vehicle Identification Number):': '2G1FJ1EW2F9192023'}, {'Mileage:': '29,000'}, {'Transmission:': 'Manual'}, {'Make:': 'Chevrolet'}, {'Body Type:': 'Coupe'}, {'Model:': 'Camaro'}, {'Warranty:': 'Vehicle has an existing warranty'}, {'Trim:': 'SS Coupe 2-Door'}, {'Vehicle Title:': 'Clear'}, {'Engine:': '6.2L 6162CC 376Cu. In. V8 GAS OHV Naturally Aspirated'}, {'Options:': 'Leather Seats'}, {'Drive Type:': 'RWD'}, {'Safety Features:': 'Anti-Lock Brakes, Driver Airbag, Passenger Airbag, Side Airbags'}, {'Power Options:': 'Air Conditioning, Cruise Control, Power Locks, Power Windows, Power Seats'}, {'Sub Model:': '1LE'}, {'Fuel Type:': 'Gasoline'}, {'Exterior Color:': 'White'}, {'For Sale By:': 'Private Seller'}, {'Interior Color:': 'Black'}, {'Disability Equipped:': 'No'}, {'Number of Cylinders:': '8'}]
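The same selector approach can be tried offline against a small hand-written snippet, which may help if the live page is unavailable or has changed. The sketch below is only an illustration: the SAMPLE_HTML markup is a hypothetical imitation of the layout described above (it is not copied from eBay), parse_attributes is a made-up helper name, and the built-in "html.parser" is used so that lxml is not required. It also collects everything into one dict rather than a list of one-entry dicts, and skips any label that has no following td.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the assumed layout: each label cell has the
# class "attrLabels" and is immediately followed by the cell holding its value.
SAMPLE_HTML = """
<div class="itemAttr">
  <table>
    <tr>
      <td class="attrLabels">Year:</td><td><span>2015</span></td>
      <td class="attrLabels">Mileage:</td><td><span>29,000</span></td>
    </tr>
    <tr>
      <td class="attrLabels">Make:</td><td><span>Chevrolet</span></td>
      <td class="attrLabels">Body Type:</td><td><span>Coupe</span></td>
    </tr>
  </table>
</div>
"""


def parse_attributes(html):
    """Return a single dict mapping each label cell's text to its value text."""
    soup = BeautifulSoup(html, "html.parser")
    attributes = {}
    for label in soup.select("div.itemAttr td.attrLabels"):
        # The value is assumed to live in the next <td> sibling of the label.
        value_cell = label.find_next_sibling("td")
        if value_cell is None:
            continue  # label without a value cell: skip instead of raising
        attributes[label.get_text(strip=True)] = value_cell.get_text(strip=True)
    return attributes


attributes = parse_attributes(SAMPLE_HTML)
print(attributes)
```

The keys still carry the trailing colon here, exactly as in the output above; cleaning them is a separate step.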

If you also want to retrieve the table header th elements, you could select the table element and then use the CSS selector th, td.attrLabels to retrieve both kinds of label:

r = requests.get("http://www.ebay.com/itm/222378225962")
soup = BeautifulSoup(r.content, 'lxml')
table = soup.find('div', 'itemAttr')

data = []
for label in table.select('th, td.attrLabels'):
    data.append({ label.text.strip(): label.find_next_sibling().text.strip() })

Output:

> [{'Condition:': 'Used'}, {'Seller Notes:': '“Excellent Condition”'}, {'Year:': '2015'}, {'VIN (Vehicle Identification Number):': '2G1FJ1EW2F9192023'}, {'Mileage:': '29,000'}, {'Transmission:': 'Manual'}, {'Make:': 'Chevrolet'}, {'Body Type:': 'Coupe'}, {'Model:': 'Camaro'}, {'Warranty:': 'Vehicle has an existing warranty'}, {'Trim:': 'SS Coupe 2-Door'}, {'Vehicle Title:': 'Clear'}, {'Engine:': '6.2L 6162CC 376Cu. In. V8 GAS OHV Naturally Aspirated'}, {'Options:': 'Leather Seats'}, {'Drive Type:': 'RWD'}, {'Safety Features:': 'Anti-Lock Brakes, Driver Airbag, Passenger Airbag, Side Airbags'}, {'Power Options:': 'Air Conditioning, Cruise Control, Power Locks, Power Windows, Power Seats'}, {'Sub Model:': '1LE'}, {'Fuel Type:': 'Gasoline'}, {'Exterior Color:': 'White'}, {'For Sale By:': 'Private Seller'}, {'Interior Color:': 'Black'}, {'Disability Equipped:': 'No'}, {'Number of Cylinders:': '8'}]

If you also want to strip non-alphanumeric characters (including the trailing colon and any spaces) from the keys, you could use:

import re

r = requests.get("http://www.ebay.com/itm/222378225962")
soup = BeautifulSoup(r.content, 'lxml')
table = soup.find('div', 'itemAttr')

data = []
for label in table.select('th, td.attrLabels'):
    key = re.sub(r'\W+', '', label.text.strip())
    value = label.find_next_sibling().text.strip()

    data.append({ key: value })

Output:

> [{'Condition': 'Used'}, {'SellerNotes': '“Excellent Condition”'}, {'Year': '2015'}, {'VINVehicleIdentificationNumber': '2G1FJ1EW2F9192023'}, {'Mileage': '29,000'}, {'Transmission': 'Manual'}, {'Make': 'Chevrolet'}, {'BodyType': 'Coupe'}, {'Model': 'Camaro'}, {'Warranty': 'Vehicle has an existing warranty'}, {'Trim': 'SS Coupe 2-Door'}, {'VehicleTitle': 'Clear'}, {'Engine': '6.2L 6162CC 376Cu. In. V8 GAS OHV Naturally Aspirated'}, {'Options': 'Leather Seats'}, {'DriveType': 'RWD'}, {'SafetyFeatures': 'Anti-Lock Brakes, Driver Airbag, Passenger Airbag, Side Airbags'}, {'PowerOptions': 'Air Conditioning, Cruise Control, Power Locks, Power Windows, Power Seats'}, {'SubModel': '1LE'}, {'FuelType': 'Gasoline'}, {'ExteriorColor': 'White'}, {'ForSaleBy': 'Private Seller'}, {'InteriorColor': 'Black'}, {'DisabilityEquipped': 'No'}, {'NumberofCylinders': '8'}]
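Because \W+ also removes the spaces inside a label, keys such as 'Seller Notes' become 'SellerNotes'. If the goal is the single dict shown in the question, with readable keys, one possible alternative is to strip only the trailing colon from each key and the surrounding quotation marks from each value, then merge everything into one dict. The sketch below assumes rows in the shape produced by the th, td.attrLabels loop (a list of one-entry dicts); the helper names clean_key, clean_value, and merge_rows are made up for illustration.

```python
# A few entries in the shape produced by the `th, td.attrLabels` loop:
# a list of one-entry dicts whose keys end with a colon.
rows = [
    {'Condition:': 'Used'},
    {'Seller Notes:': '“Excellent Condition”'},
    {'Year:': '2015'},
    {'VIN (Vehicle Identification Number):': '2G1FJ1EW2F9192023'},
]


def clean_key(label):
    # Drop surrounding whitespace and the trailing colon; keep inner spaces.
    return label.strip().rstrip(':').strip()


def clean_value(text):
    # Drop surrounding whitespace and curly or straight double quotes.
    return text.strip().strip('“”"').strip()


def merge_rows(rows):
    """Merge a list of one-entry dicts into a single dict with cleaned keys."""
    merged = {}
    for row in rows:
        for label, text in row.items():
            merged[clean_key(label)] = clean_value(text)
    return merged


attributes = merge_rows(rows)
print(attributes)
```

If two labels on a page ever clean to the same key, the later one overwrites the earlier one in this sketch, so duplicate labels would need separate handling.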
