CurtLH

Reputation: 2417

Parse HTML table data with BeautifulSoup into a dict

I am trying to use BeautifulSoup to parse the information in an HTML table and store it in a dict. I can locate the table and iterate through its values, but the extracted text still contains a lot of junk that I'm not sure how to handle.

import requests
from bs4 import BeautifulSoup

# load the HTML file
r = requests.get("http://www.ebay.com/itm/222378225962")
soup = BeautifulSoup(r.content, "html.parser")

# navigate to the item attributes table
table = soup.find('div', 'itemAttr')

# iterate through the attribute information
attr = []
for row in table.find_all("tr"):
    attr.append(row.text.strip().replace('\t', ''))

With this method, the data looks like the list below. As you can see, there is a lot of junk in there, and some lines contain multiple items, such as Year and VIN.

[u'Condition:\nUsed',
 u'Seller Notes:\n\u201cExcellent Condition\u201d',
 u'Year: \n\n2015\n\n VIN (Vehicle Identification Number): \n\n2G1FJ1EW2F9192023',
 u'Mileage: \n\n29,000\n\n Transmission: \n\nManual',
 u'Make: \n\nChevrolet\n\n Body Type: \n\nCoupe',
 u'Model: \n\nCamaro\n\n Warranty: \n\nVehicle has an existing warranty',
 u'Trim: \n\nSS Coupe 2-Door\n\n Vehicle Title: \n\nClear',
 u'Engine: \n\n6.2L 6162CC 376Cu. In. V8 GAS OHV Naturally Aspirated\n\n Options: \n\nLeather Seats',
 u'Drive Type: \n\nRWD\n\n Safety Features: \n\nAnti-Lock Brakes, Driver Airbag, Passenger Airbag, Side Airbags',
 u'Power Options: \n\nAir Conditioning, Cruise Control, Power Locks, Power Windows, Power Seats\n\n Sub Model: \n\n1LE',
 u'Fuel Type: \n\nGasoline\n\n Color: \n\nWhite',
 u'For Sale By: \n\nPrivate Seller\n\n Interior Color: \n\nBlack',
 u'Disability Equipped: \n\nNo\n\n Number of Cylinders: \n\n8',
 u'']

Ultimately, I want the data stored in a dictionary like the one below. I know how to create a dictionary, but I don't know how to clean up the data that goes into it without resorting to brute-force find-and-replace.

{'Condition' : 'Used',
 'Seller Notes' : 'Excellent Condition',
 'Year': '2015',
 'VIN (Vehicle Identification Number)': '2G1FJ1EW2F9192023',
 'Mileage': '29,000', 
 'Transmission': 'Manual',
 'Make': 'Chevrolet', 
 'Body Type': 'Coupe',
 'Model': 'Camaro', 
 'Warranty': 'Vehicle has an existing warranty',
 'Trim': 'SS Coupe 2-Door',
 'Vehicle Title' : 'Clear',
 'Engine': '6.2L 6162CC 376Cu. In. V8 GAS OHV Naturally Aspirated', 
 'Options': 'Leather Seats',
 'Drive Type': 'RWD', 
 'Safety Features' : 'Anti-Lock Brakes, Driver Airbag, Passenger Airbag, Side Airbags',
 'Power Options' : 'Air Conditioning, Cruise Control, Power Locks, Power Windows, Power Seats',
 'Sub Model' : '1LE',
 'Fuel Type' : 'Gasoline', 
 'Exterior Color' : 'White',
 'For Sale By' : 'Private Seller', 
 'Interior Color' : 'Black',
 'Disability Equipped' : 'No', 
 'Number of Cylinders': '8'}

Upvotes: 1

Views: 5032

Answers (1)

Josh Crozier

Reputation: 241248

Rather than parsing the text of each tr element, a better approach is to iterate over the td.attrLabels cells. You can use each label as the key, and its adjacent sibling element as the value.

In the example below, the CSS selector div.itemAttr td.attrLabels selects all td elements with the attrLabels class that are descendants of div.itemAttr. From there, the .find_next_sibling() method finds the adjacent sibling element, which holds the value.

import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.ebay.com/itm/222378225962")
soup = BeautifulSoup(r.content, 'lxml')

data = []
for label in soup.select('div.itemAttr td.attrLabels'):
    data.append({ label.text.strip(): label.find_next_sibling().text.strip() })

Output:

> [{'Year:': '2015'}, {'VIN (Vehicle Identification Number):': '2G1FJ1EW2F9192023'}, {'Mileage:': '29,000'}, {'Transmission:': 'Manual'}, {'Make:': 'Chevrolet'}, {'Body Type:': 'Coupe'}, {'Model:': 'Camaro'}, {'Warranty:': 'Vehicle has an existing warranty'}, {'Trim:': 'SS Coupe 2-Door'}, {'Vehicle Title:': 'Clear'}, {'Engine:': '6.2L 6162CC 376Cu. In. V8 GAS OHV Naturally Aspirated'}, {'Options:': 'Leather Seats'}, {'Drive Type:': 'RWD'}, {'Safety Features:': 'Anti-Lock Brakes, Driver Airbag, Passenger Airbag, Side Airbags'}, {'Power Options:': 'Air Conditioning, Cruise Control, Power Locks, Power Windows, Power Seats'}, {'Sub Model:': '1LE'}, {'Fuel Type:': 'Gasoline'}, {'Exterior Color:': 'White'}, {'For Sale By:': 'Private Seller'}, {'Interior Color:': 'Black'}, {'Disability Equipped:': 'No'}, {'Number of Cylinders:': '8'}]

If you also want to retrieve the table header th elements, you could select the table element and then use the CSS selector th, td.attrLabels to retrieve both kinds of label:

r = requests.get("http://www.ebay.com/itm/222378225962")
soup = BeautifulSoup(r.content, 'lxml')
table = soup.find('div', 'itemAttr')

data = []
for label in table.select('th, td.attrLabels'):
    data.append({ label.text.strip(): label.find_next_sibling().text.strip() })

Output:

> [{'Condition:': 'Used'}, {'Seller Notes:': '“Excellent Condition”'}, {'Year:': '2015'}, {'VIN (Vehicle Identification Number):': '2G1FJ1EW2F9192023'}, {'Mileage:': '29,000'}, {'Transmission:': 'Manual'}, {'Make:': 'Chevrolet'}, {'Body Type:': 'Coupe'}, {'Model:': 'Camaro'}, {'Warranty:': 'Vehicle has an existing warranty'}, {'Trim:': 'SS Coupe 2-Door'}, {'Vehicle Title:': 'Clear'}, {'Engine:': '6.2L 6162CC 376Cu. In. V8 GAS OHV Naturally Aspirated'}, {'Options:': 'Leather Seats'}, {'Drive Type:': 'RWD'}, {'Safety Features:': 'Anti-Lock Brakes, Driver Airbag, Passenger Airbag, Side Airbags'}, {'Power Options:': 'Air Conditioning, Cruise Control, Power Locks, Power Windows, Power Seats'}, {'Sub Model:': '1LE'}, {'Fuel Type:': 'Gasoline'}, {'Exterior Color:': 'White'}, {'For Sale By:': 'Private Seller'}, {'Interior Color:': 'Black'}, {'Disability Equipped:': 'No'}, {'Number of Cylinders:': '8'}]
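The question asks for a single dictionary rather than a list of one-entry dictionaries. Here is a minimal sketch of that variant, run against a small inline HTML sample so it works offline. The SAMPLE_HTML markup and the parse_attributes name are assumptions made for illustration; the real page structure may differ. It also skips any label that has no following td, which the loops above would fail on.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the item page; the real markup may differ.
SAMPLE_HTML = """
<div class="itemAttr">
  <table>
    <tr>
      <th>Condition:</th>
      <td>Used</td>
    </tr>
    <tr>
      <td class="attrLabels">Year:</td>
      <td><span>2015</span></td>
      <td class="attrLabels">VIN (Vehicle Identification Number):</td>
      <td><span>2G1FJ1EW2F9192023</span></td>
    </tr>
  </table>
</div>
"""


def parse_attributes(html):
    """Collect label/value pairs from the attribute table into one dict."""
    soup = BeautifulSoup(html, "html.parser")
    attributes = {}
    for label in soup.select("div.itemAttr th, div.itemAttr td.attrLabels"):
        # Only accept a td as the value cell; skip labels without one.
        value_cell = label.find_next_sibling("td")
        if value_cell is None:
            continue
        attributes[label.get_text(strip=True)] = value_cell.get_text(strip=True)
    return attributes


print(parse_attributes(SAMPLE_HTML))
```

Because later labels overwrite earlier ones with the same text, this assumes every label on the page is unique.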

If you want to strip non-word characters from the keys (note that \W+ removes spaces and parentheses as well as the trailing colon), you could use:

import re

import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.ebay.com/itm/222378225962")
soup = BeautifulSoup(r.content, 'lxml')
table = soup.find('div', 'itemAttr')

data = []
for label in table.select('th, td.attrLabels'):
    key = re.sub(r'\W+', '', label.text.strip())
    value = label.find_next_sibling().text.strip()

    data.append({ key: value })

Output:

> [{'Condition': 'Used'}, {'SellerNotes': '“Excellent Condition”'}, {'Year': '2015'}, {'VINVehicleIdentificationNumber': '2G1FJ1EW2F9192023'}, {'Mileage': '29,000'}, {'Transmission': 'Manual'}, {'Make': 'Chevrolet'}, {'BodyType': 'Coupe'}, {'Model': 'Camaro'}, {'Warranty': 'Vehicle has an existing warranty'}, {'Trim': 'SS Coupe 2-Door'}, {'VehicleTitle': 'Clear'}, {'Engine': '6.2L 6162CC 376Cu. In. V8 GAS OHV Naturally Aspirated'}, {'Options': 'Leather Seats'}, {'DriveType': 'RWD'}, {'SafetyFeatures': 'Anti-Lock Brakes, Driver Airbag, Passenger Airbag, Side Airbags'}, {'PowerOptions': 'Air Conditioning, Cruise Control, Power Locks, Power Windows, Power Seats'}, {'SubModel': '1LE'}, {'FuelType': 'Gasoline'}, {'ExteriorColor': 'White'}, {'ForSaleBy': 'Private Seller'}, {'InteriorColor': 'Black'}, {'DisabilityEquipped': 'No'}, {'NumberofCylinders': '8'}]
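If you would rather keep the spaces and parentheses in the keys, as in the dictionary shown in the question, one option is to remove only the trailing colon. The sketch below takes a list of single-entry dictionaries like the ones printed above and merges it into one dict. The helper names clean_key and merge_rows are hypothetical, and stripping the curly quotes from values is an assumption based on the desired output in the question.

```python
import re


def clean_key(label):
    """Collapse whitespace runs and drop the trailing colon, keeping inner spaces."""
    return re.sub(r"\s+", " ", label).strip().rstrip(":").strip()


def merge_rows(rows):
    """Merge a list of single-entry dicts into one dict with cleaned keys."""
    merged = {}
    for row in rows:
        for label, value in row.items():
            # Assumption: surrounding curly quotes and spaces are unwanted in values.
            merged[clean_key(label)] = value.strip("\u201c\u201d ")
    return merged


rows = [
    {"Condition:": "Used"},
    {"Seller Notes:": "\u201cExcellent Condition\u201d"},
    {"VIN (Vehicle Identification Number):": "2G1FJ1EW2F9192023"},
]
print(merge_rows(rows))
```

Using str.rstrip(":") instead of a \W+ substitution leaves keys such as 'Seller Notes' readable.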

Upvotes: 4
