Reputation: 145

extract text from strong tag

I am trying to extract the exterior color, interior color, and transmission of each listing from cars.com.

HTML:

<ul class="listing-row__meta">
 <li>
  <strong>
    Ext. Color:
  </strong>
    Gray
 </li>
 <li>
  <strong>
    Int. Color:
  </strong>
    White
 </li>
 <li>
  <strong>
    Transmission:
  </strong>
    Automatic
 </li>
</ul>

I tried the following code, but it fails with the error 'expected string or bytes-like object'. Any recommendations or solutions would be appreciated.

from bs4 import BeautifulSoup
import requests
import re

url ='https://www.cars.com/for-sale/searchresults.action/?zc=92617&rd=10&stkTypId=28881&mkId=28263&searchSource=RESEARCH_SHOP_INDEX'
response = requests.get(url)
page = response.text
soup = BeautifulSoup(page, 'lxml')

all_matches = soup.find_all('div',{'class':'shop-srp-listings__listing-container'})

for each in all_matches:

    info=each.findAll('ul',class_='listing-row__meta')
    pattern=re.compile(r'Ext. Color:')
    matches=pattern.finditer(info)
    for match in matches:
        print(match.text)

Upvotes: 1

Views: 857

Answers (4)

QHarr

Reputation: 84465

There is no need for regex here. The HTML is regular, and with bs4 4.7.1+ you can use the :contains pseudo-class to target elements by their text, then next_sibling to get the adjacent text node holding the value (newer Soup Sieve releases name this selector :-soup-contains). Build one list per field, zip them, and convert to a DataFrame.

import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

headers = ['Make','Ext','Int','Trans','Drive']
r = requests.get('https://www.cars.com/for-sale/searchresults.action/?zc=92617&rd=10&stkTypId=28881&mkId=28263&searchSource=RESEARCH_SHOP_INDEX')
soup = bs(r.content, 'lxml')
make = [i.text.strip() for i in soup.select('.listing-row__title')]
ext_color = [i.next_sibling.strip() for i in soup.select('strong:contains("Ext. Color:")')]
int_color = [i.next_sibling.strip() for i in soup.select('strong:contains("Int. Color:")')]
transmission = [i.next_sibling.strip() for i in soup.select('strong:contains("Transmission:")')]
drive = [i.next_sibling.strip() for i in soup.select('strong:contains("Drivetrain:")')]
df = pd.DataFrame(zip(make, ext_color, int_color, transmission, drive), columns = headers)
print(df)
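
As a hedged, offline sketch of the same next_sibling idea: the snippet below parses a condensed copy of the HTML from the question instead of fetching the live page, and walks each strong tag with find_all rather than :contains. The helper name meta_to_dict is hypothetical. Building one dict per listing is a design choice that keeps each label with its own value, so a listing with a missing field cannot shift the other columns the way zipping independent lists could.

```python
from bs4 import BeautifulSoup

# Condensed sample HTML from the question (assumption: the live page has the same shape)
html = """
<ul class="listing-row__meta">
 <li><strong>Ext. Color:</strong> Gray</li>
 <li><strong>Int. Color:</strong> White</li>
 <li><strong>Transmission:</strong> Automatic</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")


def meta_to_dict(ul):
    """Map each <strong> label in one listing to the text node that follows it."""
    row = {}
    for strong in ul.find_all("strong"):
        label = strong.get_text(strip=True).rstrip(":")
        value = strong.next_sibling
        # next_sibling may be None or another tag; keep only plain text nodes
        row[label] = value.strip() if isinstance(value, str) else None
    return row


rows = [meta_to_dict(ul) for ul in soup.find_all("ul", class_="listing-row__meta")]
print(rows)
```

The resulting list of dicts can be passed to pd.DataFrame(rows) if a table is wanted.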

Upvotes: 0

Emma

Reputation: 27743

This might be closer to what you are trying to extract, using an expression similar to:

(?is)<strong>\s*([^<]*?)\s*<\/strong>

or,

(?is)(?<=<strong>)\s*[^<]*?\s*(?=<\/strong>)

You can most likely do this with bs4's built-in functions too.
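
To see what such an expression captures without hitting the site, here is a minimal sketch that runs the two-group variant (label and value, the form used in the tests below) against the HTML snippet from the question. The sample string is copied from the question; it is an assumption that the live page keeps this layout.

```python
import re

# HTML snippet from the question, used as offline test input
html = """
<ul class="listing-row__meta">
 <li>
  <strong>
    Ext. Color:
  </strong>
    Gray
 </li>
 <li>
  <strong>
    Int. Color:
  </strong>
    White
 </li>
 <li>
  <strong>
    Transmission:
  </strong>
    Automatic
 </li>
</ul>
"""

# Group 1: label inside <strong>; group 2: text up to the next tag
pattern = re.compile(r'(?is)<strong>\s*([^<]*?)\s*</strong>\s*([^<]*?)\s*<')

pairs = pattern.findall(html)
print(pairs)
print(dict(pairs))
```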

Test 1

from bs4 import BeautifulSoup
import urllib
import re
import requests

url = 'https://www.cars.com/for-sale/searchresults.action/?zc=92617&rd=10&stkTypId=28881&mkId=28263&searchSource=RESEARCH_SHOP_INDEX'
response = requests.get(url)
page = response.text
soup = BeautifulSoup(page, 'lxml')

all_matches = soup.find_all(
    'div', {'class': 'shop-srp-listings__listing-container'})

for each in all_matches:
    info = each.findAll('ul', class_='listing-row__meta')
    matches = re.findall(
        r'(?is)<strong>\s*[^<]*?\s*<\/strong>\s*([^<]*?)\s*<', str(info[0]))
    for match in matches:
        print(match)

Output 1

Gray
Beige
Automatic
AWD
Gray
White
Automatic
AWD
Black

Test 2

You can also build a dict from the matches with a small modification:

from bs4 import BeautifulSoup
import urllib
import re
import requests

url = 'https://www.cars.com/for-sale/searchresults.action/?zc=92617&rd=10&stkTypId=28881&mkId=28263&searchSource=RESEARCH_SHOP_INDEX'
response = requests.get(url)
page = response.text
soup = BeautifulSoup(page, 'lxml')

all_matches = soup.find_all(
    'div', {'class': 'shop-srp-listings__listing-container'})

for each in all_matches:
    info = each.findAll('ul', class_='listing-row__meta')
    matches = dict(re.findall(
        r'(?is)<strong>\s*([^<]*?)\s*<\/strong>\s*([^<]*?)\s*<', str(info[0])))

    for k, v in matches.items():
        print(f'{k} {v}')

Output 2

Ext. Color: Gray
Int. Color: Beige
Transmission: Automatic
Drivetrain: AWD
Ext. Color: Gray
Int. Color: White
Transmission: Automatic
Drivetrain: AWD
Ext. Color: Black

Test 3

If you'd rather have lists:

from bs4 import BeautifulSoup
import urllib
import re
import requests

url = 'https://www.cars.com/for-sale/searchresults.action/?zc=92617&rd=10&stkTypId=28881&mkId=28263&searchSource=RESEARCH_SHOP_INDEX'
response = requests.get(url)
page = response.text
soup = BeautifulSoup(page, 'lxml')

all_matches = soup.find_all(
    'div', {'class': 'shop-srp-listings__listing-container'})

for each in all_matches:
    info = each.findAll('ul', class_='listing-row__meta')
    matches = re.findall(
        r'(?is)<strong>\s*([^<]*?)\s*<\/strong>\s*([^<]*?)\s*<', str(info[0]))

    for match in matches:
        print(list(match))

Output 3

['Transmission:', 'Automatic']
['Drivetrain:', 'RWD']
['Ext. Color:', 'Gray']
['Int. Color:', 'Gray']
['Transmission:', 'Automatic']
['Drivetrain:', 'RWD']
['Ext. Color:', 'White']
['Int. Color:', 'Black']
['Transmission:', 'Automatic']
['Drivetrain:', 'RWD']
['Ext. Color:', 'White']
['Int. Color:', 'Beige']
['Transmission:', 'Automatic']
['Drivetrain:', 'AWD']
['Ext. Color:', 'Gray']
['Int. Color:', 'Beige']
['Transmission:', 'Automatic']
['Drivetrain:', 'AWD']
['Ext. Color:', 'White']

Test 4

from bs4 import BeautifulSoup
import urllib
import re
import requests

url = 'https://www.cars.com/for-sale/searchresults.action/?zc=92617&rd=10&stkTypId=28881&mkId=28263&searchSource=RESEARCH_SHOP_INDEX'
response = requests.get(url)
page = response.text
soup = BeautifulSoup(page, 'lxml')

all_matches = soup.find_all(
    'div', {'class': 'shop-srp-listings__listing-container'})


keys = ['Ext. Color', 'Int. Color', 'Transmission', 'Drivetrain']

outputs = dict()

for each in all_matches:
    info = each.findAll('ul', class_='listing-row__meta')
    matches = dict(re.findall(
        r'(?is)<strong>\s*([^<:]*?)\s*:\s*<\/strong>\s*([^<]*?)\s*<', str(info[0])))

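    for item in matches.items():
        # Note: for a key seen for the first time, both branches below run,
        # so that key's first value is stored twice (visible in the output).
        if item[0] not in outputs:
            outputs[item[0]] = [item[1]]
        if item[0] in keys:
            outputs[item[0]].append(item[1])

print(outputs)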

Output 4

{'Ext. Color': ['Silver', 'Silver', 'White', 'White', 'Black', 'Gray', 'Gray', 'Black', 'Black', 'White', 'Blue', 'Red', 'Silver', 'Gray', 'Black', 'White', 'Black', 'Gray', 'White', 'Black', 'Black'], 'Int. Color': ['Beige', 'Beige', 'Black', 'White', 'Black', 'Black', 'Gray', 'Beige', 'Black', 'Black', 'Beige', 'Beige', 'Black', 'Black', 'Black', 'Black', 'Black', 'Black', 'White', 'White', 'Black'], 'Transmission': ['Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic'], 'Drivetrain': ['AWD', 'AWD', 'AWD', 'AWD', 'RWD', 'RWD', 'RWD', 'RWD', 'AWD', 'RWD', 'RWD', 'RWD', 'AWD', 'RWD', 'RWD', 'AWD', 'RWD', 'AWD', 'AWD', 'AWD', 'AWD']}
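
Note that in the loop above, a key seen for the first time passes both if checks, so its first value lands in the list twice (each list in the output starts with a repeated value). A minimal sketch of one way to avoid that uses dict.setdefault; the listings data here is hypothetical and stands in for the per-listing dicts produced by re.findall.

```python
keys = ['Ext. Color', 'Int. Color', 'Transmission', 'Drivetrain']

# Hypothetical per-listing results, standing in for dict(re.findall(...))
listings = [
    {'Ext. Color': 'Silver', 'Int. Color': 'Beige', 'Transmission': 'Automatic'},
    {'Ext. Color': 'White', 'Int. Color': 'Black', 'Drivetrain': 'AWD'},
]

outputs = {}
for matches in listings:
    for key, value in matches.items():
        if key in keys:
            # setdefault creates the list once, so each value is appended exactly once
            outputs.setdefault(key, []).append(value)

print(outputs)
```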

Test 5

from bs4 import BeautifulSoup
import urllib
import re
import requests

url = 'https://www.cars.com/for-sale/searchresults.action/?zc=92617&rd=10&stkTypId=28881&mkId=28263&searchSource=RESEARCH_SHOP_INDEX'
response = requests.get(url)
page = response.text
soup = BeautifulSoup(page, 'lxml')

all_matches = soup.find_all(
    'div', {'class': 'shop-srp-listings__listing-container'})


keys = ['Ext. Color', 'Int. Color', 'Transmission', 'Drivetrain']

outputs = dict()

for each in all_matches:
    info = each.findAll('ul', class_='listing-row__meta')
    matches = dict(re.findall(
        r'(?is)<strong>\s*([^<:]*?)\s*:\s*<\/strong>\s*([^<]*?)\s*<', str(info[0])))

    for item in matches.items():
        if item[0] not in outputs:
            outputs[item[0]] = [item[1]]
        if item[0] in keys:
            outputs[item[0]].append(item[1])


print(outputs)

print('*' * 50)

no_duplicate_outputs = dict()
for item in outputs.items():
    if item[0] not in no_duplicate_outputs:
        no_duplicate_outputs[item[0]] = list(set(item[1]))

print(no_duplicate_outputs)

Output 5

{'Ext. Color': ['Black', 'Black', 'White', 'Black', 'Other', 'Gray', 'White', 'White', 'Gray', 'White', 'Gray', 'Silver', 'Blue', 'Black', 'Silver', 'Silver', 'Black', 'Blue', 'Blue', 'Black', 'White'], 'Int. Color': ['Black', 'Black', 'Beige', 'Beige', 'Black', 'Gray', 'Black', 'Beige', 'Beige', 'White', 'Black', 'Black', 'Gray', 'Black', 'Black', 'Gray', 'Black', 'Black', 'Black', 'White', 'Black'], 'Transmission': ['Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic'], 'Drivetrain': ['AWD', 'AWD', 'RWD', 'RWD', 'RWD', 'RWD', 'RWD', 'AWD', 'AWD', 'AWD', 'RWD', 'AWD', 'AWD', 'AWD', 'AWD', 'AWD', 'RWD', 'AWD', 'AWD', 'AWD', 'AWD']}
**************************************************
{'Ext. Color': ['Silver', 'White', 'Blue', 'Other', 'Black', 'Gray'], 'Int. Color': ['Beige', 'White', 'Black', 'Gray'], 'Transmission': ['Automatic'], 'Drivetrain': ['RWD', 'AWD']}
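
Because set() does not keep insertion order, the de-duplicated lists above come back in arbitrary order. If first-seen order matters, one option is dict.fromkeys, sketched here on a short hypothetical list of colors:

```python
# Hypothetical sample values, standing in for outputs['Ext. Color']
ext_colors = ['Black', 'Black', 'White', 'Black', 'Other', 'Gray']

# dict keys are unique and keep insertion order, so this drops repeats in place
unique_in_order = list(dict.fromkeys(ext_colors))
print(unique_in_order)
```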


If you wish to simplify, modify, or explore the expression, regex101.com explains each part of it in its top-right panel, and you can also test there how it matches against sample inputs.


RegEx Circuit

jex.im visualizes regular expressions (the diagram image is not preserved here).

Upvotes: 2

oppressionslayer

Reputation: 7224

The error you're getting can be fixed by casting to str:

    matches=pattern.finditer(info)

change to:

    matches=pattern.finditer(str(info))
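
A small standalone sketch of why the cast matters, using a plain list as a stand-in for the bs4 result (an assumption made so the snippet runs without the site). It also reads each match with .group(), since re match objects have no .text attribute, so the print in the question would need that change as well.

```python
import re

pattern = re.compile(r'Ext\. Color:')

# Stand-in for the non-string object returned by findAll
info = ['<strong>Ext. Color:</strong> Gray']

try:
    list(pattern.finditer(info))  # not a str: raises TypeError
    raised = False
except TypeError:
    raised = True

# After casting to str, the pattern can be applied
found = [match.group() for match in pattern.finditer(str(info))]
print(raised, found)
```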

Upvotes: 1

j6m8

Reputation: 2409

BeautifulSoup's findAll function returns a list of results, so info is a collection of elements rather than a single string. You may need to iterate over each item in info as well.

Those items are bs4 Tag objects (not strings), which can be cast to a string so they fit the finditer API. (This is particularly confusing because bs4 renders them as though they were strings when you print info!)

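pattern = re.compile(r'Ext\. Color:')  # compile once; escape the literal dot

for each in all_matches:
    info = each.findAll('ul', class_='listing-row__meta')
    for item in info:
        # Cast the Tag to str so it fits the finditer API
        matches = pattern.finditer(str(item))
        for match in matches:
            print(match.group())  # match objects expose .group(), not .text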

In this example, info will likely be a list of length 1. If you're sure you only want the first result, and that there will only be one, you can switch to a call that returns a single element (such as find), or simply take the first result with this line:

info = each.findAll('ul', class_='listing-row__meta')[0]

and then use your code from the question, passing str(info) to finditer and printing match.group().
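
As a rough illustration of the single-result route, the sketch below uses find (which returns one Tag or None) on a condensed copy of the question's HTML and then applies a regex to str(info). The extended pattern that also captures the value is an assumption added for the example, not part of the question's code.

```python
import re
from bs4 import BeautifulSoup

# Condensed sample HTML from the question, parsed offline
html = """
<ul class="listing-row__meta">
 <li><strong>Ext. Color:</strong> Gray</li>
 <li><strong>Int. Color:</strong> White</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# find returns the first matching Tag, or None when nothing matches
info = soup.find("ul", class_="listing-row__meta")

# Capture the text that follows the "Ext. Color:" label
pattern = re.compile(r"Ext\. Color:\s*</strong>\s*([^<]+?)\s*<")
matches = [] if info is None else pattern.findall(str(info))
print(matches)
```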

Upvotes: 1
