Can't scrape certain fields having messy format from a webpage

Question

I've written a script in Python to get some items from a webpage. The problem is that the content I want to grab is not separated into its own tags, classes, or ids. I'm only interested in the address and phone, and all of the fields are stacked inside a single p tag. Given that, I tried to gather them in the following manner.

site address

I've tried with:

import re
import requests
from bs4 import BeautifulSoup

url = 'https://ams.contractpackaging.org/i4a/memberDirectory/?controller=memberDirectory&action=resultsDetail&directory_id=6&detail_lookup_id=90DB59F83AFA02C0'

res = requests.get(url,headers={'User-Agent':'Mozilla/5.0'})
soup = BeautifulSoup(res.text,'lxml')

address = soup.find(class_="memeberDirectory_details").find("p").text.split("Phone")[0].strip()
phone = soup.find(class_="memeberDirectory_details").find("p",text=re.compile("Phone:(.*)"))
print(address,phone)

This yields the following (the address includes the company name, which is not what I want):

Assemblers Inc.

2850 West Columbus Ave.


Chicago IL 60652

UNITED STATES
None

Expected output:

2850 West Columbus Ave.
Chicago IL 60652
UNITED STATES

(773) 378-3000

Andrej Kesely · Accepted Answer

You could try this code to extract address and phone:

import requests
from bs4 import BeautifulSoup
from itertools import takewhile

url = 'https://ams.contractpackaging.org/i4a/memberDirectory/?controller=memberDirectory&action=resultsDetail&directory_id=6&detail_lookup_id=90DB59F83AFA02C0'

soup = BeautifulSoup(requests.get(url).text, 'lxml')

address_soup = soup.select_one('.memeberDirectory_details > p')

# remove the company name, which sits in a <b> tag
for b in address_soup.select('b'):
    b.extract()

data = [val.strip() for val in address_soup.get_text(separator='|').split('|') if val.strip()]

address = [*takewhile(lambda k: 'Phone:' not in k, data)]
phone = [val.replace('Phone:', '').strip() for val in data if 'Phone:' in val]

print('Address:')
print('\n'.join(address))
print()

print('Phone:')
print('\n'.join(phone))

Prints:

Address:
2850 West Columbus Ave.
Chicago IL 60652
UNITED STATES

Phone:
(773) 378-3000
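
The splitting step can be tried without a network request. Below is a minimal sketch, assuming `data` holds the stripped text fragments that `get_text(separator='|')` would produce for this page once the company name is removed. The list is hard-coded from the values shown in this thread, and `split_fields` is a hypothetical helper name, not part of any library.

```python
from itertools import takewhile


def split_fields(data):
    """Split text fragments into address lines and phone numbers."""
    # Address lines: everything before the first fragment containing 'Phone:'
    address = list(takewhile(lambda k: 'Phone:' not in k, data))
    # Phone entries: fragments carrying the 'Phone:' label, with the label removed
    phone = [val.replace('Phone:', '').strip() for val in data if 'Phone:' in val]
    return address, phone


# Hard-coded stand-in for the fragments scraped from the page (assumption)
data = [
    '2850 West Columbus Ave.',
    'Chicago IL 60652',
    'UNITED STATES',
    'Phone: (773) 378-3000',
]

address, phone = split_fields(data)
print('\n'.join(address))
print('\n'.join(phone))
```

Because `takewhile` stops at the first fragment that contains 'Phone:', this only works if the address lines come before the phone line in the paragraph, as they appear to on this page.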

EDIT:

To find text with regular expression, you could do this:

import re

phone = soup.find(class_="memeberDirectory_details").find(text=re.compile("Phone:(.*)"))
print(phone)

Prints:

Phone: (773) 378-3000
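
If only the number is wanted, the label can be dropped with a capture group. This is a sketch using only the standard `re` module, assuming the matched text has the "Phone: ..." form shown above; `extract_phone` is a hypothetical helper name.

```python
import re


def extract_phone(text):
    """Return whatever follows 'Phone:' on the same line, or None if absent."""
    # '.' does not match a newline, so the capture stops at the end of the line
    match = re.search(r'Phone:\s*(.+)', text)
    return match.group(1).strip() if match else None


print(extract_phone('Phone: (773) 378-3000'))
```

Returning None when the label is missing lets the caller tell "no phone listed" apart from an empty string.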
