I've written a script in Python to get some items from a webpage. The problem is that the content I wish to grab is not in separate tags, classes, or ids. I'm only interested in the address and the phone number. All of them are stacked in a single p tag. Given that, I tried to gather them in the following manner.
Here is what I tried:
import re
import requests
from bs4 import BeautifulSoup
url = 'https://ams.contractpackaging.org/i4a/memberDirectory/?controller=memberDirectory&action=resultsDetail&directory_id=6&detail_lookup_id=90DB59F83AFA02C0'
res = requests.get(url,headers={'User-Agent':'Mozilla/5.0'})
soup = BeautifulSoup(res.text,'lxml')
address = soup.find(class_="memeberDirectory_details").find("p").text.split("Phone")[0].strip()
phone = soup.find(class_="memeberDirectory_details").find("p",text=re.compile("Phone:(.*)"))
print(address,phone)
This yields the following (the address includes the company name, which I don't want, and the phone comes back as None):
Assemblers Inc.
2850 West Columbus Ave.
Chicago IL 60652
UNITED STATES
None
Expected output:
2850 West Columbus Ave.
Chicago IL 60652
UNITED STATES
(773) 378-3000
Upvotes: 0
Views: 43
Reputation: 490
Instead of finding the <p> tag and then searching it separately for each field, grab the <p> tag once and store all the <br>-separated items in a list. If the list always has the same layout, you can simply pop off the first element (the company name). If you wish to stick with your approach, you could split the address at the first occurrence of a digit, but this would break on company names that contain a number.
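A minimal sketch of that idea, assuming the block looks roughly like the hypothetical markup below (company name first, remaining lines separated by <br>). The sample HTML and the helper name split_details are illustrative only and are not taken from the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the output shown in the question;
# the real page may be structured differently.
SAMPLE_HTML = """
<div class="memeberDirectory_details">
  <p><b>Assemblers Inc.</b><br>
  2850 West Columbus Ave.<br>
  Chicago IL 60652<br>
  UNITED STATES<br>
  Phone: (773) 378-3000</p>
</div>
"""


def split_details(html):
    """Return (address_lines, phone) from the first <p> of the details block."""
    soup = BeautifulSoup(html, "html.parser")
    paragraph = soup.find(class_="memeberDirectory_details").find("p")
    # stripped_strings yields every text fragment between tags, trimmed.
    lines = list(paragraph.stripped_strings)
    lines.pop(0)  # assumption: the first entry is always the company name
    address = [line for line in lines if not line.startswith("Phone:")]
    phone = next(
        (line.replace("Phone:", "", 1).strip()
         for line in lines if line.startswith("Phone:")),
        None,
    )
    return address, phone


address, phone = split_details(SAMPLE_HTML)
print("\n".join(address))
print(phone)
```

This only holds up if every record lists the company name first; a record with a missing name would lose its first address line instead.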
Upvotes: 0
Reputation: 195408
You could try this code to extract address and phone:
import requests
from bs4 import BeautifulSoup
from itertools import takewhile
url = 'https://ams.contractpackaging.org/i4a/memberDirectory/?controller=memberDirectory&action=resultsDetail&directory_id=6&detail_lookup_id=90DB59F83AFA02C0'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
address_soup = soup.select_one('.memeberDirectory_details > p')
# remove company name in <b> tag
for b in address_soup.select('b'):
    b.extract()
data = [val.strip() for val in address_soup.get_text(separator='|').split('|') if val.strip()]
address = [*takewhile(lambda k: 'Phone:' not in k, data)]
phone = [val.replace('Phone:', '').strip() for val in data if 'Phone:' in val]
print('Address:')
print('\n'.join(address))
print()
print('Phone:')
print('\n'.join(phone))
Prints:
Address:
2850 West Columbus Ave.
Chicago IL 60652
UNITED STATES

Phone:
(773) 378-3000
EDIT:
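To find the text with a regular expression, you could do this (drop the "p" argument so the pattern is matched against the text nodes themselves):
import re
phone = soup.find(class_="memeberDirectory_details").find(text=re.compile("Phone:(.*)"))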
print(phone)
Prints:
Phone: (773) 378-3000
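As a rough illustration of why the original find("p", text=...) call returned None: when a tag name is given, Beautiful Soup tests the pattern against that tag's .string, which is None for a tag with several children. The sketch below runs against hypothetical sample markup (not the real page) and uses string=, the newer spelling of text= in Beautiful Soup 4.4.0 and later; it also pulls out just the number through the capture group:

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup modeled on the output shown in the question.
SAMPLE_HTML = (
    '<div class="memeberDirectory_details"><p><b>Assemblers Inc.</b><br>'
    "2850 West Columbus Ave.<br>Chicago IL 60652<br>UNITED STATES<br>"
    "Phone: (773) 378-3000</p></div>"
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
details = soup.find(class_="memeberDirectory_details")
pattern = re.compile(r"Phone:\s*(.*)")

# With a tag name, the pattern is tested against <p>.string, which is None
# because this <p> has several children, so nothing matches.
tag_match = details.find("p", string=pattern)

# Without a tag name, the pattern is tested against each text node instead.
text_node = details.find(string=pattern)

# Keep only the captured number rather than the whole "Phone: ..." string.
phone_number = pattern.search(text_node).group(1).strip() if text_node else None

print(tag_match)
print(phone_number)
```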
Upvotes: 1