Reputation: 11
I'm running this code in order to scrape zip codes from a website using BS4.
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = "https://example.com"
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
# html parsing
page_soup = soup(page_html, "html.parser")
# grabs each zip code
zip_code = page_soup.findAll("span",{"itemprop":"postalCode"})
print(zip_code)
I end up with this output, which is just a list of spans. Each span contains the zip code that I need.
[<span itemprop="postalCode">03257</span>, <span itemprop="postalCode">34240</span>, <span itemprop="postalCode">84660</span>, <span itemprop="postalCode">07717</span>]
However, I cannot figure out how to remove everything BUT the zip code that is in between the span tags. The goal is to end up with a list of zip codes only.
Thank you for the help.
Upvotes: 0
Views: 72
Reputation: 633
To get only the text from a tag, use tag.text:
zip_codes = page_soup.find_all("span", {"itemprop": "postalCode"})
zip_codes = [tag.text for tag in zip_codes]
print(zip_codes) # ['03257', '34240', '84660', '07717']
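If you want to try this without hitting the site, here is a minimal self-contained sketch. The sample_html string is made up for illustration and only stands in for the downloaded page. It uses get_text(strip=True) instead of .text, which does the same job but also trims stray whitespace around the value.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page (an assumption for
# illustration; the actual page will contain far more than this).
sample_html = """
<div>
  <span itemprop="postalCode">03257</span>
  <span itemprop="postalCode"> 34240 </span>
  <span itemprop="streetAddress">1 Main St</span>
  <span itemprop="postalCode">84660</span>
</div>
"""

page_soup = BeautifulSoup(sample_html, "html.parser")

# Select only the spans marked as postal codes.
spans = page_soup.find_all("span", {"itemprop": "postalCode"})

# get_text(strip=True) returns the tag's text with surrounding whitespace removed.
zip_codes = [tag.get_text(strip=True) for tag in spans]

print(zip_codes)  # ['03257', '34240', '84660']
```

Keep the results as strings rather than converting them with int(), otherwise a zip code like 03257 loses its leading zero.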
Upvotes: 2