Noob

Reputation: 117

How to get specific data within P tag during web scraping?

I'm trying to scrape data from a website that keeps its information inside a single <p> tag. The only data I'm interested in is the contact details, which sit in that same <p> tag alongside other text. How can I extract only the required data?

Here is a screenshot of the website. How can I get the text from Company to Tel. no.?

Upvotes: 1

Views: 1105

Answers (2)

Andrej Kesely

Reputation: 195573

You can use the re module to parse the text.

For example:

import re
import requests
from bs4 import BeautifulSoup

url = 'https://www.forpressrelease.com/forpressrelease/553538/4/china-leading-cabinet-handles-supplier-rochehandle-celebrates-success-of-entering-european-market'

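# Download the page and parse the HTML
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# Flatten the article body to plain text, one text node per line
txt = soup.select_one('.single_page_content').get_text(strip=True, separator='\n')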

company = re.findall(r'Company:\s*(.*)', txt)[0]
address = re.findall(r'Address:\s*(.*)', txt)[0]
contact = re.findall(r'Contact:\s*(.*)', txt)[0]
email = re.findall(r'Email:\s*(.*?)\s*(?=\w+:)', txt, flags=re.S)[0]
tel = re.findall(r'Tel:\s*(.*)', txt)[0]
mob = re.findall(r'Mob:\s*(.*)', txt)[0]
url = re.findall(r'Url\s*:\s*-\s*(.*)', txt, flags=re.S)[0]

print('{:<15}: {}'.format('Company', company))
print('{:<15}: {}'.format('Address', address))
print('{:<15}: {}'.format('Contact', contact))
print('{:<15}: {}'.format('Email', email))
print('{:<15}: {}'.format('Tel', tel))
print('{:<15}: {}'.format('Mob', mob))
print('{:<15}: {}'.format('Url', url))

Prints:

Company        : Dongguan Roche Industrial Co., Ltd
Address        : No.83, XiZheng 1st Road, Shajiao Community, Humen Town, Dongguan City, Guangdong Province, China 523936
Contact        : Robin Luo
Email          : [email protected]
Tel            : 0769-89366747
Mob            : +86-13392706499
Url            : https://www.rochehandle.com
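
One caveat with the snippet above: indexing `re.findall(...)[0]` raises `IndexError` when a label is missing from the page. A minimal sketch of a more forgiving lookup follows. The helper name `find_field` and the sample text are hypothetical, made up for illustration, and it assumes each value sits on the same line as its label.

```python
import re

def find_field(label, text, default=None):
    """Return the text after 'label:' on the same line, or default if absent."""
    # re.M lets ^ and $ match at every line boundary, not only at the ends of the string
    pattern = r'^{}[ \t]*:[ \t]*(.*)$'.format(re.escape(label))
    match = re.search(pattern, text, flags=re.M)
    return match.group(1).strip() if match else default

# Hypothetical sample text, standing in for the flattened page text
sample = "Company: Example Co., Ltd\nContact: Jane Doe\nTel: 0123-456789"

print(find_field('Company', sample))
print(find_field('Mob', sample, default='n/a'))
```

Returning a default keeps the scraper running across pages where some fields are absent, at the cost of having to check for the default later.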

Upvotes: 2

ababak

Reputation: 1803

You can use a regular expression to parse the text of the <p> block you get from BeautifulSoup:

import re

text_from_p = """
some text
some more
Tel: 0234-234345-45

some more text
"""

match = re.search(r"Tel: (?P<tel>[0-9\- ]*)", text_from_p)
if match:
    print(match.group("tel"))
else:
    print("Tel not found")

You get:

0234-234345-45
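
The same idea can be extended to pull several labelled fields at once. The sketch below is only an illustration: the label list and the sample text are assumptions, and it relies on every field being written as `Label: value` on its own line.

```python
import re

# Hypothetical sample, standing in for the text extracted from the <p> block
text_from_p = """
Company: Example Co., Ltd
Contact: Jane Doe
Tel: 0234-234345-45
some unrelated text
"""

# Labels we care about (an assumption; adjust to the page being scraped)
labels = ['Company', 'Contact', 'Tel']

# Match only lines that start with one of the wanted labels
pattern = re.compile(
    r'^(?P<label>{})[ \t]*:[ \t]*(?P<value>.+)$'.format('|'.join(map(re.escape, labels))),
    flags=re.M,
)

# Build a dict mapping each label found to its value
fields = {m.group('label'): m.group('value').strip() for m in pattern.finditer(text_from_p)}
print(fields)
```

Restricting the pattern to known labels avoids false matches on other lines that happen to contain a colon, such as a line starting with `https://`.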

Upvotes: 1
