Jonas

Reputation: 99

How to replace unwanted text in BeautifulSoup output

I'm having trouble producing clean text from <p> elements. Not everything inside the <p> elements is text I want to extract. I used the replace function and successfully removed two of the unwanted strings, but for some reason it doesn't match the third one.

First question: Why doesn't the replace function match the third string I selected?

Second question: Is there an alternative to the replace function? For example, blacklisting <p> tags under <script> and so on? Code:

import requests
from bs4 import BeautifulSoup

url = 'https://www.cgtrader.com/3d-models'

response = requests.get(url)

soup = BeautifulSoup(response.content, 'html.parser')
content = soup.find_all('p')

for content2 in content:
    i = content2.get_text().replace('Type something to search', ' ').replace('Your shopping cart is empty.', ' ').replace('By subscribing you confirm that you have read and accept our Terms of Use', ' ')
    print(i)

"By subscribing you confirm that you have read and accept our Terms of Use"

This part of the print output is not being replaced for some reason.

Upvotes: 1

Views: 63

Answers (2)

Andrej Kesely

Reputation: 195643

1.) The text is not replaced because it contains a \xa0 (non-breaking space) character, so a search string typed with ordinary spaces does not match it.

2.) To skip <p> tags that are nested under <script> tags, you can use the .find_parent() method.

Example:

import requests
from bs4 import BeautifulSoup

url = 'https://www.cgtrader.com/3d-models'

response = requests.get(url)

soup = BeautifulSoup(response.content, 'html.parser')
content = soup.find_all('p')

for content2 in content:
    # Skip any <p> tag that has a <script> ancestor
    if content2.find_parent('script'):
        continue
    # Convert non-breaking spaces (\xa0) to regular spaces so the literals below can match
    i = content2.get_text().replace('\xa0', ' ')
    i = i.replace('Type something to search', ' ').replace('Your shopping cart is empty.', ' ').replace('By subscribing you confirm that you have read and accept our Terms of Use', ' ')
    print(i)

Prints:

Find the exact right 3D content for your needs, including AR/VR, gaming, advertising, entertainment and others
Buy or free-download professional 3D models ready to be used in CG projects, film and video production, animation, visualizations, games, VR/AR, and others. Assets are available for download in many industry-accepted formats including MAX, OBJ, FBX, 3DS, STL, C4D, BLEND, MA, MB and other. If you are searching for high poly or real-time 3D assets, we have a leading digital art library for all your needs.
This category covers 3D aircraft. CG airplanes will fit into simulations, visualizations, advertisements and videos. Drone bodies and parts will delight fans of tiny flying vehicles. And the rigged models are ready to be imported into game engines.

...and so on.
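
To see the \xa0 effect in isolation, without fetching the page, here is a minimal sketch. The sample string and the position of the non-breaking space in it are assumptions made for illustration; the real page may place it elsewhere. It also shows unicodedata.normalize with the NFKC form as one alternative to the explicit .replace('\xa0', ' ') call, since NFKC maps U+00A0 to a regular space.

```python
import unicodedata

# Hypothetical sample text: where the non-breaking space sits is an assumption.
scraped = "By subscribing you confirm that you have read and accept our\xa0Terms of Use"

# The search string as typed in source code, with ordinary spaces only.
target = "By subscribing you confirm that you have read and accept our Terms of Use"

# No match: '\xa0' and ' ' are different characters, so replace() changes nothing.
unchanged = scraped.replace(target, " ")

# NFKC normalization turns the non-breaking space into a regular space.
normalized = unicodedata.normalize("NFKC", scraped)
cleaned = normalized.replace(target, " ")

print(repr(unchanged))
print(repr(cleaned))
```

Note that NFKC also rewrites other compatibility characters (ligatures, full-width forms and so on), so the plain .replace('\xa0', ' ') is the narrower choice if only non-breaking spaces need handling.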

Upvotes: 1

wasif

Reputation: 15528

The string was somehow incorrect. I copied and pasted it, and then it worked:

i = content2.get_text().replace('Type something to search', ' ').replace('Your shopping cart is empty.', '').replace('By subscribing you confirm that you have read and accept our Terms of Use', ' ')
print(i)
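
If a string looks identical on screen but still fails to match, printing its repr() can reveal invisible characters. The sketch below uses a hypothetical fragment containing a non-breaking space and lists every whitespace character that is not an ordinary space. The fragment and the variable names are assumptions for illustration only.

```python
# Hypothetical fragment; the non-breaking space position is an assumption.
text = "accept our\xa0Terms of Use"

# repr() shows escape sequences for characters that print() renders invisibly.
print(repr(text))

# Collect (index, code point) for whitespace that is not a plain space.
odd_whitespace = [
    (index, hex(ord(ch)))
    for index, ch in enumerate(text)
    if ch.isspace() and ch != " "
]
print(odd_whitespace)
```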

Upvotes: 0
