Ali

Reputation: 95

Find a word using BeautifulSoup

I want to extract ads that contain either of two Persian words, "توافق" or "توافقی", from a website. I am using BeautifulSoup and splitting the content of each result to find the ads that contain these words, but my code does not work. Could you please help me? Here is my simple code:

import requests
from bs4 import BeautifulSoup

r = requests.get("https://divar.ir/s/tehran")
soup = BeautifulSoup(r.text, "html.parser")
results = soup.find_all("div", attrs={"class": "kt-post-card__body"})
for content in results:
    words = content.split()
    if words == "توافقی" or words == "توافق":
        print(content)

Upvotes: 2

Views: 548

Answers (3)

HedgeHog

Reputation: 25048

There are several issues. First, as @Tim Roberts also mentioned, comparing the list returned by split() to a string with == is always False; test for membership with in instead:

if 'توافقی' in words or 'توافق' in words:

Second, you have to separate the texts of the child elements from each other, so use get_text() with a separator:

words = content.get_text(' ', strip=True)
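
To see why the separator matters, here is a minimal sketch on a made-up card (the markup below is an assumption for illustration, not copied from the site). Without a separator the child texts run together, so the keyword never shows up as its own word:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; the real page structure may differ.
html = (
    '<div class="kt-post-card__body">'
    "<h2>پراید</h2><div>توافقی</div><span>تهران</span>"
    "</div>"
)
card = BeautifulSoup(html, "html.parser").div

joined = card.get_text()                 # child texts glued together
spaced = card.get_text(" ", strip=True)  # one space between child texts

print(joined.split())  # a single token, the keyword is buried inside it
print(spaced.split())  # one token per child element
```

With the separator, both a substring test on the string and a membership test on spaced.split() can find the keyword.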

Note: requests does not render dynamic content; it only retrieves the static HTML the server returns, so anything added later by JavaScript will be missing.

Example
import requests
from bs4 import BeautifulSoup

r = requests.get('https://divar.ir/s/tehran')
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('div', attrs={'class': 'kt-post-card__body'})
for content in results:
    # Join the texts of all child elements with a space between them.
    words = content.get_text(' ', strip=True)
    if 'توافقی' in words or 'توافق' in words:
        print(content.text)

An alternative in this specific case is to use CSS selectors: select the whole <article> and pick the elements you need from it:

results = soup.select('article:-soup-contains("توافقی"),article:-soup-contains("توافق")')

for item in results:
    print(item.h2)
    print(item.span)
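
Since "توافق" is a substring of "توافقی", a single :-soup-contains() should be enough to match both. A small sketch on made-up HTML (the two <article> elements are my own placeholders), assuming a reasonably recent soupsieve, since older releases only knew this selector as :contains():

```python
from bs4 import BeautifulSoup

# Hypothetical listing with two ads; only the first mentions the keyword.
html = """
<article><h2>پراید</h2><span>توافقی</span></article>
<article><h2>پژو</h2><span>۲۰۰ میلیون تومان</span></article>
"""
soup = BeautifulSoup(html, "html.parser")

# "توافق" also matches "توافقی" because the selector does a substring test.
matches = soup.select('article:-soup-contains("توافق")')
titles = [item.h2.get_text(strip=True) for item in matches]
print(titles)
```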

Upvotes: 2

Sachin Salve

Reputation: 146

Basically, you are calling split() on a bs4 Tag object rather than on a string, which is why you get an error. Convert the element to a text string first, then split it:

from bs4 import BeautifulSoup
import requests

r = requests.get("https://divar.ir/s/tehran")
soup = BeautifulSoup(r.text, "html.parser")

results = soup.find_all("div", attrs={"class": "kt-post-card__description"})
for content in results:
    words = content.text.split()
    if "توافقی" in words or "توافق" in words:
        print(content)
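
One thing to keep in mind with split(): in on a list only matches whole tokens, so a keyword followed by punctuation (for example "توافقی،") is missed, whereas in on a string is a plain substring test. If you want whole-word matching that ignores punctuation, a sketch with re could look like this (has_keyword is a hypothetical helper name, and the sample strings are made up):

```python
import re

KEYWORDS = {"توافق", "توافقی"}

def has_keyword(text, keywords=KEYWORDS):
    """Return True if any keyword occurs in text as a whole word."""
    # \w+ matches runs of Unicode word characters, so punctuation such as
    # the Persian comma is dropped instead of sticking to the word.
    tokens = re.findall(r"\w+", text)
    return any(token in keywords for token in tokens)

sample = "قیمت: توافقی، تهران"
print("توافقی" in sample.split())  # plain split keeps the comma on the token
print(has_keyword(sample))         # the regex tokens are compared instead
```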

Upvotes: 1

S.B

Reputation: 16486

Since توافقی appears in the div tags with the kt-post-card__description class, I would search those. From there you can reach the rest of the ad through the tag's properties, such as .previous_sibling or .parent:

import requests
from bs4 import BeautifulSoup

r = requests.get("https://divar.ir/s/tehran")
soup = BeautifulSoup(r.text, "html.parser")
results = soup.find_all("div", attrs={"class": "kt-post-card__description"})
for content in results:
    text = content.text
    if "توافقی" in text or "توافق" in text:
        print(content.previous_sibling)   # It's the h2 title.
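
Be a bit careful with .previous_sibling: if there is whitespace between the tags in the HTML, the previous sibling is that whitespace string, not the h2. A sketch on made-up markup (the wrapper div with class "card" and the texts are assumptions) that asks for the h2 explicitly with find_previous_sibling():

```python
from bs4 import BeautifulSoup

# Hypothetical card with a newline between the title and the description.
html = """<div class="card"><h2>پراید</h2>
<div class="kt-post-card__description">توافقی</div></div>"""
soup = BeautifulSoup(html, "html.parser")
desc = soup.find("div", attrs={"class": "kt-post-card__description"})

print(repr(desc.previous_sibling))        # the whitespace text node
title = desc.find_previous_sibling("h2")  # skips text nodes, finds the h2
print(title.get_text(strip=True))
```

If the site's markup has no whitespace between the two tags, .previous_sibling alone gives the same result.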

Upvotes: 1
