shippy

Reputation: 23

Beautifulsoup - Scraping from a specific class that contains h4

Please see the screenshot of the website's HTML.

There are a lot of <div class="events-sub-list"> elements, but I only want the one whose h4 contains 2021. I couldn't work out how to write an if clause or something similar for this. How can I do that, and can you explain? Thanks in advance!

from bs4 import BeautifulSoup
from docx import Document
from docx.shared import Pt
import requests

user_agent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.37"
url = "https://www.fpri.org/events/archive/"
data = requests.get(url, headers={"User-Agent": user_agent})
soup = BeautifulSoup(data.text, "lxml")

document = Document()

events = soup.find_all("div", class_ = "events-sub-list")
for event in events:
    event_name = event.find("li")
    link = event.find("a")
    try:
        print(event_name.text)
        document.add_paragraph(event_name.text, style='List Bullet')
        print(link['href'])
        document.add_paragraph(link['href'])
    except (AttributeError, TypeError, KeyError):  # missing <li>, missing <a>, or <a> without href
        continue

document.save('demo.docx')

Upvotes: 1

Views: 501

Answers (3)

CutePoison

Reputation: 5355

You can get the text within each tag using tag.text, e.g.:

div = soup.find_all("div", class_="events-sub-list")
h4 = [p for p in div if "2021" in p.text]

Or, more broadly (note that this way you match every h4 on the page, not only the h4 inside the specific div from your example):

h4 = soup.find_all("h4")
h4 = [p for p in h4 if "2021" in p.text]
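
A minimal sketch of the "only the div whose h4 contains 2021" check, in case it helps. The sample markup and the helper name divs_for_year are assumptions made for illustration; the real page may be structured differently. Testing the h4 text rather than the whole div's text avoids matching a div that merely mentions 2021 somewhere in its list items.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the question's description; the real page may differ.
SAMPLE_HTML = """
<div class="events-sub-list">
  <h4>2021</h4>
  <ul><li><a href="/event-a">Event A - July 29, 2021</a></li></ul>
</div>
<div class="events-sub-list">
  <h4>2020</h4>
  <ul><li><a href="/event-b">Event B - recap published 2021</a></li></ul>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")


def divs_for_year(soup, year):
    """Return the events-sub-list divs whose own h4 heading mentions `year`."""
    matches = []
    for div in soup.find_all("div", class_="events-sub-list"):
        heading = div.find("h4")
        # Skip divs without an h4 instead of raising AttributeError.
        if heading is not None and year in heading.get_text():
            matches.append(div)
    return matches


for div in divs_for_year(soup, "2021"):
    for li in div.find_all("li"):
        link = li.find("a")
        print(li.get_text(strip=True), link["href"] if link else None)
```

On this sample, checking "2021" in div.text would match both divs, while the h4 check matches only the first.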

Upvotes: 2

Md. Fazlul Hoque

Reputation: 16187

Try now:

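# find_all returns a list of tags, so read .h4 on each div rather than on the list
divs = soup.find_all("div", class_="events-sub-list")
get_2021 = [d.h4.text for d in divs if d.h4 and "2021" in d.h4.text]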

Upvotes: 1

Andrej Kesely

Reputation: 195468

To get the correct response from the server, set the User-Agent HTTP header:

import requests
from bs4 import BeautifulSoup


url = "https://www.fpri.org/events/archive/"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0"
}

soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")

for li in soup.select('h4:-soup-contains("2021") + ul li'):
    print(li.text)

Prints:

Haiti, Cuba, and the History of U.S. Involvement in the Caribbean  - Barbara Fick - July 29, 2021 -  Events  
Tug-of-War in the Black Sea: Defending NATO’s Eastern Flank  - Maia Otarashvili - July 15, 2021 -  Events  
Freedom of the Border  - Ronald J. Granieri - July 13, 2021 -  People, Politics, and Prose  
The Future of U.S.-China Proxy War  - Aaron Stein - July 6, 2021 -  Events  
The “Polypandemic” Threat: Impacts on Development, Fragility, and Conflict  - Nikolas K. Gvosdev - June 29, 2021 -  Events  
Difficult Choices: Taiwan’s Quest for Security and the Good Life—a book talk with Richard Bush  - Jacques deLisle - June 24, 2021 -  Events  
Why Africa Matters: The Official Launch of FPRI’s Africa Program  - Charles A. Ray - June 17, 2021 -  Events  
We Shall Be Masters: Russian Pivots to East Asia from Peter the Great to Putin  - Ronald J. Granieri - June 15, 2021 -  People, Politics, and Prose  


...and so on.
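
For readers unfamiliar with the selector: h4:-soup-contains("2021") matches an h4 whose text contains 2021, + ul takes the ul element that immediately follows it, and li selects the items inside that list. The sketch below runs the selector on a small inline sample (hypothetical markup, not the real page) and compares it with a step-by-step version using find_next_sibling. The two are only roughly equivalent, since find_next_sibling("ul") would also accept a ul that is not directly adjacent to the h4.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; the real page may differ.
SAMPLE_HTML = """
<div class="events-sub-list">
  <h4>2021</h4>
  <ul><li>Event A</li><li>Event B</li></ul>
</div>
<div class="events-sub-list">
  <h4>2020</h4>
  <ul><li>Event C</li></ul>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# CSS route: h4 containing "2021", then the adjacent ul, then its li items.
via_selector = [
    li.get_text(strip=True)
    for li in soup.select('h4:-soup-contains("2021") + ul li')
]

# Step-by-step route: find the heading, then walk to the next ul sibling.
via_navigation = []
for h4 in soup.find_all("h4"):
    if "2021" in h4.get_text():
        ul = h4.find_next_sibling("ul")
        if ul is not None:
            via_navigation.extend(
                li.get_text(strip=True) for li in ul.find_all("li")
            )

print(via_selector)
print(via_navigation)
```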

Upvotes: 2
