SIM

Reputation: 22440

Can't limit my script to parse a specific section from a webpage

I've written a Python script to scrape the description under the Plot heading of a webpage. The description is spread across several p tags, and the page contains other p tags that I do not want to scrape. The script should stop as soon as it has parsed the Plot description. However, the script below collects every p tag from the Plot section through the end of the page.

How can I limit my script to parse the description of the Plot only?

This is what I've written:

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Alien_(film)"

with requests.Session() as s:
    s.headers={"User-Agent":"Mozilla/5.0"}
    res = s.get(url)
    soup = BeautifulSoup(res.text,"lxml")
    plot = [item.text for item in soup.select_one("#Plot").find_parent().find_next_siblings("p")]
    print(plot)

Upvotes: 1

Views: 43

Answers (2)

user308738

You can pick only the elements that come before the next header, like this:

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Alien_(film)"

with requests.Session() as s:
    s.headers = {"User-Agent": "Mozilla/5.0"}
    res = s.get(url)
    soup = BeautifulSoup(res.text, "lxml")

    # Walk the siblings that follow the Plot heading and stop at the next h2.
    plot = []
    for item in soup.select_one("#Plot").find_parent().find_next_siblings():
        if item.name == "h2":
            break
        plot.append(item.text)
    print(plot)
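
To try the stop-at-the-next-heading idea without hitting the network, here is a minimal sketch that runs on an inline HTML string. The markup in SAMPLE_HTML and the helper name section_paragraphs are made up for illustration and are only a simplified stand-in for the real page, which may be structured differently. Unlike the loop above, this version keeps only p elements, and it uses the built-in "html.parser" so lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the page (not the real Wikipedia markup).
SAMPLE_HTML = """
<div>
  <p>Intro paragraph.</p>
  <h2><span id="Plot">Plot</span></h2>
  <p>First plot paragraph.</p>
  <p>Second plot paragraph.</p>
  <h2><span id="Cast">Cast</span></h2>
  <p>Cast paragraph.</p>
</div>
"""


def section_paragraphs(markup, section_id):
    """Return the text of the <p> siblings that follow the <h2> containing
    the element with id `section_id`, stopping at the next <h2>."""
    soup = BeautifulSoup(markup, "html.parser")
    anchor = soup.find(id=section_id)
    if anchor is None:
        return []
    heading = anchor.find_parent("h2")
    if heading is None:
        return []

    paragraphs = []
    for sibling in heading.find_next_siblings():
        if sibling.name == "h2":
            # The next section starts here, so stop collecting.
            break
        if sibling.name == "p":
            paragraphs.append(sibling.get_text(strip=True))
    return paragraphs


print(section_paragraphs(SAMPLE_HTML, "Plot"))
```

Calling section_paragraphs(SAMPLE_HTML, "Cast") on the same sample returns only the Cast paragraph, since there is no later h2 to stop at.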

Upvotes: 1

Andersson

Reputation: 52665

If you're not required to use BeautifulSoup, you can try the following to get the required text content:

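import requests
from lxml import html

url = "https://en.wikipedia.org/wiki/Alien_(film)"

with requests.Session() as s:
    s.headers = {"User-Agent": "Mozilla/5.0"}
    res = s.get(url)
    source = html.fromstring(res.content)
    # Select every p whose nearest preceding h2 contains a span with the text "Plot".
    plot = [item.text_content() for item in source.xpath('//p[preceding::h2[1][span="Plot"]]')]
    print(plot)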
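
To see how the XPath behaves, here is a small sketch that applies the same expression to an inline HTML string instead of the live page. SAMPLE_HTML and the helper name paragraphs_under_heading are hypothetical and only approximate the real markup. The heading text is passed in as an XPath variable ($title), which lxml accepts as a keyword argument to xpath().

```python
from lxml import html

# Hypothetical, simplified stand-in for the page (not the real Wikipedia markup).
SAMPLE_HTML = """
<div>
  <p>Intro paragraph.</p>
  <h2><span id="Plot">Plot</span></h2>
  <p>First plot paragraph.</p>
  <p>Second plot paragraph.</p>
  <h2><span id="Cast">Cast</span></h2>
  <p>Cast paragraph.</p>
</div>
"""


def paragraphs_under_heading(markup, title):
    """Return the text of every <p> whose nearest preceding <h2>
    has a <span> child with the text `title`."""
    tree = html.fromstring(markup)
    # preceding::h2[1] is the closest h2 before the p in document order;
    # the second predicate keeps the p only if that h2 is the wanted one.
    nodes = tree.xpath("//p[preceding::h2[1][span=$title]]", title=title)
    return [node.text_content().strip() for node in nodes]


print(paragraphs_under_heading(SAMPLE_HTML, "Plot"))
```

Because each p is matched against its closest preceding h2 only, paragraphs in later sections are excluded without an explicit loop or break.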

Upvotes: 1
