Reputation: 22440
I've written a script in Python to scrape the description under the Plot section of a webpage. The description is spread across several p tags, and the page contains other p tags that I do not want to scrape. The script should stop as soon as it has parsed the Plot description. However, my script below collects every p tag from the Plot section through the end of the page.
How can I limit my script to parse only the Plot description?
This is what I've written:
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Alien_(film)"

with requests.Session() as s:
    s.headers = {"User-Agent": "Mozilla/5.0"}
    res = s.get(url)
    soup = BeautifulSoup(res.text, "lxml")
    plot = [item.text for item in soup.select_one("#Plot").find_parent().find_next_siblings("p")]
    print(plot)
Upvotes: 1
Views: 43
Reputation:
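You can collect the elements that come before the next header, like this:
# requests, BeautifulSoup and url are the same as in the question
with requests.Session() as s:
    s.headers = {"User-Agent": "Mozilla/5.0"}
    res = s.get(url)
    soup = BeautifulSoup(res.text, "lxml")
    # Every sibling element that follows the heading containing #Plot
    plot_start = soup.select_one("#Plot").find_parent().find_next_siblings()
    plot = []
    for item in plot_start:
        if item.name != 'h2':
            plot.append(item.text)
        else:
            # The next h2 starts a new section, so stop here
            break
    print(plot)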
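The same stop-at-the-next-heading idea can be tried offline. The sketch below is only an illustration: SAMPLE_HTML is a made-up fragment that assumes each section heading is an h2 wrapping an element with the section id (as the code above assumes), and section_paragraphs is a hypothetical helper name. It uses itertools.takewhile in place of the explicit loop and keeps only p tags.

```python
from itertools import takewhile

from bs4 import BeautifulSoup

# Made-up markup for illustration; the real page layout may differ.
SAMPLE_HTML = """
<div>
<p>Intro paragraph.</p>
<h2><span id="Plot">Plot</span></h2>
<p>First plot paragraph.</p>
<p>Second plot paragraph.</p>
<h2><span id="Cast">Cast</span></h2>
<p>Cast paragraph.</p>
</div>
"""


def section_paragraphs(html_text, section_id):
    """Return the text of the p tags between a section's h2 and the next h2."""
    soup = BeautifulSoup(html_text, "html.parser")
    heading = soup.find(id=section_id).find_parent("h2")
    # True matches every tag, so text nodes between tags are skipped.
    siblings = heading.find_next_siblings(True)
    # Stop consuming siblings as soon as the next section heading appears.
    in_section = takewhile(lambda tag: tag.name != "h2", siblings)
    return [tag.get_text(strip=True) for tag in in_section if tag.name == "p"]


plot = section_paragraphs(SAMPLE_HTML, "Plot")
print(plot)
```

If the headings on the live page are wrapped in an extra container, the find_parent("h2") step would need adjusting to match that structure.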
Upvotes: 1
Reputation: 52665
If using BeautifulSoup is not mandatory, you can try the code below to get the required text content:
import requests
from lxml import html

# url is the same as in the question
with requests.Session() as s:
    s.headers = {"User-Agent": "Mozilla/5.0"}
    res = s.get(url)
    source = html.fromstring(res.content)
    # Select every p whose nearest preceding h2 has a span child with the text "Plot"
    plot = [item.text_content() for item in source.xpath('//p[preceding::h2[1][span="Plot"]]')]
    print(plot)
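To see why that XPath stops at the end of the section, it may help to run it against a small fragment. In this sketch SAMPLE_HTML is invented markup that assumes each h2 contains a span holding the section title, and paragraphs_under is a hypothetical helper. The predicate preceding::h2[1] picks the closest h2 before each p, so a paragraph matches only when that closest heading is the requested one.

```python
from lxml import html

# Invented markup for illustration; the real page layout may differ.
SAMPLE_HTML = """
<html><body>
<p>Intro paragraph.</p>
<h2><span id="Plot">Plot</span></h2>
<p>First plot paragraph.</p>
<p>Second plot paragraph.</p>
<h2><span id="Cast">Cast</span></h2>
<p>Cast paragraph.</p>
</body></html>
"""


def paragraphs_under(html_text, title):
    """Return the text of each p whose nearest preceding h2 has the given title."""
    tree = html.fromstring(html_text)
    # $title is an XPath variable bound through the keyword argument below.
    query = "//p[preceding::h2[1][span=$title]]"
    return [p.text_content().strip() for p in tree.xpath(query, title=title)]


plot = paragraphs_under(SAMPLE_HTML, "Plot")
print(plot)
```

The intro paragraph is skipped because it has no preceding h2 at all, and the Cast paragraph is skipped because its nearest preceding h2 is the Cast heading.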
Upvotes: 1