Reputation: 417
The webpage I'm scraping has paragraphs and headings structured this way:
<p>
<strong>
<a href="https://dummy.com" class="">This is a link heading</a>
</strong>
</p>
<p>
Content To Be Pulled
</p>
I wrote the following code to pull the link heading's content:
for anchor in soup.select('#pcl-full-content > p > strong > a'):
    signs.append(anchor.text)
The next part confuses me: the text I want to collect next is in the <p> tag that follows the <p> tag containing the link. I cannot use .next_sibling here because that paragraph is outside the link's parent <p> tag.
How do I select the following paragraph, given that the <p> before it contains a link?
Upvotes: 1
Views: 83
Reputation: 84465
One option is to extract the article text from the JSON embedded in a script tag, though you will then need to split the text by horoscope:
import requests, re, json

# Request the page with a browser-like User-Agent header.
r = requests.get('https://indianexpress.com/article/horoscope/weekly-horoscope-june-6-june-12-gemini-cancer-taurus-and-other-signs-check-astrological-prediction-7346080/',
                 headers={'User-Agent': 'Mozilla/5.0'})
# Pull out the embedded JSON object that carries the articleBody field.
data = json.loads(re.search(r'(\{"@context.*articleBody.*\})', r.text).group(1))
print(data['articleBody'])
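To show what that regex is doing without hitting the network, here is a minimal sketch on a made-up HTML string. The sample markup and its text are assumptions for illustration only, not the real page source:

```python
import json
import re

# Hypothetical page source: one JSON object embedded in a <script> tag.
sample_html = (
    '<html><head><script type="application/ld+json">'
    '{"@context": "https://schema.org", "@type": "NewsArticle", '
    '"articleBody": "ARIES Sample text one. TAURUS Sample text two."}'
    '</script></head><body></body></html>'
)

# Same pattern as above: capture from {"@context through the last }
# that follows the articleBody key on the same line.
match = re.search(r'(\{"@context.*articleBody.*\})', sample_html)
data = json.loads(match.group(1))
article_body = data['articleBody']
print(article_body)
```

Because `.` does not match newlines by default, this pattern only works when the whole JSON object sits on one line, which is assumed here.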
You could get the horoscopes separately as follows. This dynamically determines which horoscopes are present, and in what order:
import requests, re, json

r = requests.get('https://indianexpress.com/article/horoscope/horoscope-today-april-6-2021-sagittarius-leo-aries-and-other-signs-check-astrological-prediction-7260276/',
                 headers={'User-Agent': 'Mozilla/5.0'})
data = json.loads(re.search(r'(\{"@context.*articleBody.*\})', r.text).group(1))
# print(data['articleBody'])
signs = ['ARIES', 'TAURUS', 'GEMINI', 'CANCER', 'LEO', 'VIRGO', 'LIBRA', 'SCORPIO', 'SAGITTARIUS', 'CAPRICORN', 'AQUARIUS', 'PISCES']
p = re.compile('|'.join(signs))
# Keep only the signs that occur in the text, in order of appearance.
signs = p.findall(data['articleBody'])
for number, sign in enumerate(signs):
    if number < len(signs) - 1:
        # Capture from this sign up to, but not including, the next sign.
        print(re.search(f'({sign}.*?){signs[number + 1]}', data['articleBody']).group(1))
    else:
        # The last sign runs to the end of the text.
        print(re.search(f'({sign}.*)', data['articleBody']).group(1))
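If you would rather stay in the HTML, as the question asks, another possibility is to climb from each link to its enclosing <p> and then take that paragraph's next <p> sibling. This is only a sketch: the markup below is a hypothetical stand-in modelled on the snippet in the question, with the container id taken from the asker's selector, and the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup based on the question's snippet.
html = """
<div id="pcl-full-content">
  <p><strong><a href="https://dummy.com" class="">This is a link heading</a></strong></p>
  <p>Content To Be Pulled</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

pairs = []
for anchor in soup.select("#pcl-full-content > p > strong > a"):
    # Walk up to the <p> that wraps the link...
    heading_p = anchor.find_parent("p")
    # ...then step to the next <p> at the same level, skipping whitespace nodes.
    content_p = heading_p.find_next_sibling("p")
    if content_p is not None:
        pairs.append((anchor.get_text(strip=True), content_p.get_text(strip=True)))

print(pairs)
```

Using find_next_sibling("p") rather than .next_sibling avoids landing on the whitespace text node that usually sits between two tags.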
Upvotes: 1