Aman

Reputation: 417

BeautifulSoup: Select P tag that comes after another P tag which should contain a link

The webpage I'm scraping has paragraphs and headings structured this way:

<p>
  <strong>
  <a href="https://dummy.com" class="">This is a link heading</a>
  </strong>
</p>
<p>
  Content To Be Pulled
</p>

I wrote the following code to pull the link heading's content:

for anchor in soup.select('#pcl-full-content > p > strong > a'):
    signs.append(anchor.text)

The next part confuses me: the text I want to collect next is in the <p> tag that follows the <p> tag containing the link. I cannot use .next_sibling on the anchor here, because the paragraph I want is outside the anchor's parent <p> tag.

How do I choose the following paragraph given that the <p> before it contained a link?

Upvotes: 1

Views: 83

Answers (1)

QHarr

Reputation: 84465

One option is to extract the article body from the JSON embedded in a script tag, though you will then need to split the text by horoscope:

import requests, re, json

r = requests.get('https://indianexpress.com/article/horoscope/weekly-horoscope-june-6-june-12-gemini-cancer-taurus-and-other-signs-check-astrological-prediction-7346080/',
                  headers = {'User-Agent':'Mozilla/5.0'})

data = json.loads(re.search(r'(\{"@context.*articleBody.*\})', r.text).group(1))
print(data['articleBody'])
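
The regular expression above is greedy and runs over the whole page, so it can break if the page contains more than one JSON block. As a rough sketch of a more defensive variant, the snippet below parses each script block separately and keeps the first one that has an articleBody key. It assumes the data sits in script tags of type application/ld+json; SAMPLE_HTML and find_article_body are hypothetical names used for illustration, and the sample markup stands in for r.text.

```python
import json
import re

# Hypothetical stand-in for r.text; the real page's markup may differ.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">{"@context": "https://schema.org", "@type": "WebSite", "name": "Example"}</script>
<script type="application/ld+json">{"@context": "https://schema.org", "@type": "NewsArticle", "articleBody": "ARIES Good week. TAURUS Stay calm."}</script>
</head><body></body></html>
"""

# Assumption: the JSON lives in <script type="application/ld+json"> tags.
LD_JSON_RE = re.compile(
    r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)


def find_article_body(html):
    """Return the first articleBody found in any JSON block, else None."""
    for match in LD_JSON_RE.finditer(html):
        try:
            data = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # skip blocks that are not valid JSON
        if isinstance(data, dict) and "articleBody" in data:
            return data["articleBody"]
    return None


article_body = find_article_body(SAMPLE_HTML)
print(article_body)
```

With the live page you would pass r.text instead of SAMPLE_HTML and check for None before using the result.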

You could get the horoscopes separately as follows. This dynamically determines which horoscopes are present, and in what order:

import requests, re, json

r = requests.get('https://indianexpress.com/article/horoscope/horoscope-today-april-6-2021-sagittarius-leo-aries-and-other-signs-check-astrological-prediction-7260276/',
                  headers = {'User-Agent':'Mozilla/5.0'})

data = json.loads(re.search(r'(\{"@context.*articleBody.*\})', r.text).group(1))
# print(data['articleBody'])
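signs = ['ARIES', 'TAURUS', 'GEMINI', 'CANCER', 'LEO', 'VIRGO', 'LIBRA', 'SCORPIO', 'SAGITTARIUS', 'CAPRICORN', 'AQUARIUS', 'PISCES']
# Find the sign names in the order they actually appear in the article body.
pattern = re.compile('|'.join(signs))
present = pattern.findall(data['articleBody'])

for number, sign in enumerate(present):
    if number < len(present) - 1:
        # Capture from this sign up to (but excluding) the next sign's name.
        # re.DOTALL lets the match span line breaks in the article body.
        print(re.search(f'({sign}.*?){present[number + 1]}', data['articleBody'], re.DOTALL).group(1))
    else:
        # The last sign runs to the end of the article body.
        print(re.search(f'({sign}.*)', data['articleBody'], re.DOTALL).group(1))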
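
If you would rather stay within BeautifulSoup, as the question asks, one possible approach is to walk up from each anchor to its enclosing <p> and then take that paragraph's next <p> sibling. The sketch below runs on sample markup modelled on the question, not on the live page. The wrapping div with id pcl-full-content is an assumption taken from the selector in the question, and the second heading/content pair is made up for illustration. It uses select() rather than find_all(), since find_all() does not interpret CSS selectors.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question. The wrapping <div> and the second
# heading/content pair are assumptions added for illustration.
html = """
<div id="pcl-full-content">
  <p><strong><a href="https://dummy.com" class="">This is a link heading</a></strong></p>
  <p>Content To Be Pulled</p>
  <p><strong><a href="https://dummy.com" class="">Another heading</a></strong></p>
  <p>More content</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

pairs = []
for anchor in soup.select("#pcl-full-content > p > strong > a"):
    # Climb from the <a> to the <p> that wraps it.
    heading_p = anchor.find_parent("p")
    # find_next_sibling skips whitespace text nodes and returns the next <p>.
    content_p = heading_p.find_next_sibling("p")
    if content_p is not None:
        pairs.append((anchor.get_text(strip=True), content_p.get_text(strip=True)))

for heading, content in pairs:
    print(heading, "->", content)
```

Whether this works on the real page depends on the content paragraphs being direct siblings of the heading paragraphs, which I have not verified.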

Upvotes: 1
