sandy

Reputation: 191

How to extract html links with a matching word from a website using python

I have a URL, say http://www.bbc.com/news/world/asia/. From just this page, I want to extract all the links that contain India, INDIA or india (the match should be case insensitive).

If I click any of the output links, it should take me to the corresponding page. For example, two of the links that contain india are "India shock over Dhoni retirement" and "India fog continues to cause chaos". Clicking these should take me to http://www.bbc.com/news/world-asia-india-30640436 and http://www.bbc.com/news/world-asia-india-30630274 respectively.

import urllib
from bs4 import BeautifulSoup
import re
import requests
url = "http://www.bbc.com/news/world/asia/"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data)
only_links = SoupStrainer('a', href=re.compile('india'))
print (only_links)

The above is my very basic, minimal attempt, written in Python 3.4.2.

Upvotes: 4

Views: 2776

Answers (1)

Martijn Pieters

Reputation: 1121256

You need to search for the word india in the displayed text. To do this you'll need a custom function instead:

from bs4 import BeautifulSoup
import requests

url = "http://www.bbc.com/news/world/asia/"
r = requests.get(url)
soup = BeautifulSoup(r.content)

india_links = lambda tag: (getattr(tag, 'name', None) == 'a' and
                           'href' in tag.attrs and
                           'india' in tag.get_text().lower())
results = soup.find_all(india_links)

The india_links lambda finds all tags that are <a> links with an href attribute and contain india (case insensitive) somewhere in the displayed text.
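To see the three conditions in isolation, here is a minimal sketch that runs the same test as a named function against a small inline snippet instead of the live page. The SAMPLE_HTML markup and the is_india_link name are made up for illustration, and the parser is named explicitly as an assumption (the code above relies on the default):

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the fetched page.
SAMPLE_HTML = """
<a href="/news/india-1">INDIA fog continues</a>
<a href="/news/china-1">China trade figures</a>
<a name="india-anchor">india (no href)</a>
<a href="/news/india-2"><span>Court boost to India BJP chief</span></a>
"""


def is_india_link(tag):
    """Return True for <a> tags that have an href and mention 'india' in their text."""
    return (tag.name == 'a'
            and tag.has_attr('href')
            and 'india' in tag.get_text().lower())


# 'html.parser' is the stdlib-backed parser; chosen here only to keep the sketch dependency-light.
soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
matches = soup.find_all(is_india_link)
hrefs = [tag['href'] for tag in matches]
print(hrefs)
```

The China link fails the text test, the anchor without an href fails the attribute test, and the text nested inside the <span> is still picked up because get_text() collects text from all descendants.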

Note that I passed in the .content attribute of the requests response object (the raw bytes) rather than .text; leave decoding to BeautifulSoup!

Demo:

>>> from bs4 import BeautifulSoup
>>> import requests
>>> url = "http://www.bbc.com/news/world/asia/"
>>> r = requests.get(url)
>>> soup = BeautifulSoup(r.content)
>>> india_links = lambda tag: getattr(tag, 'name', None) == 'a' and 'href' in tag.attrs and 'india' in tag.get_text().lower()
>>> results = soup.find_all(india_links)
>>> from pprint import pprint
>>> pprint(results)
[<a href="/news/world/asia/india/">India</a>,
 <a class="story" href="/news/world-asia-india-30647504" rel="published-1420102077277">India scheme to monitor toilet use </a>,
 <a class="story" href="/news/world-asia-india-30640444" rel="published-1420022868334">India to scrap tax breaks on cars</a>,
 <a class="story" href="/news/world-asia-india-30640436" rel="published-1420012598505">India shock over Dhoni retirement</a>,
 <a href="/news/world/asia/india/">India</a>,
 <a class="headline-anchor" href="/news/world-asia-india-30630274" rel="published-1419931669523"><img alt="A Delhi police officer with red flag walks amidst morning fog in Delhi, India, Monday, Dec 29, 2014. " src="http://news.bbcimg.co.uk/media/images/79979000/jpg/_79979280_79979240.jpg"/><span class="headline heading-13">India fog continues to cause chaos</span></a>,
 <a class="headline-anchor" href="/news/world-asia-india-30632852" rel="published-1419940599384"><span class="headline heading-13">Court boost to India BJP chief</span></a>,
 <a class="headline-anchor" href="/sport/0/cricket/30632182" rel="published-1419930930045"><span class="headline heading-13">India captain Dhoni quits Tests</span></a>,
 <a class="story" href="http://www.bbc.co.uk/news/world-radio-and-tv-15386555" rel="published-1392018507550"><img alt="A woman riding a scooter waits for a traffic signal along a street in Mumbai February 5, 2014." src="http://news.bbcimg.co.uk/media/images/72866000/jpg/_72866856_020889093.jpg"/>Special report: India Direct</a>,
 <a href="/2/hi/south_asia/country_profiles/1154019.stm">India</a>]

Note the http://www.bbc.co.uk/news/world-radio-and-tv-15386555 link here. We had to use the lambda search because a search with a text regular expression would not have found that element: the contained text (Special report: India Direct) is not the only element in the tag, and thus would not be found.

A similar problem applies to the /news/world-asia-india-30632852 link; because of the nested <span> element, the Court boost to India BJP chief headline text is not a direct child element of the link tag.
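The mixed-content case (an image plus text inside one link) can be reproduced offline. This is only a sketch: the two-link markup is invented, and it assumes a BeautifulSoup version that accepts the string= keyword (older releases call it text=):

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: one link with text only, one with an <img> plus text.
SAMPLE_HTML = (
    '<a href="/plain">India</a>'
    '<a href="/mixed"><img src="photo.jpg"/>Special report: India Direct</a>'
)
soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
pattern = re.compile('india', re.IGNORECASE)

# Regex search on the tag's string: only works when the text is the tag's sole content.
by_string = [a['href'] for a in soup.find_all('a', string=pattern)]

# Function search on get_text(): looks at all text anywhere inside the tag.
by_text = [
    a['href']
    for a in soup.find_all(
        lambda tag: tag.name == 'a' and pattern.search(tag.get_text()) is not None
    )
]

print(by_string)
print(by_text)
```

The string search misses the /mixed link because that tag has two children, while the get_text() based function finds both.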

You can extract just the links with:

from urllib.parse import urljoin

result_links = [urljoin(url, tag['href']) for tag in results]

where all relative URLs are resolved relative to the original URL:

>>> from urllib.parse import urljoin
>>> result_links = [urljoin(url, tag['href']) for tag in results]
>>> pprint(result_links)
['http://www.bbc.com/news/world/asia/india/',
 'http://www.bbc.com/news/world-asia-india-30647504',
 'http://www.bbc.com/news/world-asia-india-30640444',
 'http://www.bbc.com/news/world-asia-india-30640436',
 'http://www.bbc.com/news/world/asia/india/',
 'http://www.bbc.com/news/world-asia-india-30630274',
 'http://www.bbc.com/news/world-asia-india-30632852',
 'http://www.bbc.com/sport/0/cricket/30632182',
 'http://www.bbc.co.uk/news/world-radio-and-tv-15386555',
 'http://www.bbc.com/2/hi/south_asia/country_profiles/1154019.stm']
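The list above contains the same URL more than once (the India index page is linked twice). If duplicates are not wanted, one possible follow-up is sketched below; it uses a short hand-copied subset of the hrefs rather than the live results, and it assumes Python 3.7 or newer, where plain dicts keep insertion order:

```python
from urllib.parse import urljoin

base_url = "http://www.bbc.com/news/world/asia/"

# A few hrefs copied from the demo output; one relative link appears twice.
hrefs = [
    "/news/world/asia/india/",
    "/news/world-asia-india-30640436",
    "/news/world/asia/india/",
    "http://www.bbc.co.uk/news/world-radio-and-tv-15386555",
]

# Relative hrefs are resolved against base_url; absolute ones pass through unchanged.
absolute = [urljoin(base_url, href) for href in hrefs]

# dict.fromkeys() drops repeats while keeping first-seen order.
unique_links = list(dict.fromkeys(absolute))
print(unique_links)
```

On older Python versions, collections.OrderedDict.fromkeys() would serve the same purpose.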

Upvotes: 3
