Reputation: 65
How do I get the last HTML link from a given page using BeautifulSoup? I am trying to get a link that contains lenta.ru. However, if a page contains more than one lenta.ru link, my code prints every one of them. I only want the last lenta.ru link, which is the pointer link for the translation.
I am getting these results:
http://lenta.ru/news/2012/09/03/ipsos/ https://uynaa.wordpress.com/2012/09/06/%d0%b0%d1%80%d0%b0%d0%b2%d0%b4%d1%83%d0%b3%d0%b0%d0%b0%d1%80-%d1%81%d0%b0%d1%80%d1%8b%d0%bd-%d0%b1%d1%8d%d0%bb%d1%8d%d0%b3/
http://lenta.ru/news/2012/09/04/endofobama/ https://uynaa.wordpress.com/2012/09/06/%d0%b0%d1%80%d0%b0%d0%b2%d0%b4%d1%83%d0%b3%d0%b0%d0%b0%d1%80-%d1%81%d0%b0%d1%80%d1%8b%d0%bd-%d0%b1%d1%8d%d0%bb%d1%8d%d0%b3/
http://lenta.ru/news/2012/09/04/response/ https://uynaa.wordpress.com/2012/09/06/%d0%b0%d1%80%d0%b0%d0%b2%d0%b4%d1%83%d0%b3%d0%b0%d0%b0%d1%80-%d1%81%d0%b0%d1%80%d1%8b%d0%bd-%d0%b1%d1%8d%d0%bb%d1%8d%d0%b3/
http://www.lenta.ru/articles/2012/09/05/threat/ https://uynaa.wordpress.com/2012/09/06/%d0%b0%d1%80%d0%b0%d0%b2%d0%b4%d1%83%d0%b3%d0%b0%d0%b0%d1%80-%d1%81%d0%b0%d1%80%d1%8b%d0%bd-%d0%b1%d1%8d%d0%bb%d1%8d%d0%b3/
http://lenta.ru/articles/2012/08/21/terranova/ https://uynaa.wordpress.com/2012/08/23/%d1%85%d2%af%d0%bd-%d0%b1%d0%b0-%d0%bc%d3%a9%d1%81/
Expected output:
http://www.lenta.ru/articles/2012/09/05/threat/ https://uynaa.wordpress.com/2012/09/06/%d0%b0%d1%80%d0%b0%d0%b2%d0%b4%d1%83%d0%b3%d0%b0%d0%b0%d1%80-%d1%81%d0%b0%d1%80%d1%8b%d0%bd-%d0%b1%d1%8d%d0%bb%d1%8d%d0%b3/
http://lenta.ru/articles/2012/08/21/terranova/ https://uynaa.wordpress.com/2012/08/23/%d1%85%d2%af%d0%bd-%d0%b1%d0%b0-%d0%bc%d3%a9%d1%81/
My code:
import re
import requests
from lxml import html
from bs4 import BeautifulSoup
from urllib.request import urlopen

with open("./uynaa.txt") as inFile:
    uynaa_txt = inFile.readlines()

for tmp in uynaa_txt:
    html = urlopen(tmp).read()
    soup = BeautifulSoup(html, "lxml")
    for a in soup.select('div.entry a'):
        if "lenta.ru" in a.get('href', ''):
            print(a, tmp)
uynaa.txt
https://uynaa.wordpress.com/2012/09/06/%d0%b0%d1%80%d0%b0%d0%b2%d0%b4%d1%83%d0%b3%d0%b0%d0%b0%d1%80-%d1%81%d0%b0%d1%80%d1%8b%d0%bd-%d0%b1%d1%8d%d0%bb%d1%8d%d0%b3/
https://uynaa.wordpress.com/2012/08/23/%d1%85%d2%af%d0%bd-%d0%b1%d0%b0-%d0%bc%d3%a9%d1%81/
Upvotes: 2
Views: 127
Reputation: 344
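soup.select('div.entry a')[-1]
soup.select returns a list, and you can retrieve the last item of a list with [-1]. Note that this gives the last link in div.entry, whatever it points to. To get the last lenta.ru link, apply your "lenta.ru" check first, collect the matching links in a list, and take [-1] of that list.
If the page has only one matching link, the last item will also be the first item, but this shouldn't cause you any issues. If nothing matches, the list is empty and [-1] raises an IndexError, so check for that case.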
# full working code
from bs4 import BeautifulSoup
example_page = """
<body>
<a href="http://lenta.ru/news/2012/09/03/ipsos/ https://uynaa.wordpress.com/2012/09/06/%d0%b0%d1%80%d0%b0%d0%b2%d0%b4%d1%83%d0%b3%d0%b0%d0%b0%d1%80-%d1%81%d0%b0%d1%80%d1%8b%d0%bd-%d0%b1%d1%8d%d0%bb%d1%8d%d0%b3/"></a>
<a href="http://lenta.ru/news/2012/09/04/endofobama/ https://uynaa.wordpress.com/2012/09/06/%d0%b0%d1%80%d0%b0%d0%b2%d0%b4%d1%83%d0%b3%d0%b0%d0%b0%d1%80-%d1%81%d0%b0%d1%80%d1%8b%d0%bd-%d0%b1%d1%8d%d0%bb%d1%8d%d0%b3/" ></a>
<a href="http://lenta.ru/news/2012/09/04/response/ https://uynaa.wordpress.com/2012/09/06/%d0%b0%d1%80%d0%b0%d0%b2%d0%b4%d1%83%d0%b3%d0%b0%d0%b0%d1%80-%d1%81%d0%b0%d1%80%d1%8b%d0%bd-%d0%b1%d1%8d%d0%bb%d1%8d%d0%b3/" ></a>
</body>
"""
soup = BeautifulSoup(example_page, "lxml")
print(soup.body.select("a")[-1])
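To pick the last link that actually contains lenta.ru, rather than the last link of any kind, one option is to let the CSS selector do the filtering. The sketch below is only an illustration: the helper name last_matching_href, its parameters, and the sample markup are made up for this example, and it uses the built-in "html.parser" instead of "lxml" so it runs without the extra dependency.

```python
from bs4 import BeautifulSoup


def last_matching_href(page_html, needle="lenta.ru", container="div.entry"):
    """Return the href of the last link inside `container` whose href
    contains `needle`, or None if no link matches.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    soup = BeautifulSoup(page_html, "html.parser")
    # a[href*="..."] is a CSS attribute selector: substring match on href.
    matches = soup.select(f'{container} a[href*="{needle}"]')
    if not matches:
        # Avoid IndexError on pages with no matching link.
        return None
    return matches[-1]["href"]


# Made-up sample page: two lenta.ru links followed by an unrelated link.
sample = """
<div class="entry">
  <a href="http://lenta.ru/news/2012/09/03/ipsos/">first</a>
  <a href="http://www.lenta.ru/articles/2012/09/05/threat/">source</a>
  <a href="https://example.com/share">share</a>
</div>
"""

print(last_matching_href(sample))
# prints: http://www.lenta.ru/articles/2012/09/05/threat/
```

In the loop from the question, you would pass the downloaded page to this helper once per URL and print the result only when it is not None.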
Upvotes: 3