Sasha Yubaiva

Reputation: 65

How to get the last URL link element using BeautifulSoup

How do I get the last HTML link on a given page using BeautifulSoup? I am trying to get the link that contains lenta.ru. However, if a page contains more than one lenta.ru link, my code prints every one of them. I would like to get only the last lenta.ru link on each page, which is the link pointing to the source of the translation.

I am getting these results:

http://lenta.ru/news/2012/09/03/ipsos/ https://uynaa.wordpress.com/2012/09/06/%d0%b0%d1%80%d0%b0%d0%b2%d0%b4%d1%83%d0%b3%d0%b0%d0%b0%d1%80-%d1%81%d0%b0%d1%80%d1%8b%d0%bd-%d0%b1%d1%8d%d0%bb%d1%8d%d0%b3/
http://lenta.ru/news/2012/09/04/endofobama/ https://uynaa.wordpress.com/2012/09/06/%d0%b0%d1%80%d0%b0%d0%b2%d0%b4%d1%83%d0%b3%d0%b0%d0%b0%d1%80-%d1%81%d0%b0%d1%80%d1%8b%d0%bd-%d0%b1%d1%8d%d0%bb%d1%8d%d0%b3/
http://lenta.ru/news/2012/09/04/response/ https://uynaa.wordpress.com/2012/09/06/%d0%b0%d1%80%d0%b0%d0%b2%d0%b4%d1%83%d0%b3%d0%b0%d0%b0%d1%80-%d1%81%d0%b0%d1%80%d1%8b%d0%bd-%d0%b1%d1%8d%d0%bb%d1%8d%d0%b3/
http://www.lenta.ru/articles/2012/09/05/threat/ https://uynaa.wordpress.com/2012/09/06/%d0%b0%d1%80%d0%b0%d0%b2%d0%b4%d1%83%d0%b3%d0%b0%d0%b0%d1%80-%d1%81%d0%b0%d1%80%d1%8b%d0%bd-%d0%b1%d1%8d%d0%bb%d1%8d%d0%b3/
http://lenta.ru/articles/2012/08/21/terranova/ https://uynaa.wordpress.com/2012/08/23/%d1%85%d2%af%d0%bd-%d0%b1%d0%b0-%d0%bc%d3%a9%d1%81/

Expected output:

http://www.lenta.ru/articles/2012/09/05/threat/ https://uynaa.wordpress.com/2012/09/06/%d0%b0%d1%80%d0%b0%d0%b2%d0%b4%d1%83%d0%b3%d0%b0%d0%b0%d1%80-%d1%81%d0%b0%d1%80%d1%8b%d0%bd-%d0%b1%d1%8d%d0%bb%d1%8d%d0%b3/
http://lenta.ru/articles/2012/08/21/terranova/ https://uynaa.wordpress.com/2012/08/23/%d1%85%d2%af%d0%bd-%d0%b1%d0%b0-%d0%bc%d3%a9%d1%81/

My code:

from bs4 import BeautifulSoup
from urllib.request import urlopen

# Read one page URL per line, dropping trailing newlines and blank lines.
with open("./uynaa.txt") as inFile:
    uynaa_txt = [line.strip() for line in inFile if line.strip()]

for tmp in uynaa_txt:

    html = urlopen(tmp).read()
    soup = BeautifulSoup(html, "lxml")

    # Prints every lenta.ru link in the post body, not just the last one.
    for a in soup.select('div.entry a'):
        if "lenta.ru" in a.get('href', ''):
            print(a['href'], tmp)

uynaa.txt

https://uynaa.wordpress.com/2012/09/06/%d0%b0%d1%80%d0%b0%d0%b2%d0%b4%d1%83%d0%b3%d0%b0%d0%b0%d1%80-%d1%81%d0%b0%d1%80%d1%8b%d0%bd-%d0%b1%d1%8d%d0%bb%d1%8d%d0%b3/
https://uynaa.wordpress.com/2012/08/23/%d1%85%d2%af%d0%bd-%d0%b1%d0%b0-%d0%bc%d3%a9%d1%81/

Upvotes: 2

Views: 127

Answers (1)

FBB

Reputation: 344

Solution

soup.select('div.entry a')[-1]

Explanation

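soup.select returns a list of matching tags, in document order. You can retrieve the last item in a list with [-1]. If the page has only one link that matches, the last item will also be the first item, but this shouldn't cause you any issues. Keep in mind that [-1] raises an IndexError when the list is empty, that is, when nothing on the page matches the selector.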
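Note that the selector above matches every link inside div.entry, so [-1] returns the last link whatever its target. If only the last link pointing at lenta.ru is wanted, one option is to filter first and index afterwards. The following is a minimal sketch rather than a definitive solution: the sample HTML, the example.com link, and the names lenta_links and last_lenta_link are made up for illustration, and html.parser is used in place of lxml only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page: two lenta.ru links followed by an unrelated one.
sample_page = """
<div class="entry">
  <a href="http://lenta.ru/news/2012/09/03/ipsos/">first</a>
  <a href="http://www.lenta.ru/articles/2012/09/05/threat/">source</a>
  <a href="https://example.com/share">share</a>
</div>
"""

soup = BeautifulSoup(sample_page, "html.parser")

# Keep only the links whose href mentions lenta.ru, in document order.
lenta_links = [
    a for a in soup.select("div.entry a")
    if "lenta.ru" in a.get("href", "")
]

# [-1] raises IndexError on an empty list, so guard against pages with no match.
last_lenta_link = lenta_links[-1]["href"] if lenta_links else None
print(last_lenta_link)
```

Here the unrelated trailing link is skipped, and a page without any lenta.ru link yields None instead of an exception.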

Full working code

from bs4 import BeautifulSoup

# Minimal page with three links; the last one is the one we want.
example_page = """
<body>
<a href="http://lenta.ru/news/2012/09/03/ipsos/"></a>
<a href="http://lenta.ru/news/2012/09/04/endofobama/"></a>
<a href="http://lenta.ru/news/2012/09/04/response/"></a>
</body>
"""
soup = BeautifulSoup(example_page, "lxml")

# select() returns a list of tags; [-1] picks the last one.
print(soup.body.select("a")[-1])
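Applied to the loop in the question, the same idea could look roughly like the sketch below. It assumes the div.entry structure and the uynaa.txt file described in the question. The function names last_link_containing and print_last_links are hypothetical, and html.parser stands in for lxml so the snippet has no extra dependency.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def last_link_containing(page_html, needle="lenta.ru", selector="div.entry a"):
    """Return the href of the last matching link, or None if there is none."""
    soup = BeautifulSoup(page_html, "html.parser")
    # Collect hrefs in document order; links without an href become "".
    hrefs = [a.get("href", "") for a in soup.select(selector)]
    matches = [href for href in hrefs if needle in href]
    return matches[-1] if matches else None


def print_last_links(path="./uynaa.txt"):
    """Print the last matching link next to each page URL listed in the file."""
    with open(path) as in_file:
        for line in in_file:
            url = line.strip()  # drop the trailing newline
            if not url:
                continue  # skip blank lines
            page_html = urlopen(url).read()
            print(last_link_containing(page_html), url)
```

Calling print_last_links() would then print one line per page. Keeping the parsing in its own function means it can be checked on a literal HTML string without any network access.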

Upvotes: 3
