Extract specific text from URL Python

Question

I'm trying to extract specific text from many URLs that are being returned. I'm using Python 2.7 with requests and BeautifulSoup.

I need to find the latest URL, which is the one with the highest number: for example "DF_7", with 7 being the highest among the URLs below. That URL will then be downloaded. New files are added each day, which is why I need to check for the one with the highest number.

Once I find the highest number in the list of URLs, I need to join "https://service.rl360.com/scripts/customer.cgi/SC/servicing/" to the URL with the highest number. The final result should look like this: https://service.rl360.com/scripts/customer.cgi/SC/servicing/downloads.php?Reference=DF_7&SortField=ExpiryDays&SortOrder=Ascending

The URLs look like this, with only the DF_ number incrementing each time.

Is this the right approach? If so, how do I go about doing this?

Thanks

import base
import requests
import zipfile, StringIO, re
from lxml import html
from bs4 import BeautifulSoup

from base import os

from django.conf import settings

# Fill in your details here to be posted to the login form.
payload = {
    'USERNAME': 'xxxxxx',
    'PASSWORD': 'xxxxxx',
    'option': 'login'
}

headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'}

# Use 'with' to ensure the session context is closed after use.

with requests.Session() as s:
    p = s.post('https://service.rl360.com/scripts/customer.cgi?option=login', data=payload)

    # An authorised request.
    r = s.get('https://service.rl360.com/scripts/customer.cgi/SC/servicing/downloads.php?Folder=DataDownloads&SortField=ExpiryDays&SortOrder=Ascending', stream=True)
    content = r.text
    soup = BeautifulSoup(content, 'lxml')
    table = soup.find('table')
    links = table.find_all('a')
    print links

Dan-Dev · Accepted Answer

You can go straight to the last link with the class "tabletd" and print its href value like this:

href = soup.find_all("a", {'class':'tabletd'})[-1]['href']
base = "https://service.rl360.com/scripts/customer.cgi/SC/servicing/"
print (base + href)
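
Taking the last link works only if the page always lists the newest file last. If that ordering can't be relied on, a rough alternative is to read the DF_ number out of each href and pick the largest one. This is only a sketch: the helper name latest_download_url and the sample hrefs are made up for illustration, and the href format is assumed from the example URL in the question. In the question's script the list would come from something like [a['href'] for a in links].

```python
import re

BASE = "https://service.rl360.com/scripts/customer.cgi/SC/servicing/"

# Captures the number in "Reference=DF_<n>" (format assumed from the question).
DF_NUMBER = re.compile(r"Reference=DF_(\d+)")


def latest_download_url(hrefs, base=BASE):
    """Return base + the href with the highest DF_ number, or None if none match."""
    numbered = []
    for href in hrefs:
        match = DF_NUMBER.search(href)
        if match:
            numbered.append((int(match.group(1)), href))
    if not numbered:
        return None
    # Compare as integers so that DF_10 ranks above DF_9.
    _, best_href = max(numbered, key=lambda pair: pair[0])
    return base + best_href


# Hypothetical hrefs, for illustration only.
sample_hrefs = [
    "downloads.php?Reference=DF_9&SortField=ExpiryDays&SortOrder=Ascending",
    "downloads.php?Reference=DF_10&SortField=ExpiryDays&SortOrder=Ascending",
    "downloads.php?Folder=DataDownloads&SortField=ExpiryDays&SortOrder=Ascending",
]
print(latest_download_url(sample_hrefs))
```

The numbers are compared as integers rather than strings because a plain string comparison would put "DF_9" after "DF_10". Links without a DF_ reference are skipped.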
