Nic Palvie

Reputation: 55

Extract specific text from URL Python

I'm trying to extract specific text from many URLs that are being returned. I'm using Python 2.7 with requests and BeautifulSoup.

I need to find the latest URL, which can be identified by the highest number in its reference, e.g. "DF_7", with 7 being the highest among the URLs below. This URL will then be downloaded. Note that new files are added each day, which is why I need to check for the one with the highest number.

Once I find the highest number in the list of URLs, I need to join "https://service.rl360.com/scripts/customer.cgi/SC/servicing/" to the URL with the highest number. The final result should look like this: https://service.rl360.com/scripts/customer.cgi/SC/servicing/downloads.php?Reference=DF_7&SortField=ExpiryDays&SortOrder=Ascending

The URLs all look like this, with only the DF_ number incrementing each time.

Is this the right approach? If so, how do I go about doing this?

Thanks

import base
import requests
import zipfile, StringIO, re
from lxml import html
from bs4 import BeautifulSoup

from base import os

from django.conf import settings

# Fill in your details here to be posted to the login form.
payload = {
    'USERNAME': 'xxxxxx',
    'PASSWORD': 'xxxxxx',
    'option': 'login'
}

headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'}

# Use 'with' to ensure the session context is closed after use.

with requests.Session() as s:
    p = s.post('https://service.rl360.com/scripts/customer.cgi?option=login', data=payload)

    # An authorised request.
    r = s.get('https://service.rl360.com/scripts/customer.cgi/SC/servicing/downloads.php?Folder=DataDownloads&SortField=ExpiryDays&SortOrder=Ascending', stream=True)
    content = r.text
    soup = BeautifulSoup(content, 'lxml')
    table = soup.find('table')
    links = table.find_all('a')
    print links

Upvotes: 2

Views: 689

Answers (1)

Dan-Dev

Reputation: 9430

You can go straight to the last link with the class "tabletd" and print its href value like this:

href = soup.find_all("a", {'class':'tabletd'})[-1]['href']
base = "https://service.rl360.com/scripts/customer.cgi/SC/servicing/"
print (base + href)
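
If the last link in the table is not guaranteed to be the one with the highest number, a more defensive option is to read the number out of each href and take the maximum. The following is only a minimal sketch: the function name `latest_download_url`, the regular expression, and the sample hrefs are assumptions made for illustration, since the real hrefs on the page are not shown in the question. Comparing the numbers as integers matters here, because a plain string sort would place DF_10 before DF_7.

```python
import re

try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2.7

BASE = "https://service.rl360.com/scripts/customer.cgi/SC/servicing/"

# Assumed pattern: the number follows "Reference=DF_" in the href.
DF_NUMBER = re.compile(r"Reference=DF_(\d+)")


def latest_download_url(hrefs, base=BASE):
    """Return the absolute URL of the href with the highest DF_<n>, or None."""
    numbered = []
    for href in hrefs:
        match = DF_NUMBER.search(href)
        if match:
            # Store the number as an int so that 10 ranks above 7.
            numbered.append((int(match.group(1)), href))
    if not numbered:
        return None
    _, best = max(numbered, key=lambda pair: pair[0])
    return urljoin(base, best)


# Hypothetical hrefs, shaped like the final URL given in the question.
sample_hrefs = [
    "downloads.php?Reference=DF_2&SortField=ExpiryDays&SortOrder=Ascending",
    "downloads.php?Reference=DF_10&SortField=ExpiryDays&SortOrder=Ascending",
    "downloads.php?Reference=DF_7&SortField=ExpiryDays&SortOrder=Ascending",
    "downloads.php?Folder=DataDownloads",
]
print(latest_download_url(sample_hrefs))
```

With the soup from the question, the list of hrefs could be built with something like `[a['href'] for a in soup.find_all('a', href=True)]` and passed to the function.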

Upvotes: 1
