Compoot

Reputation: 2377

Python extract title from URL

I am using the following function to try and extract titles from a list of web scraped urls.

I did look at some SO answers, but noticed that many suggest avoiding regex solutions. I would like to fix and build on my existing solution, but I am happy for additional elegant solutions to be suggested.

Example url 1: https://upload.wikimedia.org/wikipedia/commons/thumb/b/bd/Rembrandt_van_Rijn_-_Self-Portrait_-_Google_Art_Project.jpg/220px-Rembrandt_van_Rijn_-_Self-Portrait_-_Google_Art_Project.jpg

Example url 2: https://upload.wikimedia.org/wikipedia/commons/thumb/a/ae/Rembrandt_-_Rembrandt_and_Saskia_in_the_Scene_of_the_Prodigal_Son_-_Google_Art_Project.jpg/220px-Rembrandt_-_Rembrandt_and_Saskia_in_the_Scene_of_the_Prodigal_Son_-_Google_Art_Project.jpg

Code (function) that seeks to extract the title from the url.

def titleextract(url):
    #return unquote(url[58:url.rindex("/",58)-8].replace('_',''))
    cleanedtitle1=url[58:]
    title= cleanedtitle1.strip("-_Google_Art_Project.jpg/220px-")
    return title

The above has the following effect on the URLs:

Url 1: Rembrandt_-Rembrandt_and_Saskia_in_the_Scene_of_the_Prodigal_Son-Google_Art_Project.jpg/220px-Rembrandt-Rembrandt_and_Saskia_in_the_Scene_of_the_Prodigal_Son-_Google_Art_Project.jpg

Url 2: Rembrandt_van_Rijn_-Saskia_van_Uylenburgh%2C_the_Wife_of_the_Artist-Google_Art_Project.jpg/220px-Rembrandt_van_Rijn-Saskia_van_Uylenburgh%2C_the_Wife_of_the_Artist-_Google_Art_Project.jpg

The desired output however is:

Url 1: Rembrandt_-_Rembrandt_and_Saskia_in_the_Scene_of_the_Prodigal_Son

Url 2: Rembrandt_van_Rijn_-_Saskia_van_Uylenburgh2C_the_Wife_of_the_Artist

What I am struggling with is removing everything after this: _-Google_Art_Project.jpg/220px-Rembrandt-Rembrandt_and_Saskia_in_the_Scene_of_the_Prodigal_Son-_Google_Art_Project.jpg for each unique case, and then also removing the unwanted characters if they exist, e.g. % in url2.

Ideally, I would also like to get rid of the underscores in the title.

Any suggestions using my existing code with a suitable step by step explanation would be appreciated.

My attempt to remove the beginning has worked:

cleanedtitle1=url[58:]

But I have tried various things to strip the characters and remove the end, which have not worked:

title= cleanedtitle1.strip("-_Google_Art_Project.jpg/220px-")

Based on one suggestion, I also tried:

return unquote(url[58:url.rindex("/",58)-8].replace('_',''))

...but this does not remove the undesired text correctly. It only cuts a fixed 8 characters, and since the length of the unwanted text varies, this does not work.

I also tried this to remove the underscores, again with no luck.

    cleanedtitle1=url[58:]
    cleanedtitle2= cleanedtitle1.strip("-_Google_Art_Project.jpg/220px-")
    title = cleanedtitle2.strip("_")
    return title

My imports so far are:

from flask import Flask, render_template,url_for #importing flask class
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
from urllib.parse import unquote

I would be happy to accept a re related answer, for learning purposes, but would ideally also like something that completes what I've already started.

For answers using BeautifulSoup only, here is the whole code for completeness (this would also be very useful for reference)

from flask import Flask, render_template,url_for #importing flask class
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
from urllib.parse import unquote

app = Flask(__name__) #setting app variable to instance of flask class

@app.route('/') #this is what we type into our browser to go to pages. we create these using routes
@app.route('/home')
def home():
    images=imagescrape()
    titles=(titleextract(src) for src in images)
    images_titles=zip(images,titles)
    return render_template('home.html',images=images,images_titles=images_titles)   

def titleextract(url):
    pos1 = url.rindex("/")
    pos2 = url[:pos1].rindex("/")
    cleanedtitle1 = url[pos2 + 1: pos1]
    title = cleanedtitle1.replace("_-_Google_Art_Project.jpg", "")
    title = title.replace("_", " ")
    return title


def imagescrape():
    result_images=[]
    html = urlopen('https://en.wikipedia.org/wiki/Rembrandt')
    bs = BeautifulSoup(html, 'html.parser')
    images = bs.find_all('img', {'src':re.compile('.jpg')})
    for image in images:
        result_images.append("https:"+image['src']+'\n') #concatenation!
    return result_images

Upvotes: 0

Views: 1559

Answers (3)

alani

Reputation: 13079

Starting with your:

cleanedtitle1=url[58:]

This would work, but it's probably not very robust to hard-code numbers, so let's instead start at the character after the second-to-last "/".

You could do this with regular expressions, but more simply, this might look like:

pos1 = url.rindex("/")  # index of last /
pos2 = url[:pos1].rindex("/")  # index of second-to-last /
cleanedtitle1 = url[pos2 + 1:]

Although actually, you're only interested in the bit between the second-to-last and the last /, so let's make use of the pos1 that we found as an intermediate:

pos1 = url.rindex("/")  # index of last /
pos2 = url[:pos1].rindex("/")  # index of second-to-last /
cleanedtitle1 = url[pos2 + 1: pos1]

Here, this gives the following value for cleanedtitle1

'Rembrandt_van_Rijn_-_Self-Portrait_-_Google_Art_Project.jpg'
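
For the regular-expression route mentioned above, one possible sketch is below. The pattern and the helper name segment_before_last_slash are illustrative assumptions, not part of the original code; the pattern captures the text between the second-to-last and the last /.

```python
import re

# Capture the path segment sitting between the second-to-last and last "/".
SEGMENT_RE = re.compile(r"/([^/]+)/[^/]*$")

def segment_before_last_slash(url):
    # Hypothetical helper: returns None instead of raising when nothing matches.
    match = SEGMENT_RE.search(url)
    return match.group(1) if match else None

url = ("https://upload.wikimedia.org/wikipedia/commons/thumb/b/bd/"
       "Rembrandt_van_Rijn_-_Self-Portrait_-_Google_Art_Project.jpg/"
       "220px-Rembrandt_van_Rijn_-_Self-Portrait_-_Google_Art_Project.jpg")
print(segment_before_last_slash(url))
```

Unlike rindex, which raises ValueError when no / is found, this version returns None for such input, which may or may not be what you want.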

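Now onto your strip. This won't do quite what you want: strip treats the string you gave it as a set of individual characters, and removes any of those characters from the beginning and the end of your string, stopping at the first character that is not in the set. It does not remove the string as a whole, and it never touches anything in the middle.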
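
To make the difference concrete, here is a small sketch using the file name from the first example URL. The variable names are only for illustration.

```python
name = "Rembrandt_van_Rijn_-_Self-Portrait_-_Google_Art_Project.jpg"
suffix = "_-_Google_Art_Project.jpg"

# strip() treats its argument as a set of characters and trims any of them
# from both ends, so it keeps going past the suffix and eats into the title.
stripped = name.strip(suffix)

# replace() removes the exact substring and nothing else.
replaced = name.replace(suffix, "")

print(stripped)  # the final "t" of "Portrait" is lost
print(replaced)  # Rembrandt_van_Rijn_-_Self-Portrait
```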

So instead, let's use replace, and replace the string with an empty string.

title = cleanedtitle1.replace("_-_Google_Art_Project.jpg", "")

We can also then do similarly:

title = title.replace("_", " ")

Then we end up with:

'Rembrandt van Rijn - Self-Portrait'

Putting it together:

pos1 = url.rindex("/")
pos2 = url[:pos1].rindex("/")
cleanedtitle1 = url[pos2 + 1: pos1]
title = cleanedtitle1.replace("_-_Google_Art_Project.jpg", "")
title = title.replace("_", " ")
return title

Update

I missed the fact that the URL may contain sequences such as %2C that we wish to replace.

These could be done using replace in the same way, for example:

url = url.replace("%2C", ",")

but then you would have to do it for all of the similar sequences that might occur, so it's better to use the unquote function available from urllib. If at the top of your code you put:

from urllib.parse import unquote

then you can do these replacements using

url = unquote(url)

before the rest of the processing:

from urllib.parse import unquote

def titleextract(url):
    url = unquote(url)
    pos1 = url.rindex("/")
    pos2 = url[:pos1].rindex("/")
    cleanedtitle1 = url[pos2 + 1: pos1]
    title = cleanedtitle1.replace("_-_Google_Art_Project.jpg", "")
    title = title.replace("_", " ")
    return title
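
As a quick check of what unquote does, here is a minimal sketch applied to the title fragment quoted in the question, rather than to a full URL (the variable names are illustrative):

```python
from urllib.parse import unquote

# Title fragment taken from the question; %2C is the percent-encoding of ",".
fragment = "Rembrandt_van_Rijn_-_Saskia_van_Uylenburgh%2C_the_Wife_of_the_Artist"

decoded = unquote(fragment)
title = decoded.replace("_", " ")
print(title)  # Rembrandt van Rijn - Saskia van Uylenburgh, the Wife of the Artist
```

One design point to be aware of: if a URL could ever contain an encoded slash (%2F), decoding before searching for / would change where the segments split. In that case it may be safer to extract the segment first and call unquote on it afterwards.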

Upvotes: 2

JimmyCarlos

Reputation: 1952

This should work. Let me know if you have any questions.

def titleextract(url):
    title = url[58:]
    marker = "_-_Google_Art_Project.jpg" # Include the leading "_" so no stray separator is left behind.
    if marker in title:
        title = title[:title.index(marker)] # Keep only the text before the marker.

    disallowed_chars = "%" # Edit which chars should go.
    # Python will look at each character in turn. If it is not in the disallowed chars string, 
    # then it will be left. "".join() joins together all chars still allowed. 
    title = "".join(c for c in title if c not in disallowed_chars)

    title = title.replace("_"," ") # Change underscores to spaces.
    return title

Upvotes: 1

Karan Singh

Reputation: 1164

There are a few ways to do this:

  1. If you just want to use the built-in Python string functions, then you could do it by first splitting the URL on / and then slicing off the parts that are common to all the URLs.
def titleextract(url):
    cleanedtitle1 = url.split("/")[-1]
    return cleanedtitle1[6:-4].replace('_',' ')
  2. Since you are already using a bs4 import you could do it by:
soup = BeautifulSoup(htmlString, 'html.parser')
title = soup.title.text
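
A variant of the first approach, sketched under the assumption that every URL follows the thumbnail layout shown in the question (the file name as the second-to-last segment, followed by a 220px- copy of it). The function name title_from_thumb_url and the SUFFIX constant are made up for this example.

```python
from urllib.parse import unquote

# Assumed common ending, taken from the example URLs in the question.
SUFFIX = "_-_Google_Art_Project.jpg"

def title_from_thumb_url(url):
    # Split from the right: [..., file name, "220px-" + file name].
    # strip() with no arguments only removes surrounding whitespace,
    # such as a trailing newline left over from scraping.
    name = unquote(url.strip().rsplit("/", 2)[-2])
    if name.endswith(SUFFIX):
        name = name[:-len(SUFFIX)]
    return name.replace("_", " ")

url = ("https://upload.wikimedia.org/wikipedia/commons/thumb/a/ae/"
       "Rembrandt_-_Rembrandt_and_Saskia_in_the_Scene_of_the_Prodigal_Son_-_Google_Art_Project.jpg/"
       "220px-Rembrandt_-_Rembrandt_and_Saskia_in_the_Scene_of_the_Prodigal_Son_-_Google_Art_Project.jpg")
print(title_from_thumb_url(url))
```

Checking for the suffix with endswith, instead of slicing by fixed lengths, means a file name without the Google Art Project ending is returned whole rather than truncated.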

Upvotes: 0
