Reputation: 311

How to process URLs with Python

I have the following code (doop.py), which strips a .html file of all the 'nonsense' HTML markup and outputs only the 'human-readable' text; e.g., it will take a file containing the following:

<html>
<body>

<a href="http://www.w3schools.com">
This is a link</a>

</body>
</html>

and give

$ ./doop.py
File name: htmlexample.html

This is a link

The next thing I need to do is add a function so that, if any of the HTML arguments within the file represents a URL (a web address), the program reads the content of the designated web page instead of a disk file. (For present purposes, it is sufficient for doop.py to recognize an argument beginning with http:// (in any mixture of letter cases) as a URL.)

I'm not sure where to start with this. I'm sure it would involve telling Python to open a URL, but how do I do that?

Thanks,

Upvotes: 0

Answers (4)

Tom

Reputation: 6981

As with most things pythonic: there is a library for that.

Here you need the urllib2 library.

This allows you to open a URL like a file and read from it like a file.

The code you would need would look something like this:

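import urllib2  # Python 2; in Python 3 this is urllib.request

urlString = "http://www.my.url"
try:
    f = urllib2.urlopen(urlString)  # open url
    pageString = f.read()           # read content
    f.close()                       # close url
    # getReadableText stands for your existing html-stripping function
    readableText = getReadableText(pageString)
    # continue using the pageString as you wish
except urllib2.URLError:            # raised when the url cannot be opened
    print("Bad URL")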
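
For the requirement in the question (treat an argument starting with http://, in any mixture of letter cases, as a URL and anything else as a disk file), a minimal sketch might look like the following. It is written for Python 3, where urlopen lives in urllib.request instead of urllib2. The names is_url and read_source are made up for illustration and are not part of doop.py.

```python
from urllib.request import urlopen  # Python 3 home of urllib2.urlopen


def is_url(name):
    """Return True if name starts with http:// in any mixture of letter cases."""
    return name.lower().startswith("http://")


def read_source(name):
    """Return the raw bytes of a web page or of a disk file (hypothetical helper)."""
    if is_url(name):
        # urlopen returns a file-like object, so it is read the same way as a file
        with urlopen(name) as page:
            return page.read()
    # Anything that is not recognized as a URL is treated as a path on disk
    with open(name, "rb") as disk_file:
        return disk_file.read()
```

With this, read_source("HTTP://www.w3schools.com") would go to the network, while read_source("htmlexample.html") would read from disk. In Python 3 both a failed download (urllib.error.URLError) and a missing file are subclasses of OSError, so a single except OSError around the call should cover both cases.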

Update: I don't have a Python interpreter to hand, so I can't test this code, but it should work.

Opening the URL is the easy part; first you need to extract the URLs from your HTML file. This is done using regular expressions (regexes), and unsurprisingly, Python has a library for that (re). I recommend that you read up on both regexes and the re module, but a regex is basically a pattern against which you can match text.

So what you need to do is write a regex that matches URLs:

(http|ftp|https)://[\w\-_]+(\.[\w\-_]+)+([\w\-.,@?^=%&:/~+#]*[\w\-@?^=%&/~+#])?

If you don't want to follow URLs to FTP resources, remove "ftp|" from the pattern. Now you can scan your input file for all character sequences that match this pattern:

import re

# Read the whole input file into one string (the file name is only an example)
with open("htmlexample.html") as input_file:
    input_file_str = input_file.read()

# Raw string, so the backslashes reach the regex engine unchanged
pattern = re.compile(r"(http|ftp|https)://[\w\-_]+(\.[\w\-_]+)+([\w\-.,@?^=%&:/~+#]*[\w\-@?^=%&/~+#])?")
# finditer yields match objects; findall would return tuples of the groups instead
for match in pattern.finditer(input_file_str):
    urlString = match.group(0)  # the full text that matched the pattern
    # use the code above to load the url using the matched string!

That should do it
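
To see the extraction step on its own, here is a small self-contained sketch run against the HTML from the question. It deliberately uses a simpler pattern than the one above (an assumption made for readability, not a full URL grammar), and the names URL_PATTERN and extract_urls are illustrative only.

```python
import re

# Simplified, illustrative pattern: a scheme, "://", then a run of characters
# that are not whitespace, quotes, or angle brackets.
URL_PATTERN = re.compile(r"(?:http|https|ftp)://[^\s\"'<>]+", re.IGNORECASE)


def extract_urls(text):
    """Return every URL-like substring of text, in order of appearance."""
    # The pattern has no capturing groups, so findall returns whole matches.
    return URL_PATTERN.findall(text)


sample_html = '''<html>
<body>

<a href="http://www.w3schools.com">
This is a link</a>

</body>
</html>'''

print(extract_urls(sample_html))  # ['http://www.w3schools.com']
```

The non-capturing group (?:...) matters here: with ordinary capturing groups, findall would return the group contents rather than the full URL.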

Upvotes: 1

Vikas

Reputation: 8958

Apart from urllib2, which others have already mentioned, you can take a look at the Requests module by Kenneth Reitz. It has a more concise and expressive syntax than urllib2.

import requests
r = requests.get('https://api.github.com', auth=('user', 'pass'))
r.text

Upvotes: 2

Christian Witts

Reputation: 11585

Rather than writing your own HTML parser/scraper, I would personally recommend Beautiful Soup. You can use it to load your HTML, pull out the elements you want, and find all the links, and then use urllib to fetch those links for further parsing and processing.
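
As a rough illustration of the same idea (let a real parser find the text and the links instead of hand-written string handling), here is a sketch that uses the standard library's html.parser rather than Beautiful Soup, so it runs without installing anything. The class name TextAndLinks is made up for this example, and it is a much cruder tool than Beautiful Soup: for instance, it would also collect the contents of script and style tags.

```python
from html.parser import HTMLParser


class TextAndLinks(HTMLParser):
    """Collect text content and href values (illustrative class name)."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        # Keep only non-blank text, stripped of surrounding whitespace
        stripped = data.strip()
        if stripped:
            self.text_parts.append(stripped)


sample_html = """<html>
<body>

<a href="http://www.w3schools.com">
This is a link</a>

</body>
</html>"""

parser = TextAndLinks()
parser.feed(sample_html)
parser.close()
print(parser.text_parts)  # ['This is a link']
print(parser.links)       # ['http://www.w3schools.com']
```

Each entry in parser.links could then be passed to urllib to fetch the linked page.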

Upvotes: 0