Python find file download link on webpage

Question

I need a regex that returns the text between double quotes that starts with a specified prefix and ends with a specific file extension (say .txt). I'm using urllib2 to get the HTML of the page (the HTML is quite simple).

Basically if I have something like


  
  new_Client-8.txt
  27-Jun-2012 18:02

It should just return to me

Client-8.txt

The returned value is the text contained within double quotes. I know that the file name starts with "Client-" and that the file extension is ".txt".

I'm playing around with re.search(regex, string), where the string I pass in is the HTML of the page. But I stink at regular expressions.

Thanks!

Simeon Visser · Accepted Answer

You should not use regular expressions for this task. It's far easier to write a script with BeautifulSoup to process the HTML and to find the element(s) you need.

In your case, you should search for all elements whose href attribute starts with Client- and ends with .txt. That will give you a list of all files.
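
As a rough sketch of that idea (parse the HTML, then filter on the href attribute), here is a dependency-free version that uses the standard library's html.parser in place of BeautifulSoup. It assumes Python 3, and the sample markup, the LinkCollector class, and the find_links helper are all made up for illustration, since the question does not show the real page.

```python
from html.parser import HTMLParser

# Hypothetical sample markup; the real page's HTML is not shown in the question.
# In practice this string would come from the downloaded page
# (urllib2 in Python 2, urllib.request in Python 3).
SAMPLE_HTML = """
<tr><td><a href="Client-8.txt">new_Client-8.txt</a></td><td>27-Jun-2012 18:02</td></tr>
<tr><td><a href="Client-9.zip">new_Client-9.zip</a></td><td>27-Jun-2012 18:05</td></tr>
<tr><td><a href="readme.txt">readme.txt</a></td><td>27-Jun-2012 18:07</td></tr>
"""


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that match a prefix and a suffix."""

    def __init__(self, prefix, suffix):
        super().__init__()
        self.prefix = prefix
        self.suffix = suffix
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for a bare attribute such as <a href>
            if (name == "href" and value
                    and value.startswith(self.prefix)
                    and value.endswith(self.suffix)):
                self.matches.append(value)


def find_links(html, prefix="Client-", suffix=".txt"):
    """Return all matching href values, in document order."""
    parser = LinkCollector(prefix, suffix)
    parser.feed(html)
    parser.close()
    return parser.matches


print(find_links(SAMPLE_HTML))  # ['Client-8.txt']
```

With BeautifulSoup installed, the same filter can be written as soup.find_all("a", href=re.compile(r"^Client-.*\.txt$")) and then reading the href of each result.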
