ZacAttack

Reputation: 2027

Python find file download link on webpage

I need a regex that returns the text between double quotes when that text starts with a specified prefix and ends with a specific file extension (say .txt). I'm using urllib2 to fetch the HTML of the page (the HTML is quite simple).

Basically if I have something like

<tr>
  <td valign="top"><img src="/icons/unknown.gif" alt="[   ]"></td>
  <td><a href="Client-8.txt">new_Client-8.txt</a></td>
  <td align="right">27-Jun-2012 18:02  </td>
</tr>

It should just return to me

Client-8.txt

The returned value is the one contained within double quotes. I know that the file name starts with "Client-" and that the file extension is ".txt".

I'm playing around with re.search(regex, string), where the string I pass in is the HTML of the page. But I stink at regular expressions.

Thanks!

Upvotes: 4

Views: 2713

Answers (2)

Ashwini Chaudhary

Reputation: 251146

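from bs4 import BeautifulSoup

# Parse the sample row and print every link whose href matches the expected file name
soup = BeautifulSoup('<tr><td valign="top"><img src="/icons/unknown.gif" alt="[   ]"></td><td><a href="Client-8.txt">new_Client-8.txt</a></td><td align="right">27-Jun-2012 18:02  </td></tr>', 'html.parser')
for a in soup.find_all('a', href=True):  # href=True skips anchors without an href
    href = a['href']
    if href.startswith('Client-') and href.endswith('.txt'):
        print(href)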

Upvotes: 1

Simeon Visser

Reputation: 122516

You should not use regular expressions for this task. It's far easier to write a script with BeautifulSoup to process the HTML and to find the element(s) you need.

In your case, you should search for all <a> elements whose href attribute starts with Client- and ends with .txt. That will give you a list of all files.
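
As a rough sketch of that filtering step, here is one way it could look. Note that this swaps in the standard library's html.parser module rather than BeautifulSoup, purely so it runs without installing anything; the names ClientLinkParser and find_client_links are made up for illustration, and the sample HTML is the row from the question.

```python
from html.parser import HTMLParser


class ClientLinkParser(HTMLParser):
    """Collect href values of <a> tags that match a prefix and a suffix."""

    def __init__(self, prefix="Client-", suffix=".txt"):
        super().__init__()
        self.prefix = prefix
        self.suffix = suffix
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.startswith(self.prefix) and href.endswith(self.suffix):
            self.links.append(href)


def find_client_links(html, prefix="Client-", suffix=".txt"):
    """Return all matching hrefs in document order (hypothetical helper)."""
    parser = ClientLinkParser(prefix, suffix)
    parser.feed(html)
    parser.close()
    return parser.links


sample = """
<tr>
  <td valign="top"><img src="/icons/unknown.gif" alt="[   ]"></td>
  <td><a href="Client-8.txt">new_Client-8.txt</a></td>
  <td align="right">27-Jun-2012 18:02  </td>
</tr>
"""

print(find_client_links(sample))  # prints ['Client-8.txt']
```

The same prefix and suffix checks carry over unchanged if you iterate over the <a> elements with BeautifulSoup instead; only the parsing layer differs.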

Upvotes: 4
