Reputation: 2027
I need a regex that returns the text between double quotes that starts with a specified text block and ends with a specific file extension (say .txt). I'm using urllib2 to get the HTML of the page (the HTML is quite simple).
Basically, if I have something like
<tr>
<td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td>
<td><a href="Client-8.txt">new_Client-8.txt</a></td>
<td align="right">27-Jun-2012 18:02 </td>
</tr>
it should just return
Client-8.txt
The returned value is the text contained within the double quotes. I know that the file name starts with "Client-" and that the file extension is ".txt".
I'm playing around with re.search(regex, string), where the string I pass in is the HTML of the page. But I stink at regular expressions.
Thanks!
Upvotes: 4
Views: 2713
Reputation: 251146
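from bs4 import BeautifulSoup

soup = BeautifulSoup('<tr><td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td><td><a href="Client-8.txt">new_Client-8.txt</a></td><td align="right">27-Jun-2012 18:02 </td>', 'html.parser')
# Print the href of every <a> element whose link ends with .txt
for a in soup.find_all('a'):
    href = a.get('href', '')  # anchors without an href are skipped
    if href.endswith('.txt'):
        print(href)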
Upvotes: 1
Reputation: 122516
You should not use regular expressions for this task. It's far easier to write a script with BeautifulSoup to process the HTML and to find the element(s) you need.
In your case, you should search for all <a> elements whose href attribute starts with Client- and ends with .txt. That will give you a list of all files.
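The filter described above can be sketched as follows. To keep the example free of third-party dependencies, this version swaps BeautifulSoup for the standard library's html.parser module (Python 3); the class name ClientLinkParser, the helper find_client_files, and the sample string are hypothetical names chosen for illustration. With BeautifulSoup, the same startswith/endswith condition would simply be applied to each <a> element's href.

```python
from html.parser import HTMLParser


class ClientLinkParser(HTMLParser):
    """Collect href values of <a> tags that match a prefix and a suffix."""

    def __init__(self, prefix="Client-", suffix=".txt"):
        super().__init__()
        self.prefix = prefix
        self.suffix = suffix
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; a value may be None
        href = dict(attrs).get("href") or ""
        if href.startswith(self.prefix) and href.endswith(self.suffix):
            self.links.append(href)


def find_client_files(html):
    """Return every matching href found in the given HTML string."""
    parser = ClientLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


# Sample markup taken from the question
sample = (
    '<tr><td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td>'
    '<td><a href="Client-8.txt">new_Client-8.txt</a></td>'
    '<td align="right">27-Jun-2012 18:02 </td></tr>'
)
print(find_client_files(sample))  # ['Client-8.txt']
```

Because the check runs on the parsed href attribute rather than on the raw page text, the link text new_Client-8.txt is never considered, which is the kind of false match a loosely written regex could pick up.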
Upvotes: 4