Reputation: 243
I'm using Python and BeautifulSoup to scrape a web page for a small project of mine. The webpage has multiple entries, each separated by a table row in HTML. My code partially works however a lot of the output is blank and it won't fetch all of the results from the web page or even gather them into the same line.
<html>
<head>
<title>Sample Website</title>
</head>
<body>
<table>
<td class=channel>Artist</td><td class=channel>Title</td><td class=channel>Date</td><td class=channel>Time</td></tr>
<tr><td>35</td><td>Lorem Ipsum</td><td><a href="#" onClick="searchDB('LoremIpsum','FooWorld')">FooWorld</a></td><td>12/10/2014</td><td>2:53:17 PM</td></tr>
</table>
</body>
</html>
I want to only extract the values from the onclick action 'searchDB', so for example 'LoremIpsum' and 'FooWorld' are the only two results that I want.
Here is the code that I've written. So far, it properly pulls some of the write values, but sometimes the values are empty.
response = urllib2.urlopen(url)
html = response.read()
soup = bs4.BeautifulSoup(html)
properties = soup.findAll('a', onclick=True)
for eachproperty in properties:
print re.findall("'([a-zA-Z0-9]*)'", eachproperty['onclick'])
What am I doing wrong?
Upvotes: 5
Views: 17289
Reputation: 361
Try this
or row in rows[1:]:
cols = row.findAll('td')
link = cols[1].find('a').get('onclick')
Upvotes: 3
Reputation: 19763
try like this:
>>> import re
>>> for x in soup.find_all('a'): # will give you all a tag
... try:
... if re.match('searchDB',x['onclick']): # if onclick attribute exist, it will match for searchDB, if success will print
... print x['onclick'] # here you can do your stuff instead of print
... except:pass
...
searchDB('LoremIpsum','FooWorld')
instead of print you can save it to some variable like
>>> k = x['onclick']
>>> re.findall("'(\w+)'",k)
['LoremIpsum', 'FooWorld']
\w
is equivalent to [a-zA-Z0-9]
Upvotes: 8