Reputation: 335
I am trying to scrape a captcha code from an HTML source that has the code in the following format:
<div id="Custom"><!-- test: vdfnhu --></div>
The captcha code changes with each refresh. My intent is to capture the captcha and its validation code in order to post them to a form.
My code so far is:
import requests
import urlparse
import lxml.html
import sys
from bs4 import BeautifulSoup
print "Enter the URL",
url = raw_input()
r = requests.get(url)
c = r.content
soup = BeautifulSoup(c)
div = soup.find('div' , id ='Custom')
comment = next(div.children)
test = comment.partition(':')[-1].strip()
print test
Upvotes: 0
Views: 2175
Reputation: 365717
As the documentation explains, BeautifulSoup has NavigableString and Comment objects, just like Tag objects, and they can all be children, siblings, etc. The section "Comments and other special strings" has more details.
So, you want to find the div 'Custom':
div = soup.find('div', id='Custom')
And then you want to find the Comment child:
comment = next(child for child in div.children if isinstance(child, bs4.Comment))
Although if the format is as fixed and invariable as you're presenting it, you may want to simplify that to just next(div.children). On the other hand, if it's more variable, you may want to iterate over all Comment nodes, not just grab the first.
And, since a Comment is basically just a string (as in, it supports all str methods):
test = comment.partition(':')[-1].strip()
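One thing to be aware of: `partition` never raises, so if the comment has no colon, the `[-1]` element is simply an empty string. A small sketch of a stricter variant, using plain strings in place of Comment objects (the helper name `extract_value` is hypothetical):

```python
def extract_value(comment_text):
    """Return the text after the first ':' or None if there is no ':'."""
    before, sep, after = comment_text.partition(":")
    if not sep:
        # No separator found: partition returned (text, '', '').
        return None
    return after.strip()


print(extract_value(" test: vdfnhu "))
print(extract_value(" no separator here "))
```

Because only the first colon is used as the separator, a value that itself contains a colon is returned intact.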
Putting it together:
>>> import bs4
>>> html = '''<html><head></head>
... <body><div id="Custom"><!-- test: vdfnhu --></div>\n</body></html>'''
>>> soup = bs4.BeautifulSoup(html, 'html.parser')
>>> div = soup.find('div', id='Custom')
>>> comment = next(div.children)
>>> test = comment.partition(':')[-1].strip()
>>> test
'vdfnhu'
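If you want this as a reusable piece, the same steps could be wrapped in a function along these lines. This is only a sketch: the name `extract_captcha`, the `div_id` parameter, and the choice to return None when the div or comment is missing are my assumptions, and fetching the page (e.g. with requests) is left out.

```python
from bs4 import BeautifulSoup, Comment


def extract_captcha(html, div_id="Custom"):
    """Return the value stored in the first comment of the given div, or None."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", id=div_id)
    if div is None:
        return None
    # First direct child that is a Comment; None if the div has no comment.
    comment = next((c for c in div.children if isinstance(c, Comment)), None)
    if comment is None:
        return None
    return comment.partition(":")[-1].strip()


page = '<html><body><div id="Custom"><!-- test: vdfnhu --></div></body></html>'
print(extract_captcha(page))
```

Since the captcha changes on every refresh, you would presumably need to parse the same response (and keep the same session) that you later post the form from.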
Upvotes: 2