Phi

Reputation: 335

Scraping hidden values from HTML with Python Using Requests and bs4 Lib

I am trying to scrape a Captcha code from an html source that has the code in the following format.

<div id="Custom"><!-- test: vdfnhu --></div>

The captcha code changes with each refresh. My intent is to capture the captcha and its validation code in order to post them to a form.

My code so far is:

import requests
import urlparse
import lxml.html
import sys
from bs4 import BeautifulSoup

print "Enter the URL",
url = raw_input()
r = requests.get(url)
c = r.content
soup = BeautifulSoup(c)
div = soup.find('div' , id ='Custom')
comment = next(div.children)
test = comment.partition(':')[-1].strip()
print test

Upvotes: 0

Views: 2175

Answers (1)

abarnert

Reputation: 365717

As the documentation explains, BeautifulSoup has NavigableString and Comment objects alongside Tag objects, and they can all be children, siblings, etc. The documentation section "Comments and other special strings" has more details.

So, you want to find the div 'Custom':

div = soup.find('div', id='Custom')

And then you want to find the Comment child:

comment = next(child for child in div.children if isinstance(child, bs4.Comment))

Although if the format is as fixed and invariable as you're presenting it, you may want to simplify that to just next(div.children). On the other hand, if it's more variable, you may want to iterate over all Comment nodes, not just grab the first.
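For the more variable case, here is a minimal sketch of iterating over all the Comment children instead of grabbing the first one. The second comment in the markup and the `values` dictionary are made up for illustration; I'm also assuming every comment you care about has a `key: value` shape.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical markup: the div from the question with an extra comment added.
html = '<div id="Custom"><!-- note: ignore me --><!-- test: vdfnhu --></div>'

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", id="Custom")

# Collect every Comment that is a direct child of the div.
comments = [child for child in div.children if isinstance(child, Comment)]

# Turn each "key: value" comment into a dictionary entry.
values = {}
for comment in comments:
    key, sep, value = comment.partition(":")
    if sep:  # skip comments that contain no colon
        values[key.strip()] = value.strip()

print(values["test"])
```

Looking the value up by key means the result does not depend on the order in which the comments appear.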

And, since a Comment is basically just a string (as in, it supports all str methods):

test = comment.partition(':')[-1].strip()
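To see what that line does, here is a small sketch using a plain str in place of the Comment. The `extract_value` helper is a hypothetical name, not part of bs4.

```python
def extract_value(text):
    """Return whatever follows the first ':' in text, stripped of whitespace."""
    # partition always returns a 3-tuple: (before, separator, after).
    before, sep, after = text.partition(":")
    return after.strip()


# The text of the comment from the question, surrounding spaces included.
print(extract_value(" test: vdfnhu "))

# With no ':' in the input, partition returns (text, '', ''),
# so the result is an empty string rather than an exception.
print(repr(extract_value("no colon here")))
```

So if the page ever serves a comment without a colon, you get an empty string back, which is worth checking for before posting the form.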

Putting it together:

>>> import bs4
>>> html = '''<html><head></head>
...           <body><div id="Custom"><!-- test: vdfnhu --></div>\n</body></html>'''
>>> soup = bs4.BeautifulSoup(html, 'html.parser')
>>> div = soup.find('div', id='Custom')
>>> comment = next(div.children)
>>> test = comment.partition(':')[-1].strip()
>>> test
'vdfnhu'

Upvotes: 2
