Reputation: 1486
I have an HTML file like the following:
<form action="/2811457/follow?gsid=3_5bce9b871484d3af90c89f37" method="post">
<div>
<a href="/2811457/follow?page=2&gsid=3_5bce9b871484d3af90c89f37">next_page</a>
<input name="mp" type="hidden" value="3" />
<input type="text" name="page" size="2" style='-wap-input-format: "*N"' />
<input type="submit" value="jump" /> 1/3
</div>
</form>
How can I extract the "1/3" from the file?
This is only part of the HTML; I trimmed it to keep the question clear.
I'm new to BeautifulSoup. I have read the documentation, but I'm still confused about how to get at this text.
One option is a regular expression on the raw text:
total_urls_num = re.findall('\d+/\d+',response)
Working code:
from BeautifulSoup import BeautifulSoup
import re

with open("html.txt", "r") as f:
    response = f.read()
print response
soup = BeautifulSoup(response)
# Works; the "?" is escaped because it is a regex metacharacter.
delete_urls = soup.findAll('a', href=re.compile(r'follow\?page'))
print delete_urls
#total_urls_num = re.findall(r'\d+/\d+', response)
# Returns the <input> tag itself, not the "1/3" text that follows it.
total_urls_num = soup.find('input', type='submit')
print total_urls_num
Upvotes: 0
Views: 404
Reputation: 1931
Read the documentation. This does not work:
total_urls_num = soup.find('input',style='submit') #can't work
Use type instead of style:
>>> temp = soup.find('input', type='submit').next
>>> temp
u' 1/3\n'
>>> re.findall(r'\d+/\d+', temp)
[u'1/3']
>>> re.findall(r'\d+/\d+', temp)[0]
u'1/3'
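If the goal is the page numbers themselves rather than the string, the match can be split into two integers. This is only a sketch: the helper name parse_page_counter is made up for illustration, and it assumes the text always has the "current/total" shape shown in the question.

```python
import re

def parse_page_counter(text):
    """Return (current, total) parsed from text such as ' 1/3', or None."""
    # Capture the digits on each side of the slash separately.
    match = re.search(r'(\d+)/(\d+)', text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

# The string below is the value of .next from the session above.
current_page, total_pages = parse_page_counter(u' 1/3\n')
```

Returning None when nothing matches lets the caller decide how to handle a page without a counter.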
Upvotes: 0
Reputation: 352979
I think the problem is that the text you're searching for isn't an attribute of a tag; it is a text node that comes after the tag. You can access it using .next:
In [144]: soup.find("input", type="submit")
Out[144]: <input type="submit" value="jump" />
In [145]: soup.find("input", type="submit").next
Out[145]: u' 1/3\n'
and you can then get the 1/3 from that however you like:
In [146]: re.findall('\d+/\d+', _)
Out[146]: [u'1/3']
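or simply search for the text node directly (in BeautifulSoup 3 the tag name and attribute arguments are ignored once text is given), something like: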
In [153]: soup.findAll("input", type="submit", text=re.compile("\d+/\d+"))
Out[153]: [u' 1/3\n']
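For comparison, the same "text right after the submit input" idea can be sketched without BeautifulSoup, using only the standard library. This is not the approach from the answer above, just a stand-in for .next; it assumes Python 3 (where the module is html.parser), and the class name TextAfterSubmit is hypothetical.

```python
from html.parser import HTMLParser

class TextAfterSubmit(HTMLParser):
    """Remember the first text node that directly follows <input type="submit">."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.capture_next = False  # True only right after the submit input
        self.found = None

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <input ... />.
        is_submit = tag == "input" and dict(attrs).get("type") == "submit"
        self.capture_next = is_submit and self.found is None

    def handle_data(self, data):
        if self.capture_next:
            self.found = data
            self.capture_next = False

# The HTML fragment from the question.
html = '''<form action="/2811457/follow?gsid=3_5bce9b871484d3af90c89f37" method="post">
<div>
<a href="/2811457/follow?page=2&gsid=3_5bce9b871484d3af90c89f37">next_page</a>
<input name="mp" type="hidden" value="3" />
<input type="text" name="page" size="2" style='-wap-input-format: "*N"' />
<input type="submit" value="jump" /> 1/3
</div>
</form>'''

parser = TextAfterSubmit()
parser.feed(html)
parser.close()
page_counter = parser.found.strip()
```

Here page_counter holds the stripped text node, which can then be passed to a regular expression exactly as in the session above.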
Upvotes: 1