Reputation: 1
I'm trying to write code that scrapes numbers from HTML by finding the span tags and the numbers inside them.
I keep getting the error "expected string or buffer".
I've read some solutions while searching through different questions, but when I try ''.join(some_list)
I get another error:
"sequence item 0: expected string, Tag found"
I searched for that one too and saw suggestions such as using .get instead of re.findall, but the error keeps appearing.
The code:
import urllib
from BeautifulSoup import *
url = raw_input('Enter the URL:')
stri = urllib.urlopen(url).read()
soup = BeautifulSoup(stri)
# retrieve all of the span tags
spans = ''.join(soup('span'))
numlist = list()
for tag in spans:
    num = int(re.findall('[0-9]+', tag))
    numlist.append(num)
print(numlist)
I saw several solutions for these types of errors, but I can't seem to solve it.
What am I missing?
I added tag.text, and the error changed to a different one. Now I'm getting: "[Errno 11004] getaddrinfo failed"
I looked at different posts but couldn't solve it, so I ran the code line by line to see where the problem is, and found that it appears when I run the fourth line of the original code:
html = urllib.urlopen(url).read()
Please help?
Upvotes: 0
Views: 992
Reputation: 36033
tag is a Tag object, which contains lots of information, not just a string. If you want the text inside the tag without any markup, use tag.text, e.g.:
spans = ''.join(tag.text for tag in soup('span'))
# now `for tag in spans:` makes no sense because spans is a string
or
spans = soup('span')
for tag in spans:
    num = len(re.findall('[0-9]+', tag.text)) # note len, not int
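If you want the numbers themselves rather than a count, keep in mind that re.findall returns a list of strings, so int() has to be applied to each match and not to the list. A minimal sketch in Python 3, assuming the span texts have already been pulled out as plain strings; numbers_in_texts and the sample strings are made up for illustration:

```python
import re


def numbers_in_texts(texts):
    """Return every run of digits found in the given strings, as ints."""
    numbers = []
    for text in texts:
        # re.findall needs a string and returns a list of strings,
        # so int() is applied to each match, not to the list itself.
        for match in re.findall('[0-9]+', text):
            numbers.append(int(match))
    return numbers


# Stand-in for the text of each span tag; these sample strings are invented.
span_texts = ['97', 'score: 12 of 30', 'no digits here']
print(numbers_in_texts(span_texts))
```

With the soup from the question, the input would be something like (tag.text for tag in soup('span')).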
Upvotes: 1