Dor Zluf

Reputation: 1

Can't solve: TypeError: expected string or buffer

I'm trying to write code to scrape numbers from HTML by finding the span tags and the numbers within them.

I keep getting the error "expected string or buffer".

I've read some solutions while searching through different questions, but when I try ''.join(some_list) I get another error:

"sequence item 0: expected string, Tag found"

I searched for that one and saw some solutions, such as using .get instead of re.findall, but the error keeps appearing.

The code:

import urllib
from BeautifulSoup import *
url = raw_input('Enter the URL:')
stri = urllib.urlopen(url).read()
soup = BeautifulSoup(stri)

# retrieve all of the span tags

spans = ''.join(soup('span'))
numlist = list()
for tag in spans:
    num = int(re.findall('[0-9]+', tag))
    numlist.append(num)
print(numlist)

I saw several solutions for these types of errors, but can't seem to solve it.

What am I missing?

I added tag.text, and the error changed to another one. Now I'm getting: "[Errno 11004] getaddrinfo failed"

I looked at different posts but couldn't solve it, so I ran the code line by line to see where the problem is, and I found that it appears when I run the fourth line of the original code:

html = urllib.urlopen(url).read()

Please help?

Upvotes: 0

Views: 992

Answers (1)

Alex Hall

Reputation: 36033

tag is a Tag object, which contains lots of information, not just a string. If you want the text inside the tag without any markup, use tag.text, e.g.:

spans = ''.join(tag.text for tag in soup('span'))
# now `for tag in spans:` makes no sense because spans is a string

or

spans = soup('span')
for tag in spans:
    num = len(re.findall('[0-9]+', tag.text))  # note len, not int
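
If the goal is the numbers themselves rather than a count of matches, one possible approach is sketched below. re.findall returns a list of strings, so each match is converted with int on its own instead of passing the whole list to int. The helper name extract_numbers and the sample strings are hypothetical; the strings stand in for tag.text values so the example runs without BeautifulSoup.

```python
import re

def extract_numbers(texts):
    """Return every run of digits found in the given strings, as ints."""
    numbers = []
    for text in texts:
        # re.findall needs a string and returns a list of strings,
        # so convert each match individually
        for match in re.findall('[0-9]+', text):
            numbers.append(int(match))
    return numbers

# Hypothetical sample data standing in for tag.text values
sample_texts = ['97', 'score: 12 of 40', 'no digits here']
print(extract_numbers(sample_texts))
```

With the soup object from the question, the call would presumably look like extract_numbers(tag.text for tag in soup('span')).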

Upvotes: 1
