Reputation: 21
I've been working on a program that goes through links I already have saved in a text file (mostly summer opportunities, camps, etc.) and scrapes each page to see whether keywords like "scholarship" or "financial aid" appear. However, when I run it, it gives me the error in the title above.
This question has been asked a few times, but the cause seems to differ from person to person. So I gather there's probably a Unicode issue involved, but I have no idea where or why it would occur.
This is the code:
import BeautifulSoup
import requests
import nltk

file_from = open("links.txt", "r")
list_of_urls = file_from.read().splitlines()
aid_words = ["financial", "aid", "merit", "scholarship"]
count = 0
fin_aid = []

while count <= 10:
    for url in list_of_urls:
        clean = 1
        result = "nothing found"
        source = requests.get(url)
        plain_text = source.text
        soup = BeautifulSoup.BeautifulSoup(plain_text)
        print (str(url).upper())
        for links in soup.findAll('p', text=True):
            tokenized_text = nltk.word_tokenize(links)
            for word in tokenized_text:
                if word not in aid_words:
                    print ("not it " + str(clean))
                    clean += 1
                    pass
                else:
                    result = str(word)
                    print (result)
                    fin_aid.append(url)
                    break
        count += 1
        the_golden_book = {"link: ": str(url), "word found: ": str(result)}
        fin_aid.append(the_golden_book)

file_to = open("links_with_aid.txt", "w")
file_to.write(str(fin_aid))
file_to.close()
print ("scrape finished")
print (str(fin_aid))
Basically, I wanted to take all the links from links.txt, visit the first ten (as a test), and search each page for the four words in the list "aid_words". If none of the words has been found yet, it prints "not it" plus the number of words checked so far; if one is found, it prints the detected word (so that I can visit the link later and check whether it's a false alarm).
When I run this through the Command Prompt, this is what it shows right before the error message.
Traceback (most recent call last):
  File "finaid.py", line 20, in <module>
    soup = BeautifulSoup.BeautifulSoup(plain_text.encode("utf-8"))
  File "C:\Python27\lib\site-packages\BeautifulSoup.py", line 1522, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "C:\Python27\lib\site-packages\BeautifulSoup.py", line 1147, in __init__
    self._feed(isHTML=isHTML)
  File "C:\Python27\lib\site-packages\BeautifulSoup.py", line 1189, in _feed
    SGMLParser.feed(self, markup)
  File "C:\Python27\lib\sgmllib.py", line 104, in feed
    self.goahead(0)
  File "C:\Python27\lib\sgmllib.py", line 143, in goahead
    k = self.parse_endtag(i)
  File "C:\Python27\lib\sgmllib.py", line 320, in parse_endtag
    self.finish_endtag(tag)
  File "C:\Python27\lib\sgmllib.py", line 358, in finish_endtag
    method = getattr(self, 'end_' + tag)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 7-9: ordinal not in range(128)
I'm running this on Python 2.7.10 on Windows 8.1. Thanks for any help you can provide! As far as I can tell, it shouldn't be anything in "links.txt", which is literally just links that a colleague crawled and saved earlier.
Upvotes: 0
Views: 1164
Reputation: 61
I do quite a bit of website scraping, and I can tell you this: please try to write your scraper code using Python 3. As soon as I updated my scrapers to use Python 3, a lot of my encoding issues went away. If you do move to Python 3 and want to keep the contents of your output file intact, be sure to open it with 'a' (append) instead of 'w' (which truncates the file).
Let me know if you have specific questions about making that transition.
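For example, the write step at the end of your script might look like this in Python 3 (a sketch only; the filename is the one from your script, and the dictionary is stand-in data):

```python
# Python 3 sketch: open with 'a' to append instead of overwrite, and be
# explicit about the encoding so Windows' default codec can't surprise you.
fin_aid = [{"link": "http://example.com", "word found": "scholarship"}]  # stand-in data

with open("links_with_aid.txt", "a", encoding="utf-8") as file_to:
    for entry in fin_aid:
        file_to.write(str(entry) + "\n")
```

The `with` block also closes the file for you, so the separate `file_to.close()` call goes away.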
As for the "expected string or buffer" error, that usually shows up for me when I pass in an object instead of a string. To check whether that's happening, add a print statement, like so:
for links in soup.findAll('p', text=True):
    print(links)
    tokenized_text = nltk.word_tokenize(links)
If it doesn't print text to your terminal (or wherever you run the script from), then you are passing in an object where it expects a string.
Pseudo-code to fix it might look like:
for links in soup.findAll('p', text=True):
    links = links.get_text()  # convert the element to a plain string first
    tokenized_text = nltk.word_tokenize(links)
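If you want to see the string-vs-object distinction without any third-party libraries, here is a standard-library-only sketch of the same idea: it uses `html.parser` in place of BeautifulSoup and a naive whitespace split in place of `nltk.word_tokenize`, but the key step is the same — pull plain text out of the `<p>` tags before scanning it for keywords.

```python
from html.parser import HTMLParser

AID_WORDS = {"financial", "aid", "merit", "scholarship"}

class ParagraphScanner(HTMLParser):
    """Collects text inside <p> tags and records any keyword hits."""
    def __init__(self):
        super().__init__()
        self.in_p = False
        self.hits = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            # data is already a plain str here, so it's safe to tokenize
            for word in data.lower().split():
                word = word.strip(".,!?")
                if word in AID_WORDS:
                    self.hits.append(word)

scanner = ParagraphScanner()
scanner.feed("<p>Full scholarship and financial aid available.</p>")
print(scanner.hits)  # ['scholarship', 'financial', 'aid']
```

The same pattern (get the text first, then tokenize) carries straight over to BeautifulSoup and nltk.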
Upvotes: 1