I have only basic knowledge of Python and am working on a tweet-analysis API. I found an NLP tutorial that uses t-SNE and word2vec. My setup is described in an earlier question I posted on Stack Overflow.
I followed the tutorial step-by-step, but upon running the code, I encountered an error:
Non-ASCII character '\xc3' in file
What causes this error? The code snippet is below.
def process_raw_data(input_file):
    valid = u"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ#@.:/ äöåÄÖÅ"
    url_match = "(https?:\/\/[0-9a-zA-Z\-\_]+\.[\-\_0-9a-zA-Z]+\.?[0-9a-zA-Z\-\_]*\/?.*)"
    name_match = "\@[\_0-9a-zA-Z]+\:?"
    lines = []
    print("Loading raw data from: " + input_file)
    if os.path.exists(input_file):
        with io.open(input_file, 'r', encoding="utf-8") as f:
            lines = f.readlines()
    num_lines = len(lines)
    ret = []
    for count, text in enumerate(lines):
        if count % 50 == 0:
            print_progress(count, num_lines)
        text = re.sub(url_match, u"", text)
        text = re.sub(name_match, u"", text)
        text = re.sub("\&\;?", u"", text)
        text = re.sub("[\:\.]{1,}$", u"", text)
        text = re.sub("^RT\:?", u"", text)
        text = u''.join(x for x in text if x in valid)
        text = text.strip()
        if len(text.split()) > 5:
            if text not in ret:
                ret.append(text)
    return ret
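A likely cause, assuming the script is run with Python 2: that interpreter reads source files as ASCII unless told otherwise, and the `valid` string contains the literals `äöåÄÖÅ`. When the file is saved as UTF-8, the first byte of `ä` is `\xc3`, which matches the error message. The usual remedy is a PEP 263 encoding declaration on the first or second line of the file. The sketch below is not the tutorial's code; `VALID` and `keep_valid` are hypothetical names used only to show the declaration together with the same character filter.

```python
# -*- coding: utf-8 -*-
# The line above is the PEP 263 encoding declaration. It must appear on
# line 1 or 2 of the source file, and the file itself must be saved as UTF-8.
# Python 3 assumes UTF-8 source by default, so there it is harmless but optional.

# Hypothetical stand-in for the `valid` whitelist from the question.
VALID = u"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ#@.:/ äöåÄÖÅ"


def keep_valid(text):
    """Return `text` with every character not listed in VALID removed."""
    return u"".join(ch for ch in text if ch in VALID)


# Example input containing characters inside and outside the whitelist.
cleaned = keep_valid(u"Hyvää päivää! ☺")
```

Alternatively, under the same Python 2 assumption, running the script with a Python 3 interpreter should avoid the error without any declaration, provided the rest of the tutorial code is Python 3 compatible.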