Aisyah

Reputation: 51

Python TensorFlow text analysis: Non-ASCII character '\xc3' in file

I have only basic knowledge of Python and am working on a tweet-analysis API. I found an NLP tutorial that uses t-SNE and word2vec. For reference, I described my system in an earlier Stack Overflow post.

I followed the tutorial step-by-step, but upon running the code, I encountered an error:

Non-ASCII character '\xc3' in file

What is the reason for this? The code snippet is below.

def process_raw_data(input_file):
  valid = u"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ#@.:/ äöåÄÖÅ"
  url_match = "(https?:\/\/[0-9a-zA-Z\-\_]+\.[\-\_0-9a-zA-Z]+\.?[0-9a-zA-Z\-\_]*\/?.*)"
  name_match = "\@[\_0-9a-zA-Z]+\:?"
  lines = []
  print("Loading raw data from: " + input_file)
  if os.path.exists(input_file):
    with io.open(input_file, 'r', encoding="utf-8") as f:
      lines = f.readlines()
  num_lines = len(lines)
  ret = []
  for count, text in enumerate(lines):
    if count % 50 == 0:
      print_progress(count, num_lines)
    text = re.sub(url_match, u"", text)
    text = re.sub(name_match, u"", text)
    text = re.sub("\&amp\;?", u"", text)
    text = re.sub("[\:\.]{1,}$", u"", text)
    text = re.sub("^RT\:?", u"", text)
    text = u''.join(x for x in text if x in valid)
    text = text.strip()
    if len(text.split()) > 5:
      if text not in ret:
        ret.append(text)
  return ret

Upvotes: 0

Views: 230

Answers (1)

Andrey

Reputation: 6367

Your input_file probably has a different encoding (not UTF-8).
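One more thing worth checking, assuming the full message is the Python 2 SyntaxError "Non-ASCII character '\xc3' in file ... but no encoding declared": that error refers to the .py source file, not to the data being read. The string assigned to valid contains äöåÄÖÅ, and in UTF-8 each of those starts with the byte 0xC3. Python 2 rejects such bytes unless the first or second line of the script declares an encoding (PEP 263). A wrongly encoded input_file would more likely surface as a UnicodeDecodeError when reading. Below is a minimal sketch, not taken from the tutorial; declared_source_encoding is a hypothetical helper name used only for illustration.

```python
# -*- coding: utf-8 -*-
# The line above is the PEP 263 declaration; it must be on line 1 or 2 of the script.
import re

# Pattern given in PEP 263 for recognising an encoding declaration comment.
CODING_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_source_encoding(source_bytes):
    """Hypothetical helper: return the encoding declared in the first two
    lines of a Python source (given as bytes), or None if there is none."""
    for line in source_bytes.splitlines()[:2]:
        match = CODING_RE.match(line.decode("ascii", "replace"))
        if match:
            return match.group(1)
    return None


# The first UTF-8 byte of "ä" is 0xC3, the byte named in the error message.
first_byte = u"ä".encode("utf-8")[:1]
```

If the script in the question has no such declaration, adding "# -*- coding: utf-8 -*-" as its first line (and saving the file as UTF-8) should be enough under Python 2. Python 3 assumes UTF-8 source by default, so the declaration is not needed there.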

Upvotes: 1
