ytomo
Reputation: 819

Removing Unicode \uxxxx in String from JSON Using Regex

I have a JSON file called stream_key.json that stores text data:

{"text":"RT @WBali: Ideas for easter? Digging in with Seminyak\u2019s best beachfront view? \nRSVP: b&[email protected] https:\/\/t.co\/fRoAanOkyC"}

As we can see, the text in the JSON file contains the Unicode escape \u2019. I want to remove it using a regex in Python 2.7. This is my code so far (eraseunicode.py):

import re
import json

def removeunicode(text):
    text = re.sub(r'\\[u]\S\S\S\S[s]', "", text)
    text = re.sub(r'\\[u]\S\S\S\S', "", text)
    return text

with open('stream_key.json', 'r') as f:
    for line in f:
        tweet = json.loads(line)
        text = tweet['text']
        text = removeunicode(text)
        print(text)

The result I get is:

Traceback (most recent call last):
  File "eraseunicode.py", line 17, in <module>
    print(text)
  File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2019' in position 53: character maps to <undefined>

Since I already call a function to remove the \u2019 before printing, I don't understand why I still get this error. Please help. Thanks.

Upvotes: 2

Views: 1526

Answers (1)

Jean-François Fabre

Reputation: 140168

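While the data is still in the text file, \u2019 is a literal six-character sequence (a backslash, "u", and four hex digits). Once json.loads has parsed the line, that sequence becomes the single Unicode character U+2019, so your regex, which looks for a literal backslash followed by "u", no longer finds anything to replace. The character survives and print then fails to encode it for the cp437 console.

So you have to apply your regex before loading the line into json, and then it works: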

tweet = json.loads(removeunicode(line))
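
To make the difference concrete, here is a minimal sketch that applies the question's removeunicode function both before and after parsing. The sample line is a shortened version of the one in the question, and the variable names (raw_line, decoded, after_decode, before_decode) are only illustrative:

```python
import json
import re

def removeunicode(text):
    # Same two patterns as in the question: they look for a literal
    # backslash, a "u", and four non-space characters.
    text = re.sub(r'\\[u]\S\S\S\S[s]', "", text)
    text = re.sub(r'\\[u]\S\S\S\S', "", text)
    return text

# Shortened sample line; the raw string keeps \u2019 as six literal characters,
# just as it appears in the file.
raw_line = r'{"text":"Seminyak\u2019s best beachfront view?"}'

# Regex applied AFTER parsing: the escape is already one character, nothing matches.
decoded = json.loads(raw_line)['text']
after_decode = removeunicode(decoded)

# Regex applied BEFORE parsing: the literal escape is still there and is removed.
before_decode = json.loads(removeunicode(raw_line))['text']

print(after_decode == decoded)  # the decoded text is left unchanged
print(before_decode)
```

Note that the first pattern also swallows the "s" that follows the escape, so the cleaned text reads "Seminyak best", not "Seminyaks best".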

Note that this runs the regex over the entire raw line, not just the text field. You can also remove non-ASCII characters from the decoded text by checking each character's code point, like this (note that it is not strictly equivalent):

text = "".join([x for x in tweet['text'] if ord(x) < 128])
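
As a rough illustration of why the two approaches are not strictly equivalent, the sketch below runs the character filter on a shortened version of the question's line. The names raw_line, ascii_only and via_codec are illustrative, and the encode/decode variant is an alternative I am adding for comparison, not something from the original code:

```python
import json

# Shortened sample line; the raw string keeps \u2019 as a literal escape.
raw_line = r'{"text":"Seminyak\u2019s best beachfront view?"}'
tweet = json.loads(raw_line)

# Keep only characters whose code point is in the ASCII range.
ascii_only = "".join([x for x in tweet['text'] if ord(x) < 128])

# Alternative (an addition for comparison): let the ascii codec drop
# whatever it cannot encode, then decode back to text.
via_codec = tweet['text'].encode('ascii', 'ignore').decode('ascii')

print(ascii_only)
print(ascii_only == via_codec)
```

Here only the single character U+2019 is dropped, so the result keeps the trailing "s" ("Seminyaks best ..."), whereas the question's regex applied to the raw line also removes that "s".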

Upvotes: 1
