Reputation: 819
I have a JSON file called stream_key.json that stores text data:
{"text":"RT @WBali: Ideas for easter? Digging in with Seminyak\u2019s best beachfront view? \nRSVP: b&[email protected] https:\/\/t.co\/fRoAanOkyC"}
As we can see, the text in the JSON file contains the unicode escape \u2019. I want to remove this code using a regex in Python 2.7. This is my code so far (eraseunicode.py):
import re
import json

def removeunicode(text):
    text = re.sub(r'\\[u]\S\S\S\S[s]', "", text)
    text = re.sub(r'\\[u]\S\S\S\S', "", text)
    return text

with open('stream_key.json', 'r') as f:
    for line in f:
        tweet = json.loads(line)
        text = tweet['text']
        text = removeunicode(text)
        print(text)
The result I get is:
Traceback (most recent call last):
  File "eraseunicode.py", line 17, in <module>
    print(text)
  File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2019' in position 53: character maps to <undefined>
Since I already call a function to remove the \u2019 before printing, I don't understand why I still get this error. Please help. Thanks.
Upvotes: 2
Views: 1526
Reputation: 140168
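When the data is in the text file, \u2019 is a literal six-character escape sequence (a backslash, the letter u, and four hex digits). But once json.loads has decoded the line, it becomes the single unicode character u'\u2019', which contains no backslash, so your regex no longer matches anything and the character survives until print.
So you have to apply your regex before loading into json, and then it works: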
tweet = json.loads(removeunicode(line))
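To make the difference concrete, here is a small self-contained sketch. It reuses the removeunicode function from the question; the shortened raw_line sample is my own assumption, standing in for one line of stream_key.json.

```python
import json
import re

def removeunicode(text):
    # Same two substitutions as in the question.
    text = re.sub(r'\\[u]\S\S\S\S[s]', "", text)
    text = re.sub(r'\\[u]\S\S\S\S', "", text)
    return text

# Hypothetical sample line; the raw string keeps \u2019 as six literal characters,
# exactly as it would appear when read from the file.
raw_line = r'{"text":"Seminyak\u2019s best beachfront view"}'

# After decoding, the escape has become one character and there is no
# backslash left, so the regex leaves the text unchanged.
decoded = json.loads(raw_line)['text']
unchanged = removeunicode(decoded) == decoded

# Applied to the raw line first, the regex sees the literal escape and removes it.
cleaned = json.loads(removeunicode(raw_line))['text']
print(cleaned)
```

Here unchanged is True, and cleaned is plain ASCII, so printing it should not depend on the console code page.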
Of course, this runs the regex over the entire raw line, not just the text field. You can also remove non-ASCII characters from the decoded text by checking each character's code point, like this (note that it is not strictly equivalent):
text = "".join([x for x in tweet['text'] if ord(x)<128])
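As a rough illustration of why the two approaches are not strictly equivalent, the sketch below runs both on the same input. The sample line is an assumption made up for the comparison, and removeunicode is copied from the question.

```python
import json
import re

def removeunicode(text):
    # Regexes from the question: the first also swallows an "s" after the escape.
    text = re.sub(r'\\[u]\S\S\S\S[s]', "", text)
    text = re.sub(r'\\[u]\S\S\S\S', "", text)
    return text

# Hypothetical sample with two escapes: \u2019 followed by "s", and \u00e9.
raw_line = r'{"text":"Seminyak\u2019s caf\u00e9"}'

# Approach 1: strip the escapes from the raw line, then decode.
via_regex = json.loads(removeunicode(raw_line))['text']

# Approach 2: decode first, then keep only characters with code point below 128.
decoded = json.loads(raw_line)['text']
via_filter = "".join([x for x in decoded if ord(x) < 128])

print(via_regex)
print(via_filter)
```

The regex version drops the "s" along with the apostrophe, while the filter only drops the non-ASCII characters themselves, so the results differ by that one letter.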
Upvotes: 1