Reputation: 2291
I have strings that contain \u followed by four hexadecimal characters. They are not unicode objects but literal text in the data, which is why I want to clean the data up. I would like to replace each occurrence with whitespace.
Example text file: Hello \u2022 Created, reviewed, \u00e9executed and maintained
For example, given occurrences such as \u2022 and \u00e9, I would like to find the \u and remove it along with the four characters that follow it (2022 and 00e9 here). I'm looking for a suitable regex for this pattern.
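Example code:
import io
import json
import re
from glob import glob

files = glob('Candidate Profile Data/*')
for file_ in files:
    with io.open(file_, 'r', encoding='us-ascii') as json_file:
        # io.open with an encoding already returns decoded text
        json_data = json_file.read()
    # Replace any run of non-ASCII characters with a space
    json_data = re.sub('[^\x00-\x7F]+', ' ', json_data)
    # Replace literal backslash-n sequences with a space
    json_data = json_data.replace('\\n', ' ')
    # Replace literal \u followed by up to four hex digits with a space
    json_data = re.sub(r'\\u[0-9a-f]{,4}', ' ', json_data)
    print(json_data)
    json_data = json.loads(json_data)
    print(json_data)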
Upvotes: 2
Views: 95
Reputation: 1761
Really, we need to see an example of your code, but as a pointer, I think the regex you need is something like r'\\u[0-9a-f]{,4}'.
Here is an example of it in use:
>>> import re
>>> my_string='Hello \\u2022 Created, reviewed, \\u00e9executed and maintained'
>>> my_string
'Hello \\u2022 Created, reviewed, \\u00e9executed and maintained'
>>> re.sub(r'\\u[0-9a-f]{,4}',"",my_string)
'Hello  Created, reviewed, executed and maintained'
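Note that {,4} means "zero to four" hex digits, so the pattern above also matches a bare \u or a shorter run of digits. If only complete escapes should be removed, a stricter variant is one option. This is a minimal sketch, assuming that uppercase hex digits may also occur and that a single space is the wanted replacement; the names ESCAPE_RE and strip_unicode_escapes are made up for illustration:

```python
import re

# Literal backslash-u followed by exactly four hex digits.
# Accepting A-F as well as a-f is an assumption about the data.
ESCAPE_RE = re.compile(r'\\u[0-9a-fA-F]{4}')


def strip_unicode_escapes(text, replacement=' '):
    # Replace every complete escape sequence with the replacement string.
    return ESCAPE_RE.sub(replacement, text)


sample = 'Hello \\u2022 Created, reviewed, \\u00e9executed and maintained'
cleaned = strip_unicode_escapes(sample)

# Optional: collapse the runs of spaces left behind by the replacement.
collapsed = re.sub(r' {2,}', ' ', cleaned)
print(collapsed)
```

With {4}, a string such as '\\user' is left untouched, whereas {,4} would strip the leading \u from it.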
I would still like to see an example of your code so that we can provide a more accurate answer.
Upvotes: 2