Reputation: 310
I have multiple JSON files filled with strings, and each file can run to several hundred lines. My example below has only three lines, but on average a file contains about 200-500 of these "phrases":
{
    "version": 1,
    "data": {
        "phrases": [
            "A few words that's it.",
            "This one, has a comma in it!",
            "hyphenated-sentence example"
        ]
    }
}
I need a script to go into the file (we can call it ExampleData.json) and remove all punctuation (specifically these characters: ,.?!'-) from the file, without removing the , outside of the double quotation marks. Essentially so that this:
"A few words that's it.",
"This one, has a comma in it!",
"hyphenated-sentence example."
Becomes this:
"A few words that's it",
"This one has a comma in it",
"hyphenated sentence example"
Also note how all the punctuation gets removed except for the hyphen. That gets replaced with a space.
The closest I've gotten with Python works on a single string, and is based on someone else's answer in a different thread:
input_str = 'please, remove all the commas between quotes,"like in here, here, here!"'
quotes = False
def noCommas(string):
    quotes = False
    output = ''
    for char in string:
        if char == '"':
            quotes = True
        if quotes == False:
            output += char
        if char != ',' and quotes == True:
            output += char
    return output
print noCommas(input_str)
(Sorry, I don't know how to put code blocks in a quote)
But it only handles a single character (the comma). Adding any extra rules causes the text outside the quotes to be doubled (please becomes pplleeaassee).
One last thing: I have to do this in Python 2.7.5, which, from what I've put together searching around, makes this a bit more difficult.
I'm sorry that I'm still this new to Python and have to do something this non-trivial right away, but it wasn't really my choice.
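For reference, the doubling is what you would expect if a character satisfies more than one independent `if` and is appended once per match. That is only a guess, since the modified code isn't shown. Below is a minimal sketch of the same character-scanning idea written as a single if/elif chain, so each character is appended at most once. The names `strip_punctuation_in_quotes` and `REMOVE` are made up for illustration, and the sketch assumes phrases never contain an escaped quote (`\"`).

```python
# Characters to delete inside quotes (taken from the list in the question).
REMOVE = set(",.?!'")

def strip_punctuation_in_quotes(text):
    """Inside double quotes, delete REMOVE characters and turn '-' into ' '.

    Text outside double quotes is copied unchanged.
    Assumption: no escaped quotes (\\") occur inside a quoted phrase.
    """
    in_quotes = False
    output = []
    for char in text:
        if char == '"':
            in_quotes = not in_quotes   # toggle on every quote, opening or closing
            output.append(char)
        elif not in_quotes:
            output.append(char)         # outside quotes: keep everything
        elif char == '-':
            output.append(' ')          # inside quotes: hyphen becomes a space
        elif char not in REMOVE:
            output.append(char)         # inside quotes: keep non-punctuation
    return ''.join(output)

print(strip_punctuation_in_quotes('a, b,"c, d-e!", f'))
# prints: a, b,"c d e", f
```

Because only one branch of the chain can run per character, adding another rule is just another `elif` (or another character in `REMOVE`) and cannot append the same character twice.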
Upvotes: 1
Views: 659
Reputation: 28630
This should work.
import re
import json

with open('C:/test/data.json') as json_file:
    data = json.load(json_file)

phrases = data['data']['phrases']
for idx, phrase in enumerate(phrases):
    # Replace hyphens with spaces first, then drop every remaining
    # character that is not a word character or whitespace.
    phrase = re.sub(r'-', ' ', phrase)
    phrases[idx] = re.sub(r'[^\w\s]', '', phrase)

# Overwrites the input file with the cleaned data.
with open('C:/test/data.json', 'w') as outfile:
    json.dump(data, outfile, indent=4)
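Note that `[^\w\s]` removes every character that is not a letter, digit, underscore, or whitespace, which is broader than the list in the question. If only the listed characters should go, a narrower character class might be used instead. This is a minimal sketch; `clean_phrase` is a hypothetical helper name, not something from the question.

```python
import re

def clean_phrase(phrase):
    """Turn hyphens into spaces and delete only , . ? ! and '."""
    phrase = phrase.replace('-', ' ')
    return re.sub(r"[,.?!']", '', phrase)

phrases = [
    "This one, has a comma in it!",
    "hyphenated-sentence example",
]
cleaned = [clean_phrase(p) for p in phrases]
print(cleaned)
# prints: ['This one has a comma in it', 'hyphenated sentence example']
```

Characters that are not in the class, such as `:` or `;`, are left alone, so this only fits if the list in the question really is complete.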
Option 2:
Load the JSON as a string, then use a regex to find all substrings between double quotes. Replace/strip the punctuation in each of those substrings, then write the result back to a file:
import re
import json
import string

with open('C:/test/data.json') as json_file:
    data = json.load(json_file)

# Serialize back to a string so the regex can scan the quoted substrings.
data = json.dumps(data)
strings = re.findall(r'"([^"]*)"', data)
for each in strings:
    new_str = re.sub(r'-', ' ', each)
    # Note: strip() only trims punctuation from the two ends of the string.
    new_str = new_str.strip(string.punctuation)
    new_str = re.sub(r',', '', new_str)
    data = data.replace('"%s"' % each, '"%s"' % new_str)

with open('C:/test/data_output.json', 'w') as outfile:
    json.dump(json.loads(data), outfile, indent=4)
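One caveat with scanning the serialized text: the pattern `"([^"]*)"` can mismatch if a phrase contains an escaped double quote. A possible alternative, sketched here under the assumption that every string value in the file should be cleaned, is to walk the parsed data instead. `clean_strings` is a hypothetical name, and the character list is the one from the question.

```python
import re

try:
    string_types = basestring  # Python 2: covers both str and unicode
except NameError:
    string_types = str         # Python 3

def clean_strings(node):
    """Return a copy of a parsed JSON value with every string value cleaned."""
    if isinstance(node, string_types):
        # Hyphens become spaces; , . ? ! and ' are deleted.
        return re.sub(r"[,.?!']", '', node.replace('-', ' '))
    if isinstance(node, list):
        return [clean_strings(item) for item in node]
    if isinstance(node, dict):
        # Keys are left untouched; only the values are cleaned.
        return dict((key, clean_strings(value)) for key, value in node.items())
    return node  # numbers, booleans, and None pass through unchanged

data = {
    "version": 1,
    "data": {"phrases": ["This one, has a comma in it!",
                         "hyphenated-sentence example"]},
}
cleaned = clean_strings(data)
```

The result can be written out with `json.dump(cleaned, outfile, indent=4)` exactly as in the snippets above.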
Upvotes: 4