bonzo

Reputation: 310

Remove Punctuation From JSON File Only Inside Quotation Marks

I have multiple JSON files filled with strings that can get up to several hundred lines. I'll only have three lines in my example of the file, but on average there are about 200-500 of these "phrases":

{
   "version": 1,
   "data": {
       "phrases":[
           "A few words that's it.",
           "This one, has a comma in it!",
           "hyphenated-sentence example"
        ]
   }
}

I need a script to go into the file (we can call it ExampleData.json) and remove all punctuation (specifically these characters: ,.?!'-) from inside the strings, without removing the , outside of the double quotation marks. Essentially, so that this:

"A few words that's it.",
"This one, has a comma in it!",
"hyphenated-sentence example."

Becomes this:

"A few words that's it",
"This one has a comma in it",
"hyphenated sentence example"

Also note how all the punctuation gets removed except for the hyphen. That gets replaced with a space.


I've found a nearly identical question here, but for CSV files, and I haven't been able to translate the CSV version into something that works with JSON.

The closest I've gotten with Python was on a single string, using someone else's answer from a different thread.

input_str = 'please, remove all the commas between quotes,"like in here, here, here!"'

quotes = False

def noCommas(string):
    quotes = False
    output = ''
    for char in string:
        if char == '"':
            quotes = True
        if quotes == False:
            output += char
        if char != ',' and quotes == True:
            output += char
    return output

print noCommas(input_str)

(Sorry, I don't know how to put code blocks in a quote)
But it only handles a single character at a time. Adding any extra rules causes the text outside the quotes to be doubled (please becomes pplleeaassee).
One last thing: I have to do this in Python 2.7.5, which, from what I've put together searching around, makes this a bit more difficult.
I'm sorry that I'm still this new to Python and have to do something this non-trivial right away, but it wasn't really my choice.

Upvotes: 1

Views: 659

Answers (1)

chitown88

Reputation: 28630

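This should work. It parses the JSON first, so only the string values under "phrases" are changed and the commas that belong to the JSON syntax are left alone.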

import re
import json

# Parse the file so only string values are edited, never the JSON syntax
with open('C:/test/data.json') as json_file:
    data = json.load(json_file)

phrases = data['data']['phrases']
for idx, phrase in enumerate(phrases):
    phrase = re.sub(r'-', ' ', phrase)             # hyphens become spaces
    phrases[idx] = re.sub(r'[^\w\s]', '', phrase)  # drop every other non-word, non-space character

# Overwrite the original file with the cleaned data
with open('C:/test/data.json', 'w') as outfile:
    json.dump(data, outfile, indent=4)
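
If only the characters listed in the question should be removed, rather than everything matched by [^\w\s], a narrower character class can be swapped in. Here is a minimal sketch of that idea, assuming the same data/phrases layout as the example file. The clean_phrase helper and the inline raw string are illustrative names of mine, used so the snippet runs without a file on disk:

```python
import json
import re

def clean_phrase(phrase):
    """Replace hyphens with spaces, then drop , . ? ! and ' characters."""
    phrase = phrase.replace('-', ' ')
    return re.sub(r"[,.?!']", '', phrase)

# Hypothetical in-memory stand-in for the contents of the JSON file
raw = ('{"version": 1, "data": {"phrases": ['
       '"This one, has a comma in it!", "hyphenated-sentence example"]}}')

data = json.loads(raw)
data['data']['phrases'] = [clean_phrase(p) for p in data['data']['phrases']]

print(json.dumps(data, indent=4))
```

Because the apostrophe is in the list of characters to remove, "that's" would come out as "thats" with this version.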

Option 2:

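Load the JSON in as a string, then use a regex to find every substring between double quotes. Strip the punctuation from each of those substrings and write the result back to a file. Note that str.strip only removes punctuation at the start and end of each substring, so apart from hyphens and commas, punctuation in the middle of a phrase (such as the apostrophe in "that's") is left in place: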

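import re
import json
import string

with open('C:/test/data.json') as json_file:
    data = json.load(json_file)

# Serialize back to one string so every quoted substring can be searched
data = json.dumps(data)

# Capture the text between each pair of double quotes (keys included)
strings = re.findall(r'"([^"]*)"', data)

for each in strings:
    new_str = re.sub(r'-', ' ', each)            # hyphens become spaces
    new_str = new_str.strip(string.punctuation)  # leading/trailing punctuation only
    new_str = re.sub(r',', '', new_str)          # commas anywhere in the substring

    data = data.replace('"%s"' % each, '"%s"' % new_str)

with open('C:/test/data_output.json', 'w') as outfile:
    json.dump(json.loads(data), outfile, indent=4)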
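
For completeness, the character-by-character loop from the question can also be made to work. A rough sketch, assuming the text never contains escaped quotes (\") inside a string. The strip_in_quotes name and its remove/to_space parameters are my own, not from the original thread. The flag is toggled on every double quote instead of only being switched on, and a single if/elif chain appends each character at most once, which is probably what avoids the doubling you saw when adding extra rules:

```python
def strip_in_quotes(text, remove=",.?!'", to_space="-"):
    """Clean characters only while inside a double-quoted span."""
    in_quotes = False
    output = []
    for char in text:
        if char == '"':
            in_quotes = not in_quotes  # toggle on opening and closing quotes
            output.append(char)
        elif in_quotes and char in remove:
            continue                   # drop listed punctuation inside quotes
        elif in_quotes and char in to_space:
            output.append(' ')         # hyphen inside quotes becomes a space
        else:
            output.append(char)        # everything else passes through once
    return ''.join(output)

input_str = 'please, remove all the commas between quotes,"like in here, here, here!"'
print(strip_in_quotes(input_str))
```

On real JSON files I would still prefer parsing with the json module, since a plain scan like this cannot tell an escaped quote from a closing one.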

Upvotes: 4
