Reputation: 691

Regex to remove doubled double quotes from CSV

I have an excel sheet that has a lot of data in it in one column in the form of a python dictionary from a sql database. I don't have access to the original database and I can't import the CSV back into sql with the local infile command due to the fact that the keys/values on each row of the CSV are not in the same order. When I export the excel sheet to CSV I get:

"{""first_name"":""John"",""last_name"":""Smith"",""age"":30}"
"{""first_name"":""Tim"",""last_name"":""Johnson"",""age"":34}"

What is the best way to remove the " before and after the curly brackets as well as the extra " around the keys/values?

I also need to leave the integers alone that don't have quotes around them.

I am trying to then import this into python with the json module so that I can print specific keys but I can't import them with the doubled double quotes. I ultimately need the data saved in a file that looks like:

{"first_name":"John","last_name":"Smith","age":30}
{"first_name":"Tim","last_name":"Johnson","age":34}

Any help is most appreciated!

Upvotes: 2

Answers (5)

elyase

Reputation: 40973

This should do it:

with open('old.csv') as old, open('new.csv', 'w') as new:
    new.writelines(re.sub(r'"(?!")', '', line) for line in old)

Upvotes: 2

ridgerunner

Reputation: 34395

Easy:

`text = re.sub(r'"(?!")', '', text)`

Given the input file: TEST.TXT:

"{""first_name"":""John"",""last_name"":""Smith"",""age"":30}"
"{""first_name"":""Tim"",""last_name"":""Johnson"",""age"":34}"

The script:

import re
f = open("TEST.TXT","r")
text_in = f.read()
text_out = re.sub(r'"(?!")', '', text_in)
print(text_out)

produces the following output:

{"first_name":"John","last_name":"Smith","age":30}
{"first_name":"Tim","last_name":"Johnson","age":34}

Upvotes: 2

dawg

Reputation: 103884

You can actual use the csv module and regex to do this:

st='''\
"{""first_name"":""John"",""last_name"":""Smith"",""age"":30}"
"{""first_name"":""Tim"",""last_name"":""Johnson"",""age"":34}"\
'''

import csv, re

data=[]
reader=csv.reader(st, dialect='excel')
for line in reader:
    data.extend(line)

s=re.sub(r'(\w+)',r'"\1"',''.join(data))
s=re.sub(r'({[^}]+})',r'\1\n',s).strip()
print s

Prints

{"first_name":"John","last_name":"Smith","age":"30"}
{"first_name":"Tim","last_name":"Johnson","age":"34"}

Upvotes: 1

Mario Rossi

Reputation: 7799

If the input file is just as shown, and of the small size you mention, you can load the whole file in memory, make the substitutions, and then save it. IMHO, you don't need a RegEx to do this. The easiest to read code that does this is:

with open(filename) as f:
    input= f.read()
input= str.replace('""','"')
input= str.replace('"{','{')
input= str.replace('}"','}')
with open(filename, "w") as f:
    f.write(input)

I tested it with the sample input and it produces:

{"first_name":"John","last_name":"Smith","age":30}
{"first_name":"Tim","last_name":"Johnson","age":34}

Which is exactly what you want.

If you want, you can also pack the code and write

with open(inputFilename) as if:
    with open(outputFilename, "w") as of:
        of.write(if.read().replace('""','"').replace('"{','{').replace('}"','}'))

but I think the first one is much clearer and both do exactly the same.

Upvotes: 1

jabaldonedo

Reputation: 26582

I think you are overthinking the problem, why don't replace data?

l = list()
with open('foo.txt') as f:
    for line in f:
        l.append(line.replace('""','"').replace('"{','{').replace('}"','}'))
s = ''.join(l)

print s # or save it to file

It generates:

{"first_name":"John","last_name":"Smith","age":30}
{"first_name":"Tim","last_name":"Johnson","age":34}

Use a list to store intermediate lines and then invoke .join for improving performance as explained in Good way to append to a string

Upvotes: 1

Regex to remove doubled double quotes from CSV

Answers (5)

text = re.sub(r'"(?!")', '', text)

Related Questions

`text = re.sub(r'"(?!")', '', text)`