padmanabh pande
padmanabh pande

Reputation: 427

Pandas read csv ignoring " character

I have a data frame in CSV separated by the character semicolon(;). Few columns in my data frame are of type JSON and I want to convert them into python dict in the later part.

CSV file <tmp.csv>:

7.384906;7.3849072;2;0.2708226296521021;0;0;0;;
9.05233;9.05192;5;0.5523690040041611;1;0;0;"{\"EMAIL_SENT\": 1}";"{\"-1\": 1}"
3.6593602;3.6593602;2;0.5465436324626254;0;0;0;;
3.5134177;3.5130887;2;0.5692996018584914;0;0;0;;
0.6824124;0.6824124;2;2.9307791611130423;0;0;0;"{\"STAGE_CHANGE\": 1, \"CREATE\": 1}";"{\"-1\": 2}"

When I read this file into pandas,

>>> df = pd.read_csv("tmp.csv", sep=";", header=None)
>>> df

          0         1  2         3  4  5  6                                     7            8
0  7.384906  7.384907  2  0.270823  0  0  0                                   NaN          NaN
1  9.052330  9.051920  5  0.552369  1  0  0                   {\EMAIL_SENT\": 1}"  {\-1\": 1}"
2  3.659360  3.659360  2  0.546544  0  0  0                                   NaN          NaN
3  3.513418  3.513089  2  0.569300  0  0  0                                   NaN          NaN
4  0.682412  0.682412  2  2.930779  0  0  0  {\STAGE_CHANGE\": 1, \"CREATE\": 1}"  {\-1\": 2}"

As shown here, my JSON columns are not read properly. I'm missing character " in the beginning of every JSON. And because of this I cannot convert this to python dict.

>> df[8].apply(lambda elem: {} if pd.isna(elem) else json.loads(elem))

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/anaconda3/envs/forecasting/lib/python3.7/site-packages/pandas/core/series.py", line 3848, in apply
    mapped = lib.map_infer(values, f, convert=convert_dtype)
  File "pandas/_libs/lib.pyx", line 2329, in pandas._libs.lib.map_infer
  File "<stdin>", line 1, in <lambda>
  File "/opt/anaconda3/envs/forecasting/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/opt/anaconda3/envs/forecasting/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/opt/anaconda3/envs/forecasting/lib/python3.7/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)

Please let me know how to fix this.

Upvotes: 1

Views: 1104

Answers (1)

Stef
Stef

Reputation: 30579

Use no quoting in reading the csv and then strip the leading/trailing double quotes before loading the string into json.

df = pd.read_csv("tmp.csv", sep=";", header=None, quoting=3)
df[7].apply(lambda elem: {} if pd.isna(elem) else json.loads(elem.strip('"')))
df[8].apply(lambda elem: {} if pd.isna(elem) else json.loads(elem.strip('"')))

Result:

          0         1  2  ...  6                                 7          8
0  7.384906  7.384907  2  ...  0                                {}         {}
1  9.052330  9.051920  5  ...  0                 {'EMAIL_SENT': 1}  {'-1': 1}
2  3.659360  3.659360  2  ...  0                                {}         {}
3  3.513418  3.513089  2  ...  0                                {}         {}
4  0.682412  0.682412  2  ...  0  {'STAGE_CHANGE': 1, 'CREATE': 1}  {'-1': 2}


Update for the case when the json strings contain the separator
I don't have a general solution, only a (rather ugly) workaround for the case in your example. If you know that the json strings are in the last columns you can read the csv as one column by using a separator that is guaranteed to not be in the strings, then split the first columns on the real separator and the json column on the real separator surrounded by double quotes. From here you can proceed as previously, only empty columns are '' instead of NA.

Example:

df = pd.read_csv('tmp.csv', sep='\n', header=None)[0].str.split(';', 7, expand=True)
df[[7,8]] = df[7].str.split('";"|^;$', expand=True)
df[7] = df[7].apply(lambda elem: {} if elem == '' else json.loads(elem.strip('"').replace('\\"', '"')))
df[8] = df[8].apply(lambda elem: {} if elem == '' else json.loads(elem.strip('"').replace('\\"', '"')))

Result:

           0          1  2                   3  4  5  6                                     7          8
0   7.384906  7.3849072  2  0.2708226296521021  0  0  0                                    {}         {}
1    9.05233    9.05192  5  0.5523690040041611  1  0  0                     {'EMAIL_SENT': 1}  {'-1': 1}
2  3.6593602  3.6593602  2  0.5465436324626254  0  0  0                                    {}         {}
3  3.5134177  3.5130887  2  0.5692996018584914  0  0  0                                    {}         {}
4  0.6824124  0.6824124  2  2.9307791611130423  0  0  0  {'STAGE_CHANGE': 1, 'CREATE; HA': 1}  {'-1': 2}

Upvotes: 2

Related Questions