Lukasz Tracewski
Lukasz Tracewski

Reputation: 11377

Multiple quotechars when reading a CSV

I am trying to read a CSV that looks like this:

import pandas as pd
from io import StringIO

s = """
System Name,System ID,System Type,Flood Source,System Authorization,Rehabilitation Program Status,Responsible Organization
"Napa River, left bank above Tulocay Creek","5305000080","Channel","Napa River","USACE Federally constructed, turned over to public sponsor operations and maintenance","Active","USACE - San Francisco District"
"Napa River, right bank below Napa Creek","5305000050","Levee System","","USACE Federally constructed, turned over to public sponsor operations and maintenance","Active","USACE - San Francisco District"
"Needles "S" Street ","3805030008","Levee System",""S" Street Wash, Dead Mountain HA","USACE Federally constructed, turned over to public sponsor operations and maintenance","Inactive","USACE - Los Angeles District"
"Nevada County Levee 1","1905046000","Levee System","Donner Creek","Locally Constructed, Locally Operated and Maintained","Not Enrolled","California"
"Nevada Levee","7005000873","Levee System","","Other Federal Agency","Not Enrolled","Bureau of Reclamation"
"""

pd.read_csv(StringIO(s))

The problem is that "Needles "S" Street " has multiple quotechars and results in ParseError:

ParserError: Error tokenizing data. C error: Expected 7 fields in line 5, saw 8

I tried this approach, but all attempts at writing my own separator end in having a single-column dataframe. Ideas?

Upvotes: 1

Views: 349

Answers (1)

CodeMonkey
CodeMonkey

Reputation: 23738

Quotes inside quotes must be escaped with double quotes "" and this line is incorrectly escaped.

"Needles "S" Street ","3805030008","Levee System",""S" Street Wash, Dead Mountain HA",...
         ^^^                                       ^^^

"S" must be escaped as as ""S"" in two places. The second place has a preceding quote so the entire multi-line string must be quoted with ''' rather than """.

import pandas as pd
from io import StringIO

s = '''System Name,System ID,System Type,Flood Source,System Authorization,Rehabilitation Program Status,Responsible Organization
"Napa River, left bank above Tulocay Creek","5305000080","Channel","Napa River","USACE Federally constructed, turned over to public sponsor operations and maintenance","Active","USACE - San Francisco District"
"Napa River, right bank below Napa Creek","5305000050","Levee System","","USACE Federally constructed, turned over to public sponsor operations and maintenance","Active","USACE - San Francisco District"
"Needles ""S"" Street ","3805030008","Levee System","""S"" Street Wash, Dead Mountain HA","USACE Federally constructed, turned over to public sponsor operations and maintenance","Inactive","USACE - Los Angeles District"
"Nevada County Levee 1","1905046000","Levee System","Donner Creek","Locally Constructed, Locally Operated and Maintained","Not Enrolled","California"
"Nevada Levee","7005000873","Levee System","","Other Federal Agency","Not Enrolled","Bureau of Reclamation"
'''

df = pd.read_csv(StringIO(s))
print(df)

Output:

                                 System Name  ...        Responsible Organization
0  Napa River, left bank above Tulocay Creek  ...  USACE - San Francisco District
1    Napa River, right bank below Napa Creek  ...  USACE - San Francisco District
2                        Needles "S" Street   ...    USACE - Los Angeles District
3                      Nevada County Levee 1  ...                      California
4                               Nevada Levee  ...           Bureau of Reclamation

If you cannot easily fix your data then a quick fix to the data would be to encode all the legal quotes in your data, remove the illegal quotes, then re-quote the data.

# 1. replace legal quotes with another symbol to replace back later
s = s.replace('\n"', "\n|").replace('"\n', '|\n')
s = s.replace('",', '|,').replace(',"', ',|')

# 2. remove all illegal quote characters in the data
s = s.replace('"', "")

# 3. re-quote the data
s = s.replace('|', '"')

Or do only step 1 and change the quote and delimiter on pd.read_csv() call. This will preserve the illegal quotes as-is.

s = s.replace('\n"', "\n|").replace('"\n', '|\n')
s = s.replace('",', '|,').replace(',"', ',|')

df = pd.read_csv(StringIO(s), delimiter=',', quotechar='|')

Upvotes: 2

Related Questions