Reputation: 11377
I am trying to read a CSV that looks like this:
import pandas as pd
from io import StringIO
s = """
System Name,System ID,System Type,Flood Source,System Authorization,Rehabilitation Program Status,Responsible Organization
"Napa River, left bank above Tulocay Creek","5305000080","Channel","Napa River","USACE Federally constructed, turned over to public sponsor operations and maintenance","Active","USACE - San Francisco District"
"Napa River, right bank below Napa Creek","5305000050","Levee System","","USACE Federally constructed, turned over to public sponsor operations and maintenance","Active","USACE - San Francisco District"
"Needles "S" Street ","3805030008","Levee System",""S" Street Wash, Dead Mountain HA","USACE Federally constructed, turned over to public sponsor operations and maintenance","Inactive","USACE - Los Angeles District"
"Nevada County Levee 1","1905046000","Levee System","Donner Creek","Locally Constructed, Locally Operated and Maintained","Not Enrolled","California"
"Nevada Levee","7005000873","Levee System","","Other Federal Agency","Not Enrolled","Bureau of Reclamation"
"""
pd.read_csv(StringIO(s))
The problem is that "Needles "S" Street "
has multiple quotechars and results in ParseError
:
ParserError: Error tokenizing data. C error: Expected 7 fields in line 5, saw 8
I tried this approach, but all attempts at writing my own separator end in having a single-column dataframe. Ideas?
Upvotes: 1
Views: 349
Reputation: 23738
Quotes inside quotes must be escaped with double quotes "" and this line is incorrectly escaped.
"Needles "S" Street ","3805030008","Levee System",""S" Street Wash, Dead Mountain HA",...
^^^ ^^^
"S" must be escaped as as ""S"" in two places. The second place has a preceding quote so the entire multi-line string must be quoted with ''' rather than """.
import pandas as pd
from io import StringIO
s = '''System Name,System ID,System Type,Flood Source,System Authorization,Rehabilitation Program Status,Responsible Organization
"Napa River, left bank above Tulocay Creek","5305000080","Channel","Napa River","USACE Federally constructed, turned over to public sponsor operations and maintenance","Active","USACE - San Francisco District"
"Napa River, right bank below Napa Creek","5305000050","Levee System","","USACE Federally constructed, turned over to public sponsor operations and maintenance","Active","USACE - San Francisco District"
"Needles ""S"" Street ","3805030008","Levee System","""S"" Street Wash, Dead Mountain HA","USACE Federally constructed, turned over to public sponsor operations and maintenance","Inactive","USACE - Los Angeles District"
"Nevada County Levee 1","1905046000","Levee System","Donner Creek","Locally Constructed, Locally Operated and Maintained","Not Enrolled","California"
"Nevada Levee","7005000873","Levee System","","Other Federal Agency","Not Enrolled","Bureau of Reclamation"
'''
df = pd.read_csv(StringIO(s))
print(df)
Output:
System Name ... Responsible Organization
0 Napa River, left bank above Tulocay Creek ... USACE - San Francisco District
1 Napa River, right bank below Napa Creek ... USACE - San Francisco District
2 Needles "S" Street ... USACE - Los Angeles District
3 Nevada County Levee 1 ... California
4 Nevada Levee ... Bureau of Reclamation
If you cannot easily fix your data then a quick fix to the data would be to encode all the legal quotes in your data, remove the illegal quotes, then re-quote the data.
# 1. replace legal quotes with another symbol to replace back later
s = s.replace('\n"', "\n|").replace('"\n', '|\n')
s = s.replace('",', '|,').replace(',"', ',|')
# 2. remove all illegal quote characters in the data
s = s.replace('"', "")
# 3. re-quote the data
s = s.replace('|', '"')
Or do only step 1 and change the quote and delimiter on pd.read_csv()
call. This will preserve the illegal quotes as-is.
s = s.replace('\n"', "\n|").replace('"\n', '|\n')
s = s.replace('",', '|,').replace(',"', ',|')
df = pd.read_csv(StringIO(s), delimiter=',', quotechar='|')
Upvotes: 2