Reputation: 4335
I have a CSV that looks like this:
"ID"|"Description"|"Date"
"1234"|"good text"|"2019-10-12"
"9012"|"bad"text|here""|"2018-01-12"
I am trying to write to a new CSV that looks like this:
+--------+------------------+--------------+
| "ID" | "Description" | "Date" |
+--------+------------------+--------------+
| "1234" | "good text" | "2019-10-12" |
| "9012" | "bad text|here" | "2018-01-12" |
+--------+------------------+--------------+
The issue is that the delimiter "|" is contained in one of the description values, namely "bad|text". So when I run something like:
df = pd.read_csv(csv_file, encoding='utf-16', sep='|')
I see:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 4
The only resolution I see is to skip the rows containing the delimiter but obviously I want them:
Python Pandas Error tokenizing data
Any suggestions?
Note that file is encoded as UTF-16.
Upvotes: 1
Views: 291
Reputation: 1430
Here is solution, the key engine='python-fwf'
, regex seperator and some cleaning.
import pandas as pd
df = pd.read_csv('file_in.csv', engine='python-fwf', sep='"|"', quotechar='"')
df.to_csv('file_out.csv', sep='|', index=False)
Output:
print(df)
ID Description Date
0 1234 good text 2019-10-12
1 9012 bad" text here ""|"2018-01-12
file_out.csv:
ID|Description|Date
1234|good text|2019-10-12
9012|"bad"" text"|"here """"|""2018-01-12"
Upvotes: 1