Reputation: 99
I need to load in pandas a CSV file that is not 100% CSV "compliant", below an example:
"Transaction date";"Accounting date";"Counterparty's data";"Title"
2021-08-22;2021-08-22;" "SPOLEM" ASS "ALDONA" AUGUSTOW ";" Title 450"
2019-09-02;2019-09-02;" 13XYZ05 "SKOWRONEK" NIEGOWA ";" Title 1300"
2010-07-18;2010-07-18;" APTEKA "SLOWINSKA" SPOLKALEBA ";" Title 123"
I read this csv file (csv_in) into a pandas data frame with the following command:
df = pd.read_csv(csv_in, \
delimiter=';', \
engine='python', \
quoting=1)
I understand that the bad csv formatting is the culprit, but:
Thanks! Evan
Upvotes: 1
Views: 150
Reputation: 195418
You can try to clean your data by escaping the inner quotes "
by \
:
import re
import pandas as pd
from io import StringIO
with open("your_data.csv", "r") as f_in:
data = f_in.read()
data = re.sub(
r'"([^;]+)"',
lambda g: '"' + g.group(1).replace('"', r"\"").strip() + '"',
data,
)
df = pd.read_csv(StringIO(data), delimiter=";", quotechar='"', escapechar="\\")
print(df)
This prints:
Transaction date Accounting date Counterparty's data Title
0 2021-08-22 2021-08-22 "SPOLEM" ASS "ALDONA" AUGUSTOW Title 450
1 2019-09-02 2019-09-02 13XYZ05 "SKOWRONEK" NIEGOWA Title 1300
2 2010-07-18 2010-07-18 APTEKA "SLOWINSKA" SPOLKALEBA Title 123
Upvotes: 1
Reputation: 11650
specify the quotechar and remove the engine attributes (resulting in C being default)
pd.read_csv("csv2.txt", \
delimiter=';', \
quotechar='"')
Transaction date Accounting date Counterparty's data Title
0 2021-08-22 2021-08-22 SPOLEM" ASS "ALDONA" AUGUSTOW " Title 450
1 2019-09-02 2019-09-02 13XYZ05 SKOWRONEK" NIEGOWA " Title 1300
2 2010-07-18 2010-07-18 APTEKA SLOWINSKA" SPOLKALEBA " Title 123
Upvotes: 1