Reputation: 65
I have a CSV file that looks like this,
title 1
"x,y,z,w"
"1,2,3,4"
title 2
"a,s,d,f,g,h,j,k,l,z,x,c,v,b,n,m"
"1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7"
"1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7"
"1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7"
"1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7"
"1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7"
"1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7"
"1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7"
title 3
x,y,z,w
1,2,3,4
I am trying to read this CSV file which has a different column structure, which I was able to do that using skiprows
, skipfooter
, and quoting=csv.QUOTE_NONE
. I am using quoting=csv.QUOTE_NONE
to interpret the double-quotes ("
) row with different columns, not as a single string.
import csv
import pandas as pd
title_1 = 0
title_2 = 3
title_3 = 12
total = 14
title_1_df = pd.read_csv("test_csv.csv", engine='python', skiprows=title_1 + 1, skipfooter =(total - title_2) + 1, quoting=csv.QUOTE_NONE)
title_2_df = pd.read_csv("test_csv.csv", engine='python', skiprows=title_2 + 1, skipfooter=(total - title_2) + 1, quoting=csv.QUOTE_NONE)
title_3_df = pd.read_csv("test_csv.csv", engine='python', skiprows=title_3 + 1)
I was able to read the CSV file, but the double quotes also came with it in DataFrame. Below is the output.
Is there a way to remove the double quotes while reading the CSV file? As you can see in the output of title_1_df
and title_2_df
double quotes are coming in the first and last column and I would like to the output as title_3_df
.
Upvotes: 0
Views: 1889
Reputation:
Looking carefully through pd.read_csv
's (many) options, I can't find a way of removing these quotes during the read, and, thinking about it, I'm not sure there should be one.
Quoting is done for individual values, not rows. For example, it often might be used to intentionally store the commas as cell data, rather than cell separators. By telling pandas to ignore the quoting, you're also telling it the quote character ("
) is just normal data.
For this case, I'd either strip the CSV file of quotes prior to reading it, or remove the quotes after reading it.
Two ways. The first is doing it line by line, which is more intuitive but less efficient. The other way other way is doing it all in one go (less intuitive but more efficient):
import re
with open('test_csv.csv') as f:
text = re.sub(r'"*([\r\n])+"*|(?:^"*|"*$)', '\\1', f.read())
Now, you can either write the processed data back to the file and read the file with pd.read_csv
, or you can read the CSV from the string directly. I'll show both methods:
Writing back to the file:
with open('test_csv.csv', 'w') as f:
f.write(text)
header_df = pd.read_csv("test_csv.csv", ...)
data_df = pd.read_csv("test_csv.csv", ...)
footer_df = pd.read_csv("test_csv.csv", ...)
Reading directly from the processed string:
from io import StringIO
s = StringIO(text)
header_df = pd.read_csv(s, ...); s.seek(0)
data_df = pd.read_csv(s, ...); s.seek(0)
footer_df = pd.read_csv(s, ...); s.seek(0)
Use df.iloc[:, [0, -1]]
to select the first and last column of a dataframe:
def remove_quotes(df):
df.iloc[:, [0, -1]] = df.iloc[:, [0, -1]].astype(str).apply(lambda col: col.str.strip('"')).astype(int)
df.columns = df.columns.str.strip('"')
remove_quotes(header_df)
remove_quotes(data_df)
remove_quotes(footer_df)
Output:
>>> header_df
x y z w
0 1 2 3 4
>>> data_df
a s d f g h j k l z x c v b n m
0 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7
1 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7
2 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7
3 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7
4 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7
5 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7
6 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7
>>> footer_df
x y z w
0 1 2 3 4
Upvotes: 2
Reputation: 6802
You can do it in 2 passes, first read them in as one column, then create a new DataFrame by splitting the values of the column on ','.
import pandas as pd
df = pd.read_csv('quotes.csv')
df2 = pd.DataFrame(df[df.columns[0]].apply(lambda x: x.split(',')).to_list(), columns=df.columns[0].split(','))
print(df2)
Output:
a s d f g h j k l z x c v b n m
0 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7
1 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7
2 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7
3 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7
4 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7
5 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7
6 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7
Upvotes: 1