terraCoder
terraCoder

Reputation: 65

How to read a CSV file where rows are quoted into a dataframe

I have a CSV file that looks like this,

title 1
"x,y,z,w"
"1,2,3,4"
title 2
"a,s,d,f,g,h,j,k,l,z,x,c,v,b,n,m"
"1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7"
"1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7"
"1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7"
"1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7"
"1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7"
"1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7"
"1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7"
title 3
x,y,z,w
1,2,3,4

I am trying to read this CSV file which has a different column structure, which I was able to do that using skiprows, skipfooter, and quoting=csv.QUOTE_NONE. I am using quoting=csv.QUOTE_NONE to interpret the double-quotes (") row with different columns, not as a single string.

import csv
import pandas as pd

title_1 = 0
title_2 = 3
title_3 = 12
total = 14

title_1_df = pd.read_csv("test_csv.csv", engine='python', skiprows=title_1 + 1, skipfooter =(total - title_2) + 1, quoting=csv.QUOTE_NONE)
title_2_df = pd.read_csv("test_csv.csv", engine='python', skiprows=title_2 + 1, skipfooter=(total - title_2) + 1, quoting=csv.QUOTE_NONE)
title_3_df = pd.read_csv("test_csv.csv", engine='python', skiprows=title_3 + 1)

I was able to read the CSV file, but the double quotes also came with it in DataFrame. Below is the output.

Is there a way to remove the double quotes while reading the CSV file? As you can see in the output of title_1_df and title_2_df double quotes are coming in the first and last column and I would like to the output as title_3_df.

Upvotes: 0

Views: 1889

Answers (2)

user17242583
user17242583

Reputation:

Looking carefully through pd.read_csv's (many) options, I can't find a way of removing these quotes during the read, and, thinking about it, I'm not sure there should be one.

Quoting is done for individual values, not rows. For example, it often might be used to intentionally store the commas as cell data, rather than cell separators. By telling pandas to ignore the quoting, you're also telling it the quote character (") is just normal data.

For this case, I'd either strip the CSV file of quotes prior to reading it, or remove the quotes after reading it.


Method 1. Removing the quotes before reading the file

Two ways. The first is doing it line by line, which is more intuitive but less efficient. The other way other way is doing it all in one go (less intuitive but more efficient):

import re

with open('test_csv.csv') as f:
    text = re.sub(r'"*([\r\n])+"*|(?:^"*|"*$)', '\\1', f.read())

Now, you can either write the processed data back to the file and read the file with pd.read_csv, or you can read the CSV from the string directly. I'll show both methods:

Writing back to the file:

with open('test_csv.csv', 'w') as f:
    f.write(text)

header_df = pd.read_csv("test_csv.csv", ...)
data_df = pd.read_csv("test_csv.csv", ...)
footer_df = pd.read_csv("test_csv.csv", ...)

Reading directly from the processed string:

from io import StringIO
s = StringIO(text)

header_df = pd.read_csv(s, ...);  s.seek(0)
data_df = pd.read_csv(s, ...);  s.seek(0)
footer_df = pd.read_csv(s, ...);  s.seek(0)

Method 2. Removing the quotes after reading the file

Use df.iloc[:, [0, -1]] to select the first and last column of a dataframe:

def remove_quotes(df):
    df.iloc[:, [0, -1]] = df.iloc[:, [0, -1]].astype(str).apply(lambda col: col.str.strip('"')).astype(int)
    df.columns = df.columns.str.strip('"')
    
remove_quotes(header_df)
remove_quotes(data_df)
remove_quotes(footer_df)

Output:

>>> header_df
   x  y  z  w
0  1  2  3  4

>>> data_df
   a  s  d  f  g  h  j  k  l  z  x  c  v  b  n  m
0  1  2  3  4  5  6  7  8  9  1  2  3  4  5  6  7
1  1  2  3  4  5  6  7  8  9  1  2  3  4  5  6  7
2  1  2  3  4  5  6  7  8  9  1  2  3  4  5  6  7
3  1  2  3  4  5  6  7  8  9  1  2  3  4  5  6  7
4  1  2  3  4  5  6  7  8  9  1  2  3  4  5  6  7
5  1  2  3  4  5  6  7  8  9  1  2  3  4  5  6  7
6  1  2  3  4  5  6  7  8  9  1  2  3  4  5  6  7

>>> footer_df
   x  y  z  w
0  1  2  3  4

Upvotes: 2

Chris Sears
Chris Sears

Reputation: 6802

You can do it in 2 passes, first read them in as one column, then create a new DataFrame by splitting the values of the column on ','.

import pandas as pd

df = pd.read_csv('quotes.csv')
df2 = pd.DataFrame(df[df.columns[0]].apply(lambda x: x.split(',')).to_list(), columns=df.columns[0].split(','))
print(df2)

Output:

   a  s  d  f  g  h  j  k  l  z  x  c  v  b  n  m
0  1  2  3  4  5  6  7  8  9  1  2  3  4  5  6  7
1  1  2  3  4  5  6  7  8  9  1  2  3  4  5  6  7
2  1  2  3  4  5  6  7  8  9  1  2  3  4  5  6  7
3  1  2  3  4  5  6  7  8  9  1  2  3  4  5  6  7
4  1  2  3  4  5  6  7  8  9  1  2  3  4  5  6  7
5  1  2  3  4  5  6  7  8  9  1  2  3  4  5  6  7
6  1  2  3  4  5  6  7  8  9  1  2  3  4  5  6  7

Upvotes: 1

Related Questions