Luca
Luca

Reputation: 1810

Reading csv with variable number of columns with pandas

I have a csv file with a variable number of columns.

Pandas can handle the situation fairly easily with the name parameter. This works if the first row of the file is either the same length or shorter of names, but fails if it's longer.

Examples

Case 1

First row has the same number of elements as names

from io import StringIO
import pandas as pd

file = StringIO(
        '''1, 2, 3,
           1, 2
           1, 2, 3, 4, 
           1, 2, 3,''')

df = pd.read_csv(file, names=['A','B','C'], index_col=False)
df

# Out[0]:
#    A  B    C
# 0  1  2  3.0
# 1  1  2  NaN
# 2  1  2  3.0
# 3  1  2  3.0

This works.

Case 2

First row has less elements than names

from io import StringIO
import pandas as pd

file = StringIO(
        '''1, 2,
           1, 2
           1, 2, 3, 4, 
           1, 2, 3,''')

df = pd.read_csv(file, names=['A','B','C'], index_col=False)
df

# Out[0]:
#    A  B    C
# 0  1  2  NaN
# 1  1  2  NaN
# 2  1  2  3.0
# 3  1  2  3.0

This works too.

Case 3

First row has more elements than names

from io import StringIO
import pandas as pd

file = StringIO(
        '''1, 2, 3, 4,
           1, 2
           1, 2, 3, 4, 
           1, 2, 3,''')

df = pd.read_csv(file, names=['A','B','C'], index_col=False)
df

# Expected output:
#    A  B    C
# 0  1  2  3.0
# 1  1  2  NaN
# 2  1  2  3.0
# 3  1  2  3.0
#
# Instead I get:
# IndexError: list index out of range

This doesn't work!!!

Question

Is there a way to make this script more robust so that it handles also Case 3 without generating an error? Doesn't this looks a bit like a bug

I'm running pandas version 0.23.4

EDIT:

Case 3b

First row has more elements than names

from io import StringIO
import pandas as pd

file = StringIO(
        '''1, 2, 3, 4
           1, 2
           1, 2, 3, 4, 
           1, 2, 3,''') # NOTE: No comma after the first "4"!!

df = pd.read_csv(file, names=['A','B','C'], index_col=False)
df

#Out[0]:
#    A  B    C
# 0  1  2  3.0
# 1  1  2  NaN
# 2  1  2  3.0
# 3  1  2  3.0

This works!!!

What the hell is going on?

Upvotes: 0

Views: 3510

Answers (1)

Neroksi
Neroksi

Reputation: 1398

Just make use of the usecols params, instead of the names one. names assume that you're listing all the columns' name, whereas usecolsassume a subample of the columns.

from io import StringIO
import pandas as pd

file = StringIO(
        '''1, 2, 3, 4,
           1, 2
           1, 2, 3, 4, 
           1, 2, 3,''')

df = pd.read_csv(file, usecols =[0,1,2], header = None)
df
0   1   2
0   1   2   3.0
1   1   2   NaN
2   1   2   3.0
3   1   2   3.0

Upvotes: 3

Related Questions