Reputation: 1810
I have a csv file with a variable number of columns.
Pandas can handle the situation fairly easily with the name
parameter. This works if the first row of the file is either the same length or shorter of names
, but fails if it's longer.
First row has the same number of elements as names
from io import StringIO
import pandas as pd
file = StringIO(
'''1, 2, 3,
1, 2
1, 2, 3, 4,
1, 2, 3,''')
df = pd.read_csv(file, names=['A','B','C'], index_col=False)
df
# Out[0]:
# A B C
# 0 1 2 3.0
# 1 1 2 NaN
# 2 1 2 3.0
# 3 1 2 3.0
This works.
First row has less elements than names
from io import StringIO
import pandas as pd
file = StringIO(
'''1, 2,
1, 2
1, 2, 3, 4,
1, 2, 3,''')
df = pd.read_csv(file, names=['A','B','C'], index_col=False)
df
# Out[0]:
# A B C
# 0 1 2 NaN
# 1 1 2 NaN
# 2 1 2 3.0
# 3 1 2 3.0
This works too.
First row has more elements than names
from io import StringIO
import pandas as pd
file = StringIO(
'''1, 2, 3, 4,
1, 2
1, 2, 3, 4,
1, 2, 3,''')
df = pd.read_csv(file, names=['A','B','C'], index_col=False)
df
# Expected output:
# A B C
# 0 1 2 3.0
# 1 1 2 NaN
# 2 1 2 3.0
# 3 1 2 3.0
#
# Instead I get:
# IndexError: list index out of range
This doesn't work!!!
Is there a way to make this script more robust so that it handles also Case 3 without generating an error? Doesn't this looks a bit like a bug
I'm running pandas version 0.23.4
First row has more elements than names
from io import StringIO
import pandas as pd
file = StringIO(
'''1, 2, 3, 4
1, 2
1, 2, 3, 4,
1, 2, 3,''') # NOTE: No comma after the first "4"!!
df = pd.read_csv(file, names=['A','B','C'], index_col=False)
df
#Out[0]:
# A B C
# 0 1 2 3.0
# 1 1 2 NaN
# 2 1 2 3.0
# 3 1 2 3.0
This works!!!
What the hell is going on?
Upvotes: 0
Views: 3510
Reputation: 1398
Just make use of the usecols
params, instead of the names
one. names
assume that you're listing all the columns' name, whereas usecols
assume a subample of the columns.
from io import StringIO
import pandas as pd
file = StringIO(
'''1, 2, 3, 4,
1, 2
1, 2, 3, 4,
1, 2, 3,''')
df = pd.read_csv(file, usecols =[0,1,2], header = None)
df
0 1 2
0 1 2 3.0
1 1 2 NaN
2 1 2 3.0
3 1 2 3.0
Upvotes: 3