Eduardo Vieira
Eduardo Vieira

Reputation: 121

Pandas read_csv, reading a csv file with a missing header element

I'm trying to import a csv file with pandas.read_csv. The file is as follows:

    "COL_A","COL_B","COL_C"
    "ROW1COLA","ROW1COLB","ROW1COLC","ROW1COLD"
    "ROW2COLA","ROW2COLB","ROW2COLC","ROW2COLD"
    "ROW3COLA","ROW3COLB","ROW3COLC","ROW3COLD"
    "ROW4COLA","ROW4COLB","ROW4COLC","ROW4COLD"
    "ROW5COLA","ROW5COLB","ROW5COLC","ROW5COLD"
    "ROW6COLA","ROW6COLB","ROW6COLC","ROW6COLD"
    "ROW7COLA","ROW7COLB","ROW7COLC","ROW7COLD"

in a first attempt I ran:

    data = pd.read_csv('broken.csv')

and I got:

                 COL_A     COL_B     COL_C
    ROW1COLA  ROW1COLB  ROW1COLC  ROW1COLD
    ROW2COLA  ROW2COLB  ROW2COLC  ROW2COLD
    ROW3COLA  ROW3COLB  ROW3COLC  ROW3COLD
    ROW4COLA  ROW4COLB  ROW4COLC  ROW4COLD
    ROW5COLA  ROW5COLB  ROW5COLC  ROW5COLD
    ROW6COLA  ROW6COLB  ROW6COLC  ROW6COLD
    ROW7COLA  ROW7COLB  ROW7COLC  ROW7COLD

Setting index_col=False

    data = pd.read_csv('broken.csv',index_col=False)

i got

          COL_A     COL_B     COL_C
    0  ROW1COLA  ROW1COLB  ROW1COLC
    1  ROW2COLA  ROW2COLB  ROW2COLC
    2  ROW3COLA  ROW3COLB  ROW3COLC
    3  ROW4COLA  ROW4COLB  ROW4COLC
    4  ROW5COLA  ROW5COLB  ROW5COLC
    5  ROW6COLA  ROW6COLB  ROW6COLC
    6  ROW7COLA  ROW7COLB  ROW7COLC

if I add prefix = 'X'

    data = pd.read_csv('broken.csv',index_col=False,prefix='X')

i get

          COL_A     COL_B     COL_C
    0  ROW1COLA  ROW1COLB  ROW1COLC
    1  ROW2COLA  ROW2COLB  ROW2COLC
    2  ROW3COLA  ROW3COLB  ROW3COLC
    3  ROW4COLA  ROW4COLB  ROW4COLC
    4  ROW5COLA  ROW5COLB  ROW5COLC
    5  ROW6COLA  ROW6COLB  ROW6COLC
    6  ROW7COLA  ROW7COLB  ROW7COLC

Same with read_table

    data = pd.read_table('broken.csv',index_col=True,sep=',')

I want to know if there is any way that pandas automatically assigns a header and take values of the missing header column

Upvotes: 4

Views: 10241

Answers (2)

jezrael
jezrael

Reputation: 863791

I think you can use read_csv with parameters header=0 which first row set to columns and then is overwritten by parameter names to custom column names. Parameter sep=',' is omited, because it is by default:

import pandas as pd
import io

temp=u'''"COL_A","COL_B","COL_C"
"ROW1COLA","ROW1COLB","ROW1COLC","ROW1COLD"
"ROW2COLA","ROW2COLB","ROW2COLC","ROW2COLD"
"ROW3COLA","ROW3COLB","ROW3COLC","ROW3COLD"
"ROW4COLA","ROW4COLB","ROW4COLC","ROW4COLD"
"ROW5COLA","ROW5COLB","ROW5COLC","ROW5COLD"
"ROW6COLA","ROW6COLB","ROW6COLC","ROW6COLD"
"ROW7COLA","ROW7COLB","ROW7COLC","ROW7COLD"'''
#after testing replace io.StringIO(temp) to filename
df = pd.read_csv(io.StringIO(temp), header=0, names=['a','b','c','d'])

print df
          a         b         c         d
0  ROW1COLA  ROW1COLB  ROW1COLC  ROW1COLD
1  ROW2COLA  ROW2COLB  ROW2COLC  ROW2COLD
2  ROW3COLA  ROW3COLB  ROW3COLC  ROW3COLD
3  ROW4COLA  ROW4COLB  ROW4COLC  ROW4COLD
4  ROW5COLA  ROW5COLB  ROW5COLC  ROW5COLD
5  ROW6COLA  ROW6COLB  ROW6COLC  ROW6COLD
6  ROW7COLA  ROW7COLB  ROW7COLC  ROW7COLD

More generic solution with parameters header=None for no columns names from header with skiprows=[0] for skip first row with missing name of last column:

import pandas as pd
import io

temp=u'''"COL_A","COL_B","COL_C"
"ROW1COLA","ROW1COLB","ROW1COLC","ROW1COLD"
"ROW2COLA","ROW2COLB","ROW2COLC","ROW2COLD"
"ROW3COLA","ROW3COLB","ROW3COLC","ROW3COLD"
"ROW4COLA","ROW4COLB","ROW4COLC","ROW4COLD"
"ROW5COLA","ROW5COLB","ROW5COLC","ROW5COLD"
"ROW6COLA","ROW6COLB","ROW6COLC","ROW6COLD"
"ROW7COLA","ROW7COLB","ROW7COLC","ROW7COLD"'''
#after testing replace io.StringIO(temp) to filename
df = pd.read_csv(io.StringIO(temp), header=None, skiprows=[0])

print df
          0         1         2         3
0  ROW1COLA  ROW1COLB  ROW1COLC  ROW1COLD
1  ROW2COLA  ROW2COLB  ROW2COLC  ROW2COLD
2  ROW3COLA  ROW3COLB  ROW3COLC  ROW3COLD
3  ROW4COLA  ROW4COLB  ROW4COLC  ROW4COLD
4  ROW5COLA  ROW5COLB  ROW5COLC  ROW5COLD
5  ROW6COLA  ROW6COLB  ROW6COLC  ROW6COLD
6  ROW7COLA  ROW7COLB  ROW7COLC  ROW7COLD

Upvotes: 5

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210982

First column(s) without names/headers are considered as index column(s).

You should also use index_col parameter properly:

data = pd.read_table('broken.csv',index_col=[0],sep=',')

if your first column contains data instead of index, you can skip first row, specify names for your columns, and instruct read_csv that you don't want to read headers:

cols = ['col1','col2','col3','col4']
data = pd.read_table('broken.csv',sep=',', skiprows=[0], header=None, names=cols)

Upvotes: 2

Related Questions