Julien Marrec
Julien Marrec

Reputation: 11905

Pandas read multiindexed csv with blanks

I'm struggling with properly loading a csv that has a multi lines header with blanks. The CSV looks like this:

,,C,,,D,,
A,B,X,Y,Z,X,Y,Z
1,2,3,4,5,6,7,8

CSV Header

What I would like to get is:

Desired Pandas Header

When I try to load with pd.read_csv(file, header=[0,1], sep=','), I end up with the following:

Incorrect result

Is there a way to get the desired result?


Note: alternatively, I would accept this as a result:

Alternative result


Versions used:

Upvotes: 6

Views: 2531

Answers (6)

v8xi
v8xi

Reputation: 11

Import your csv file providing the header row indexes:

df = pd.read_csv('file.csv', header=[0, 1, 2])

Then, you can iterate over each column header, clean it up, assign it to a tuple, the re-assign the dataframe columns using pd.MultiIndex.from_tuples(list_of_tuples)

df.columns = pd.MultiIndex.from_tuples(
[tuple(['' if y.find('Unnamed')==0 else y for y in x]) for x in df.columns]
)

this is the quick one liner I was looking for when trying to figure this out.

Upvotes: 0

Sergi Sergiev
Sergi Sergiev

Reputation: 21

I used a technique to flatten from the multi-index columns and make one column. It works well for me.

your_df.columns = ['_'.join(col).strip() for col in your_df.columns.values]

Upvotes: 0

unutbu
unutbu

Reputation: 880399

Here is an automated way to fix the column index. First, pull the column level values into a DataFrame:

columns = pd.DataFrame(df.columns.tolist())

then rename the Unnamed: columns to NaN:

columns.loc[columns[0].str.startswith('Unnamed:'), 0] = np.nan

and then forward-fill the NaNs:

columns[0] = columns[0].fillna(method='ffill')

so that columns now looks like

In [314]: columns
Out[314]: 
     0  1
0  NaN  A
1  NaN  B
2    C  X
3    C  Y
4    C  Z
5    D  X
6    D  Y
7    D  Z

Now we can find the remaining NaNs and fill them with empty strings:

mask = pd.isnull(columns[0])
columns[0] = columns[0].fillna('')

To make the first two columns, A and B, indexable as df['A'] and df['B'] -- as though they were single-leveled -- you could swap the values in the first and second columns:

columns.loc[mask, [0,1]] = columns.loc[mask, [1,0]].values

Now you can build a new MultiIndex and assign it to df.columns:

df.columns = pd.MultiIndex.from_tuples(columns.to_records(index=False).tolist())

Putting it all together, if data is

,,C,,,D,,
A,B,X,Y,Z,X,Y,Z
1,2,3,4,5,6,7,8
3,4,5,6,7,8,9,0

then

import numpy as np
import pandas as pd
df = pd.read_csv('data', header=[0,1], sep=',')
columns = pd.DataFrame(df.columns.tolist())
columns.loc[columns[0].str.startswith('Unnamed:'), 0] = np.nan
columns[0] = columns[0].fillna(method='ffill')
mask = pd.isnull(columns[0])
columns[0] = columns[0].fillna('')
columns.loc[mask, [0,1]] = columns.loc[mask, [1,0]].values
df.columns = pd.MultiIndex.from_tuples(columns.to_records(index=False).tolist())
    print(df)

yields

   A  B  C        D      
         X  Y  Z  X  Y  Z
0  1  2  3  4  5  6  7  8
1  3  4  5  6  7  8  9  0

Upvotes: 8

Julien Marrec
Julien Marrec

Reputation: 11905

Load the dataframe, with multiindex:

df = pd.read_csv(filelist,header=[0,1], sep=',')

Write a function to replace the index:

def replace_index(df):
    arr = df.columns.values
    l = [list(x) for x in arr]
    for i in range(len(l)):
        if l[i][0][:7] == 'Unnamed':
            if l[i-1][0][:7] != 'Unnamed':
                l[i][0] = l[i-1][0]
    for i in range(len(l)):
        if l[i][0][:7] == 'Unnamed':
                l[i][0] = l[i][1]
                l[i][1] = ''
    index = pd.MultiIndex.from_tuples(l)
    df.columns = index
    return df

Return the new dataframe properly indexed:

replace_index(df)

Upvotes: 0

firelynx
firelynx

Reputation: 32224

There is no magical way of making pandas aware of how you want your index to look, the closest way you can do this is by specifying a lot yourself, like this:

names = ['A', 'B', 
         ('C','X'), ('C', 'Y'), ('C', 'Z'),
         ('D','X'), ('D','Y'), ('D', 'Z')]
pd.read_csv(file, mangle_dupe_cols=True,
            header=1, names=names, index_col=[0, 1])

Gives:

     C        D      
     X  Y  Z  X  Y  Z
A B                  
1 2  3  4  5  6  7  8

To do this in a dynamic fashion, you could read the first two lines of the CSV as they are and loop through the columns you get to generate the names variable dynamically before loading the full dataset.

pd.read_csv(file, nrows=1, header=[0,1], index_col=[0, 1])

Then access the columns and loop to create your header. Again, not a very clean solution, but should work.

Upvotes: 2

sachin saxena
sachin saxena

Reputation: 976

you can read using :

    df = pd.read_csv('file.csv', header=[0, 1], skipinitialspace=True, tupleize_cols=True)

and then

    df.columns = pd.MultiIndex.from_tuples(df.columns)

Upvotes: 0

Related Questions