Gabriel

Reputation: 42389

Why is column order not preserved when handling a csv file with pandas?

There are a number of questions here on SO about pandas not respecting the order of the columns when reading/writing a csv file, some of them dating back five years (!).

According to this answer, this "bug" was fixed in version 0.19.0, but I'm running Python 3.6.4 and pandas 0.22.0 and I still encounter this issue.

Is this a bug that's been around for years, or is this just how pandas works? If so, what's the reasoning behind not preserving column order?


You can reproduce the issue with this csv file and the following code:

import pandas as pd
df = pd.read_csv(
    "test.csv", usecols=('Author', 'Title', 'Abstract Note', 'Url'))
print(df)

Notice that 'Url' is not positioned last in df, as it should be.

Upvotes: 4

Views: 3766

Answers (1)

piRSquared

Reputation: 294488

I believe this is a misunderstanding of what usecols does. The documentation doesn't suggest that the columns come back in the same order presented in the argument.

usecols : array-like or callable, default None

Return a subset of the columns. If array-like, all elements must either be positional (i.e. integer indices into the document columns) or strings that correspond to column names provided either by the user in names or inferred from the document header row(s). For example, a valid array-like usecols parameter would be [0, 1, 2] or ['foo', 'bar', 'baz'].

If callable, the callable function will be evaluated against the column names, returning names where the callable function evaluates to True. An example of a valid callable argument would be lambda x: x.upper() in ['AAA', 'BBB', 'DDD']. Using this parameter results in much faster parsing time and lower memory usage.
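As a small illustration of the callable form described in the docs, here is a minimal sketch. The in-memory CSV text and the `wanted` set are made up for the example (they are not the asker's file); only `pd.read_csv` and its `usecols` parameter come from pandas itself.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file; not the asker's test.csv.
CSV_TEXT = "Author,Title,Url,Year\nA. Writer,Some Title,http://example.com,2018\n"

# Names to keep, compared case-insensitively via upper().
wanted = {'URL', 'AUTHOR'}

# The callable is evaluated against each column name in the header.
df_callable = pd.read_csv(
    io.StringIO(CSV_TEXT), usecols=lambda name: name.upper() in wanted)

callable_columns = list(df_callable.columns)
print(callable_columns)
```

With this sample data, the selected columns are 'Author' and 'Url', in the order they appear in the header.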

In fact, the columns come back in the same order in which they appear in the file.

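import csv

cols = ['Author', 'Title', 'Abstract Note', 'Url']

# Parse the header row with csv.reader so quoting and the trailing newline are handled
with open('test.csv', newline='') as fh:
    header = next(csv.reader(fh))
    print('\n'.join(name for name in header if name in cols))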

Author
Title
Url
Abstract Note

And when we read the file:

df = pd.read_csv(
    "test.csv", usecols=('Author', 'Title', 'Abstract Note', 'Url'))

df.columns

Index(['Author', 'Title', 'Url', 'Abstract Note'], dtype='object')

We see the same column order.

Instead, slice the resulting dataframe with columns in the order you want.

cols = ['Author', 'Title', 'Abstract Note', 'Url']
df = pd.read_csv('test.csv', usecols=cols)[cols]

df.columns

Index(['Author', 'Title', 'Abstract Note', 'Url'], dtype='object')
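For anyone without the original test.csv, the same behaviour can be reproduced with a small in-memory file. This is only a sketch: the CSV text and the helper name `read_in_order` are invented for illustration, and the sample header simply places 'Url' before 'Abstract Note' as in the question's file.

```python
import io

import pandas as pd

# Hypothetical stand-in for test.csv: 'Url' comes before 'Abstract Note'.
CSV_TEXT = (
    "Author,Title,Url,Year,Abstract Note\n"
    "A. Writer,Some Title,http://example.com,2018,Some abstract\n"
)

cols = ['Author', 'Title', 'Abstract Note', 'Url']


def read_in_order(text, cols):
    """Read only `cols`, then reorder the result to match `cols` (illustrative helper)."""
    return pd.read_csv(io.StringIO(text), usecols=cols)[cols]


# usecols alone selects the columns but keeps the file's order.
file_order = list(pd.read_csv(io.StringIO(CSV_TEXT), usecols=cols).columns)

# Indexing with the same list afterwards applies the requested order.
requested_order = list(read_in_order(CSV_TEXT, cols).columns)

print(file_order)
print(requested_order)
```

Here `file_order` has 'Url' ahead of 'Abstract Note', while `requested_order` matches `cols`.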

Upvotes: 2
