Reputation: 42389
There are a number of questions here on SO regarding pandas not respecting the order of the columns when reading/writing a csv file, some of them dating back 5 years (!).
According to this answer, this "bug" was fixed in version 0.19.0, but I'm running Python 3.6.4 and pandas 0.22.0 and I still encounter this issue.
Is this a bug that's been around for years, or is this just how pandas works? If so, what's the reasoning behind not preserving column order?
You can reproduce the issue with this csv file and the following code:
import pandas as pd
df = pd.read_csv(
"test.csv", usecols=('Author', 'Title', 'Abstract Note', 'Url'))
print(df)
Notice that 'Url' is not positioned last in df, as it should be.
Upvotes: 4
Views: 3766
Reputation: 294488
I believe this is a misunderstanding of what usecols does. The documentation doesn't suggest that the columns come back in the order given in the argument.
usecols : array-like or callable, default None
Return a subset of the columns. If array-like, all elements must either be positional (i.e. integer indices into the document columns) or strings that correspond to column names provided either by the user in names or inferred from the document header row(s). For example, a valid array-like usecols parameter would be [0, 1, 2] or ['foo', 'bar', 'baz'].
If callable, the callable function will be evaluated against the column names, returning names where the callable function evaluates to True. An example of a valid callable argument would be lambda x: x.upper() in ['AAA', 'BBB', 'DDD']. Using this parameter results in much faster parsing time and lower memory usage.
In fact, the columns come back in the same order in which they appear in the file.
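cols = ['Author', 'Title', 'Abstract Note', 'Url']
with open('test.csv') as fh:
    # Strip the trailing newline so the last header field can match too.
    header = fh.readline().rstrip('\n').split(',')
    print('\n'.join(name for name in header if name in cols))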
Author
Title
Url
Abstract Note
And when we read the file:
df = pd.read_csv(
"test.csv", usecols=('Author', 'Title', 'Abstract Note', 'Url'))
df.columns
Index(['Author', 'Title', 'Url', 'Abstract Note'], dtype='object')
We see the same column order.
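If you don't have test.csv at hand, the same behaviour can be checked with an in-memory file. This is only a sketch: the CSV text below is a made-up stand-in for the real file (the 'Year' column and the row values are assumptions), with 'Url' placed before 'Abstract Note' in the header as in the output above.

```python
import io

import pandas as pd

# Hypothetical stand-in for test.csv: 'Url' precedes 'Abstract Note' in the header.
CSV_TEXT = (
    "Author,Title,Url,Abstract Note,Year\n"
    "Doe,First paper,http://example.com/1,Some abstract,2017\n"
    "Roe,Second paper,http://example.com/2,Another abstract,2018\n"
)

requested = ["Author", "Title", "Abstract Note", "Url"]

# usecols selects columns; it does not reorder them.
df = pd.read_csv(io.StringIO(CSV_TEXT), usecols=requested)
file_order = list(df.columns)
print(file_order)

# Asking for the same columns in reverse gives the same column order back.
df_reversed = pd.read_csv(io.StringIO(CSV_TEXT), usecols=requested[::-1])
same_order = list(df_reversed.columns) == file_order
print(same_order)
```

Both reads return the columns in header order, regardless of how usecols is ordered.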
Instead, slice the resulting dataframe with columns in the order you want.
cols = ['Author', 'Title', 'Abstract Note', 'Url']
df = pd.read_csv('test.csv', usecols=cols)[cols]
df.columns
Index(['Author', 'Title', 'Abstract Note', 'Url'], dtype='object')
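If you do this in several places, the two steps could be wrapped in a small helper. The function name read_csv_in_order is hypothetical (it is not part of pandas), and the sample data is made up for illustration.

```python
import io

import pandas as pd


def read_csv_in_order(source, columns, **kwargs):
    """Read only `columns` from `source` and return them in the given order.

    Hypothetical helper, not part of pandas.
    """
    # Convert to a list: df[('a', 'b')] would look up a single key named
    # ('a', 'b') instead of selecting two columns.
    columns = list(columns)
    df = pd.read_csv(source, usecols=columns, **kwargs)
    return df[columns]


# Made-up sample data; 'Url' comes before 'Abstract Note' in the file.
csv_text = (
    "Author,Title,Url,Abstract Note\n"
    "Doe,First paper,http://example.com/1,Some abstract\n"
)
df = read_csv_in_order(
    io.StringIO(csv_text), ("Author", "Title", "Abstract Note", "Url")
)
ordered = list(df.columns)
print(ordered)
```

Note the conversion to a list: the question passes usecols as a tuple, which read_csv accepts, but indexing a dataframe with a tuple treats it as one column key, so the reordering step needs a list.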
Upvotes: 2