PedroA

Reputation: 1925

pandas read_csv usecols and names out of sync

When I read selected columns by index from a tab-separated file with pandas read_csv, usecols and names seem to get out of sync with each other.

For example, having the file test.csv:

FOO A   -46450.494736   0.0728830817231
FOO A   -46339.7126846  0.0695018062805
FOO A   -46322.4942905  0.0866205763556
FOO B   -46473.3117983  0.0481618121947
FOO B   -46537.6827055  0.0436893868921
FOO B   -46467.2102205  0.0485001911304
BAR C   -33424.1224914  6.7981041851
BAR C   -33461.4101485  7.40607068177
BAR C   -33404.6396495  4.72117502707

and trying to read 3 columns by index without preserving the original order:

import pandas as pd

cols = [1, 2, 0]
names = ['X', 'Y', 'Z']

df = pd.read_csv(
    'test.csv', sep='\t',
    header=None,
    index_col=None,
    usecols=cols, names=names)

I'm getting the following dataframe:

     X  Y             Z
0  FOO  A -46450.494736
1  FOO  A -46339.712685
2  FOO  A -46322.494290
3  FOO  B -46473.311798
4  FOO  B -46537.682706
5  FOO  B -46467.210220
6  BAR  C -33424.122491
7  BAR  C -33461.410148
8  BAR  C -33404.639650

whereas I would expect column Z to have the FOO and BAR, like this:

     Z  X             Y
0  FOO  A -46450.494736
1  FOO  A -46339.712685
2  FOO  A -46322.494290
3  FOO  B -46473.311798
4  FOO  B -46537.682706
5  FOO  B -46467.210220
6  BAR  C -33424.122491
7  BAR  C -33461.410148
8  BAR  C -33404.639650

I know pandas stores DataFrames like a dictionary, so the order of the columns may differ from the order requested with usecols, but the problem here is that combining usecols given as indices with names produces a result that doesn't make sense: the names end up on the wrong columns.

I really need to read the columns by their indices and then assign names to them. Is there any workaround for this?

Upvotes: 0

Views: 3598

Answers (1)

chrisb

Reputation: 52276

The documentation could be clearer on this (feel free to open an issue or, even better, submit a pull request!), but usecols is set-like: it does not define an order of columns, it is only used to test membership.

from io import StringIO
import pandas as pd

pd.read_csv(StringIO("""a,b,c
1,2,3
4,5,6"""), usecols=[0, 1, 2])

Out[31]: 
   a  b  c
0  1  2  3
1  4  5  6

pd.read_csv(StringIO("""a,b,c
1,2,3
4,5,6"""), usecols=[2, 1, 0])

Out[32]: 
   a  b  c
0  1  2  3
1  4  5  6

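names, on the other hand, is ordered: its entries are assigned to the selected columns in the order those columns appear in the file. So in this case, the answer is to specify the names in that order, e.g. names=['Z', 'X', 'Y'] for usecols=[1, 2, 0].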
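If the index-to-name pairing is only known at runtime, one possible workaround is to sort the names by their column index before passing them in, then reorder the resulting columns afterwards. The following is only a sketch under a few assumptions: the inline two-row sample stands in for test.csv, and ordered_names is a helper variable introduced here for illustration, not anything pandas provides.

```python
from io import StringIO
import pandas as pd

# Two rows from the question's file, inlined so the example is self-contained
data = (
    "FOO\tA\t-46450.494736\t0.0728830817231\n"
    "BAR\tC\t-33424.1224914\t6.7981041851\n"
)

cols = [1, 2, 0]          # column indices to read
names = ['X', 'Y', 'Z']   # intended pairing: X <- 1, Y <- 2, Z <- 0

# usecols is set-like, so sort the names by column index to match file order
ordered_names = [name for _, name in sorted(zip(cols, names))]

df = pd.read_csv(StringIO(data), sep='\t', header=None,
                 usecols=cols, names=ordered_names)

# Restore the originally requested column order
df = df[names]
```

Here ordered_names works out to ['Z', 'X', 'Y'], and the final df[names] step is only needed if the X, Y, Z order matters downstream.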

Upvotes: 2
