Reputation: 113
I had to cast a subset of columns of a big DataFrame in pandas, and it was very slow. So I ran a few tests and discovered that the casting itself is done very fast, but pandas seems to be slow when assigning the newly cast values back to the original DataFrame.
I then came up with another solution that performs a join and avoids assigning to a column subset; it runs pretty fast.
Why is pandas so slow? Might this be a bug? Can anyone reproduce the results?
More tests and the code used to produce the DataFrame are below.
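To make it concrete, here's a minimal, illustrative sketch of the slow path and the join workaround (shapes and dtypes here are made up; real timings will differ):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 1000, size=(10000, 1026)))
cols = list(range(1024))

# Slow: the cast itself is fast, but assigning back into a column subset crawls.
# df.loc[:, cols] = df[cols].astype(float)

# Workaround: cast the subset, drop the originals, join the result back.
# (Note this moves the cast columns to the end of the frame.)
casted = df[cols].astype(float)
df = df.drop(columns=cols).join(casted)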
Upvotes: 1
Views: 1475
Reputation: 1546
Dropping the columns before re-assigning them speeds things up; you can also try np.array:
column_names = newShortEntries.select_dtypes(include=[object]).columns
# Cast first (alternatively: np.array(newShortEntries[column_names], dtype=np.bool_))
temp = newShortEntries[column_names].astype(bool)
# Drop the old object columns, then attach the cast values in one assignment
newShortEntries = newShortEntries.drop(columns=column_names)
newShortEntries[column_names] = temp
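newShortEntries isn't defined above; for a self-contained test, a stand-in like this works (names and shapes are made up):

import numpy as np
import pandas as pd

# Object columns holding 0/1 flags to cast to bool, plus a numeric
# column that select_dtypes(include=[object]) won't pick up.
newShortEntries = pd.DataFrame({
    "sig1": np.random.randint(0, 2, 100000).astype(object),
    "sig2": np.random.randint(0, 2, 100000).astype(object),
    "price": np.random.rand(100000),
})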
Upvotes: 0
Reputation: 52266
There was just a doc note added about this - see here.
Basically you don't want to use loc when casting - instead do:
df[f] = df[f].astype(float)
Also, fyi the copy=False doesn't do any harm here, but it doesn't do any good either - going from ints to floats you're going to have to allocate a new array.
Edit - this was slower than I thought. Here's something of a workaround:
In [61]: df = pd.DataFrame(np.random.randint(0,1000, size=(10000, 1026)))
In [62]: f = list(range(1024))
In [63]: def cast(s):
...: if s.name in f:
...: return s.astype(float)
...: else:
...: return s
In [64]: %timeit df.apply(cast)
1 loop, best of 3: 389 ms per loop
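Note that apply returns a new frame, so assign it back to keep the result. As an alternative I haven't benchmarked, newer pandas (0.19+) accepts a per-column dict in astype, which does the same subset cast in one call:

df = df.apply(cast)

# Untimed alternative (pandas 0.19+); worth benchmarking on your data:
df = df.astype({c: float for c in f})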
Upvotes: 1