Alan Höng

Reputation: 113

Pandas setting column subset slow

I had to cast a subset of the columns of a big DataFrame in pandas, and it was very slow. I ran a few tests and found that the cast itself is very fast, but pandas seems to be slow when assigning the newly cast values back to the original DataFrame.

I then came up with another solution that performs a join and avoids assigning to a column subset, and it runs pretty fast.
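The join approach could look something like the following minimal sketch. The helper name cast_subset_via_join and the sample data are hypothetical, not the original code from the question, and it assumes the index is unique so that the join lines rows up one to one.

```python
import numpy as np
import pandas as pd


def cast_subset_via_join(df, columns, dtype):
    """Return a new DataFrame with `columns` cast to `dtype` (hypothetical helper)."""
    # Cast only the selected columns; this part is fast.
    casted = df[columns].astype(dtype)
    # Remove the originals instead of assigning into a column subset.
    rest = df.drop(columns=columns)
    # Join on the index (assumed unique) and restore the original column order.
    return rest.join(casted)[df.columns]


# Small illustrative frame: cast the two integer columns to float.
sample = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": ["x", "y"]})
joined = cast_subset_via_join(sample, ["a", "b"], float)
```

The original DataFrame is left unchanged, so the result has to be bound to a name (here joined).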

Why is pandas so slow? Might this be a bug? Can anyone reproduce the results?

[Image: slow pandas]

Edit:

More tests and the code used to produce the DataFrame.

[Image: slow pandas 2]

Upvotes: 1

Views: 1475

Answers (2)

user1689987

Reputation: 1546

Dropping the columns before reassigning them speeds things up. You can also try np.array for the cast:

# Cast the object columns separately, drop the originals, then assign the cast columns back.
column_names = newShortEntries.select_dtypes(include=[object]).columns
temp = newShortEntries[column_names].astype(bool)  # alternative: np.array(newShortEntries[column_names], dtype=np.bool_)
newShortEntries = newShortEntries.drop(columns=column_names)
newShortEntries[column_names] = temp

Upvotes: 0

chrisb

Reputation: 52266

There was just a doc note added about this - see here.

Basically you don't want to use loc when casting - instead do:

df[f] = df[f].astype(float)

Also, FYI, copy=False doesn't do any harm here, but it doesn't do any good either: going from ints to floats, you have to allocate a new array anyway.

Edit - this was slower than I thought. Here's something of a workaround:

In [61]: df = pd.DataFrame(np.random.randint(0,1000, size=(10000, 1026)))

In [62]: f = list(range(1024))

In [63]: def cast(s):
    ...:     if s.name in f:
    ...:         return s.astype(float)
    ...:     else:
    ...:         return s

In [64]: %timeit df.apply(cast)
1 loop, best of 3: 389 ms per loop
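Another option worth trying, not part of the timing above: DataFrame.astype accepts a mapping from column label to dtype, so only the listed columns are cast and a new DataFrame is returned. This is a minimal sketch on a smaller frame than the one used above; whether it beats apply depends on your pandas version, so time it on your own data.

```python
import numpy as np
import pandas as pd

# Smaller stand-in for the frame above: 12 integer columns, cast the first 10.
df = pd.DataFrame(np.random.randint(0, 1000, size=(1000, 12)))
f = list(range(10))

# Columns not named in the mapping keep their original dtype.
result = df.astype({col: float for col in f})
```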

Upvotes: 1
