Reputation: 113
I had to cast a subset of columns of a big DataFrame in pandas, and it was very slow. So I ran a few tests and discovered that the casting itself is done very fast, but pandas seems to be slow when assigning the newly cast values back to the original DataFrame.
I then came up with another solution that performs a join and avoids assigning to a column subset; it runs pretty fast.
Why is pandas so slow? Might this be a bug? Can anyone reproduce the results?
More tests and the code used to produce the DataFrame are below.
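To make it concrete, here's a minimal, illustrative sketch of the slow path and the join workaround (shapes and dtypes here are made up; real timings will differ):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 1000, size=(10000, 1026)))
cols = list(range(1024))

# Slow: the cast itself is fast, but assigning back into a column subset crawls.
# df.loc[:, cols] = df[cols].astype(float)

# Workaround: cast the subset, drop the originals, join the result back.
# (Note this moves the cast columns to the end of the frame.)
casted = df[cols].astype(float)
df = df.drop(columns=cols).join(casted)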
Upvotes: 1
Views: 1475
Reputation: 1546
Dropping the columns before re-assigning them speeds things up; you can also try np.array:
column_names = newShortEntries.select_dtypes(include=[object]).columns
# Cast first (alternatively: np.array(newShortEntries[column_names], dtype=np.bool_))
temp = newShortEntries[column_names].astype(bool)
# Drop the old object columns, then attach the cast values in one assignment
newShortEntries = newShortEntries.drop(columns=column_names)
newShortEntries[column_names] = temp
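newShortEntries isn't defined above; for a self-contained test, a stand-in like this works (names and shapes are made up):

import numpy as np
import pandas as pd

# Object columns holding 0/1 flags to cast to bool, plus a numeric
# column that select_dtypes(include=[object]) won't pick up.
newShortEntries = pd.DataFrame({
    "sig1": np.random.randint(0, 2, 100000).astype(object),
    "sig2": np.random.randint(0, 2, 100000).astype(object),
    "price": np.random.rand(100000),
})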
Upvotes: 0
Reputation: 52266
There was just a doc note added about this - see here.
Basically you don't want to use loc when casting - instead do:
df[f] = df[f].astype(float)
Also, fyi the copy=False doesn't do any harm here, but it doesn't do any good either - going from ints to floats you're going to have to allocate a new array.
Edit - this was slower than I thought. Here's something of a workaround:
In [61]: df = pd.DataFrame(np.random.randint(0,1000, size=(10000, 1026)))
In [62]: f = list(range(1024))
In [63]: def cast(s):
...: if s.name in f:
...: return s.astype(float)
...: else:
...: return s
In [64]: %timeit df.apply(cast)
1 loop, best of 3: 389 ms per loop
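Note that apply returns a new frame, so assign it back to keep the result. As an alternative I haven't benchmarked, newer pandas (0.19+) accepts a per-column dict in astype, which does the same subset cast in one call:

df = df.apply(cast)

# Untimed alternative (pandas 0.19+); worth benchmarking on your data:
df = df.astype({c: float for c in f})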
Upvotes: 1