Reputation: 1051
I have a pandas DataFrame with 1,000 columns and 30 million rows. I need to perform some operations (say, addition, multiplication, etc.) on each column. If any value in any column becomes 0 after an operation, I need to stop applying operations to the remaining columns and rows. I would also like to know at which column and row the value became 0.
I have used iterrows with a few checks, but performance is poor because there is so much data. Are there any alternatives to apply and iterrows?
ID PID PC TID
10 1005 8017 3
11 10335 5019 2
12 1000 8017 1
13 243 8870 1
14 4918 8305 3
15 9017 8305 3
Apply operations column-wise:
After applying the operation to the second column, its 3rd value is 0, so the whole process should stop and return the 2nd column, 3rd row.
Output if the operations are performed column-wise:
ID PID PC TID
1 5 8017 3
2 9335 5019 2
3 0 8017 1
4 243 8870 1
5 4918 8305 3
6 9017 8305 3
Output if the operations are performed row-wise:
ID PID PC TID
1 5 80.17 2
2 9335 50.19 1
3 0 8017 1
13 243 8870 1
14 4918 8305 3
15 9017 8305 3
Upvotes: 3
Views: 1094
Reputation: 323326
This is my solution, as I mentioned in the comment:
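df1 = df.copy()  # keep the original values
df['PID'] -= 1000
df['PC'] /= 9
df['TID'] -= 1
df['ID'] -= 9
s = df.eq(0).idxmax(axis=0)  # first row label where each column equals 0 (first label if there is none)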
s
Out[492]:
ID 0
PID 2
PC 0
TID 2
dtype: int64
for x, i in s.items():  # Series.iteritems() was removed in pandas 2.0
    df.loc[i:, x] = df1.loc[i:, x]  # restore the original values from row i onward
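One caveat with idxmax on a boolean frame: for a column that contains no zero at all, it still returns the first index label, so the result looks the same as a zero in the first row. Below is a minimal sketch of one way to guard against that, using the sample data from the question. The helper name first_zero_per_column and the two operations applied are made up for illustration.

```python
import pandas as pd

def first_zero_per_column(frame):
    """Map each column that contains a 0 to the row label of its first 0."""
    is_zero = frame.eq(0)
    has_zero = is_zero.any(axis=0)   # columns that really contain a zero
    first = is_zero.idxmax(axis=0)   # first True per column (first label if none)
    return first[has_zero]           # drop columns without any zero

# Sample data from the question.
df = pd.DataFrame({
    "ID": [10, 11, 12, 13, 14, 15],
    "PID": [1005, 10335, 1000, 243, 4918, 9017],
    "PC": [8017, 5019, 8017, 8870, 8305, 8305],
    "TID": [3, 2, 1, 1, 3, 3],
})

# Illustrative operations only (assumed, not from the question).
out = df.copy()
out["PID"] -= 1000
out["TID"] -= 1

zeros = first_zero_per_column(out)
print(zeros)
```

Only the columns that actually produced a zero are reported, so ID and PC no longer show up with a misleading row label of 0.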
Upvotes: 1
Reputation: 76346
Since you have many more rows than columns, and vectorized operations are so much faster, I'd suggest the following:
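for c in df.columns:
    res = func(df[c])  # func is a placeholder for your vectorized operation on the column
    is_zero = res == 0
    if not is_zero.any():  # No zero found
        df[c] = res
        continue
    # Zero found - apply only up to it.
    before_zero = is_zero.cumsum() == 0  # True for the rows before the first 0
    df.loc[before_zero, c] = res[before_zero]
    break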
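To make the loop concrete, here is a minimal, self-contained sketch on the question's sample data. The function name apply_until_zero, the ops mapping, the specific operations, and the choice to keep the zero itself in the result are all assumptions made for illustration; they are not part of the original answer.

```python
import numpy as np
import pandas as pd

def apply_until_zero(df, ops):
    """Apply ops[col] to each column in order and stop at the first 0 produced.

    Returns (result, location). location is (column, row_label) of the first
    zero, or None if no operation produced a zero.
    """
    out = df.copy()
    for col, op in ops.items():
        res = op(df[col])                    # vectorized operation on one column
        is_zero = (res == 0).to_numpy()
        if not is_zero.any():
            out[col] = res                   # no zero: keep the whole column
            continue
        pos = int(np.argmax(is_zero))        # position of the first zero
        keep = np.arange(len(df)) <= pos     # rows up to and including the zero
        out[col] = res.where(keep, df[col])  # later rows keep their original values
        return out, (col, df.index[pos])
    return out, None

# Sample data from the question.
df = pd.DataFrame({
    "ID": [10, 11, 12, 13, 14, 15],
    "PID": [1005, 10335, 1000, 243, 4918, 9017],
    "PC": [8017, 5019, 8017, 8870, 8305, 8305],
    "TID": [3, 2, 1, 1, 3, 3],
})

# Hypothetical per-column operations, applied in column order.
ops = {
    "ID": lambda s: s - 9,
    "PID": lambda s: s - 1000,
    "PC": lambda s: s / 100,
    "TID": lambda s: s - 1,
}

result, where = apply_until_zero(df, ops)
print(where)
print(result)
```

On this data the loop stops at column PID, row label 2, and leaves PC and TID untouched, which matches the column-wise output shown in the question. Each column is processed with a handful of vectorized calls, so the Python-level loop runs at most once per column rather than once per row.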
Upvotes: 1