Mainland
Mainland

Reputation: 4564

Python Dataframe Compute Min and Max for list of elements

I have a big dataframe with 2000 rows, two columns, each column row consisting a list with around 1000 points. I want to drop negative values in both columns together and then compute min and max. Presently I am doing through for loop and it is taking 30 minutes to complete. Can I do the same operation through vectorized operations?

Expected solution approach:

df = pd.DataFrame({'x':[[-1,0,1,2,10],[1.5,2,4,5]],'y':[[2.5,2.4,2.3,1.5,0.1],[5,4.5,3,-0.1]]})
df = 
                   x                          y
0  [-1, 0, 1, 2, 10]  [2.5, 2.4, 2.3, 1.5, 0.1] 
1     [1.5, 2, 4, 5]          [5, 4.5, 3, -0.1]
### x, y are paired data coming from field. Ex, (-1,2.5), (0,2.4)
# First step: drop negative values in both x and y columns. 
# Find a negative x or y and drop the pair.
# Ex, in first row, drop (-1,2.5) pair. That is, -1 in x and 2.5 in y.
# After dropping negative values
df = 
         x                          y
0  [0, 1, 2, 10]  [2.4, 2.3, 1.5, 0.1] 
1     [1.5, 2, 4]          [5, 4.5, 3]

### Setp2: Find Max in each column
df = 
           x                  y               xmax    ymax
0     [0, 1, 2, 10]  [2.4, 2.3, 1.5, 0.1]      10      2.4
1     [1.5, 2, 4]    [5, 4.5, 3]               4       5

### Setp3: Find y@xmax, x@ymax in each column
df = 
           x                  y               xmax    ymax   y@xmax   x@ymax
0     [0, 1, 2, 10]  [2.4, 2.3, 1.5, 0.1]      10      2.4     0.1      0
1     [1.5, 2, 4]    [5, 4.5, 3]               4       5       3        1.5

Present solution: It is working but it is taking huge time.

for i in range(len(df)):
   ### create an auxiliary dataframe
   auxdf = pd.DataFrame({'x':df['x'].loc[i],'y':df['y'].loc[i]})
   ## Step1: drop negative values
   auxdf = auxdf[(auxdf['x']>0)&(auxdf['y']>0)]
   ### Step2: Max in x and y
   xmax = auxdf['x'].max()
   ymax = auxdf['y'].max()
   ### Step3: x@ymax, y@xmax
   xatymax = auxdf['x'].loc[auxdf['y'].idxmax()]
   yatxmax = auxdf['y'].loc[auxdf['x'].idxmax()]
   ### finally I append xmax,ymax,xatymax,yatxmax to the df

Doing this vectorized operation will minimize the time?

Upvotes: 2

Views: 259

Answers (1)

Shubham Sharma
Shubham Sharma

Reputation: 71689

Solution with numpy

def fast():
    for v in df[['x', 'y']].to_numpy():
        a = np.array([*v])
        a = a[:, (a >= 0).all(axis=0)]
        i = a.argmax(1)
        yield (*a[[0, 1], i], *a[[1, 0], i])


df[['xmax', 'ymax', 'y@xmax', 'x@ymax']] = list(fast())

Result

print(df)

                   x                          y  xmax  ymax  y@xmax  x@ymax
0  [-1, 0, 1, 2, 10]  [2.5, 2.4, 2.3, 1.5, 0.1]  10.0   2.4     0.1     0.0
1     [1.5, 2, 4, 5]          [5, 4.5, 3, -0.1]   4.0   5.0     3.0     1.5

Performance

On sample dataframe with 20000 rows

df = pd.concat([df] * 20000, ignore_index=True)

%%timeit
_ = list(fast())
# 1.10 s ± 112 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Upvotes: 2

Related Questions