Baptiste Nguyen
Baptiste Nguyen

Reputation: 57

Panda dataframe Creation a new column by comparing all other row

I have the following example:

import pandas as pd
import numpy as np
import time

def function(value,df):
    return len(df[(df['A']<value)])

df= pd.DataFrame(np.random.randint(0,100,size=(30000, 1)), columns=['A'])

start=time.time()
df['B']=pd.Series([len(df[df['A']<value]) for value in df['A']])
end=time.time()
print("list comprehension time:",end-start)

start=time.time()
df['B']=df['A'].apply(function,df=df)
end=time.time()
print("apply time:",end-start)

start=time.time()
series = []
for index, row in df.iterrows():
    series.append(len(df[df['A']<row['A']]))
df['B'] = series
end=time.time()
print("loop time:",end-start)

Output:

time: 19.54859232902527
time: 23.598857402801514
time: 26.441001415252686

This example create a new column by counting all the row which value is superior to the current value of the row.

For this type of issue (when I created a new column, after comparing for a row all other row of the dataframe), I have tried the apply function,list comprehension and classic loop but I think they are slow.

Is there a faster way?

Ps: A specialized solution for this example is not the thing which interested me the most. I prefer a general solution for this type of issue.

An another example can be: for a dataframe with a columns of string,create a new column by counting for each row the number of string in the dataframe which begin by the string first letter.

Upvotes: 1

Views: 41

Answers (2)

Quang Hoang
Quang Hoang

Reputation: 150745

In general, broadcasting as Wen's solution is generally the fastest. In this case, looks like rank does the job.

np.random.seed(1)
df= pd.DataFrame(np.random.randint(0,100,size=(30000, 1)), columns=['A'])

%timeit df.A.rank()-1
2.71 ms ± 119 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Upvotes: 0

BENY
BENY

Reputation: 323266

Usually I am using numpy broadcast for this type task

%timeit df['B']=pd.Series([len(df[df['A']<value]) for value in df['A']])
1 loop, best of 3: 25.4 s per loop
%timeit df['B']=(df.A.values<df.A.values[:,None]).sum(1)
1 loop, best of 3: 1.74 s per loop


#df= pd.DataFrame(np.random.randint(0,100,size=(30000, 1)), columns=['A'])

Upvotes: 1

Related Questions