Samesand

Reputation: 111

comparing row values to entire data frame

I need to know how many rows in a dataframe match certain conditions of each row.

For example, for each row I need to count how many rows have the same type, a start time less than the start time of that row, and an end time greater than the end time of that row.

index type start end
0     A    5     10
1     B    3     7
2     A    10    15
3     A    3     14

In this case for row 0, the count should be 1 because row 3 is the same type, it starts at 3 which is lower than 5 and ends at 14, which is higher than 10.

The output should be:

index type start end count
0     A    5     10  1
1     B    3     7   0
2     A    10    15  0
3     A    3     14  0
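The expected counts above can be reproduced with a naive pass over the frame (a reference sketch to pin down the logic, not the fast solution being asked for):

```python
import pandas as pd

df = pd.DataFrame({'type': ['A', 'B', 'A', 'A'],
                   'start': [5, 3, 10, 3],
                   'end': [10, 7, 15, 14]})

# For each row, count rows of the same type that start earlier and end later.
counts = []
for _, row in df.iterrows():
    n = ((df['type'] == row['type']) &
         (df['start'] < row['start']) &
         (df['end'] > row['end'])).sum()
    counts.append(n)
df['count'] = counts

print(df)  # counts are [1, 0, 0, 0]
```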

I am doing this by using lambda:

df['counts'] = df.apply(lambda x: len(df[(df['type']==x[0]) & (df['start']<x[1]) & (df['end']>x[2])]), axis=1)

But this takes a long time, and I want to run the same analysis on a much larger dataframe.

Upvotes: 0

Views: 96

Answers (3)

Akaisteph7

Reputation: 6476

Your original version didn't work for me but this did:

df['counts'] = df.apply(lambda x: ((df['type']==x["type"]) & (df['start']<x["start"]) & (df['end']>x["end"])).sum(), axis=1)

I think user3483203's answer is good but, as they note, it can be an issue in terms of memory. Here is another way that should be a bit faster than what you have:

df = pd.DataFrame({'type':['A','B','A','A'], 'start':[5,3,10,3], 'end':[10,7,15,14]})

df["counts"] = 0
for idx, row in df.iterrows():
    # progressively narrow the frame before the final comparison
    temp = df[df['type'] == row["type"]]
    temp = temp[temp['start'] < row["start"]]
    df.at[idx, "counts"] = (temp['end'] > row["end"]).sum()
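For what it's worth, the same loop can be sketched with itertuples, which is usually faster than iterrows because it avoids constructing a Series for every row (same logic, same sample frame as above):

```python
import pandas as pd

df = pd.DataFrame({'type': ['A', 'B', 'A', 'A'],
                   'start': [5, 3, 10, 3],
                   'end': [10, 7, 15, 14]})

df['counts'] = 0
# itertuples yields lightweight namedtuples; row.Index is the original label
for row in df.itertuples():
    df.at[row.Index, 'counts'] = ((df['type'] == row.type) &
                                  (df['start'] < row.start) &
                                  (df['end'] > row.end)).sum()

print(df)  # counts are [1, 0, 0, 0]
```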

Upvotes: 0

kkawabat

Reputation: 1677

You can first use groupby() so you only compare rows within the same "type"; that reduces your search space and removes the extra boolean comparison (i.e. (df['type']==x[0])).

import pandas

df = pandas.DataFrame({'type':['A','B','A','A'], 'start':[5,3,10,3], 'end':[10,7,15,14]})

def get_count(df_type):
    return df_type.apply(lambda x: sum((x['start'] > df_type['start']) & (x['end'] < df_type['end'])), axis=1)
# drop only the group level so the result aligns with the original index
df['count'] = df.groupby('type').apply(get_count).reset_index(level=0, drop=True)

Upvotes: 2

user3483203

Reputation: 51175

np.(less|greater|equal).outer

Beware the memory complexity


u = df[['type', 'start', 'end']].to_records()

m1 = np.less.outer(u['start'], u['start'])
m2 = np.greater.outer(u['end'], u['end'])
m3 = np.equal.outer(u['type'], u['type'])

(m1 & m2 & m3).sum(0)

array([1, 0, 0, 0])
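If the full n x n masks don't fit in memory, the same outer comparisons can be done in row chunks via broadcasting (a sketch; the chunk size of 2 is arbitrary and should be tuned to available memory):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'type': ['A', 'B', 'A', 'A'],
                   'start': [5, 3, 10, 3],
                   'end': [10, 7, 15, 14]})

start = df['start'].to_numpy()
end = df['end'].to_numpy()
typ = df['type'].to_numpy()

counts = np.empty(len(df), dtype=int)
chunk = 2  # each iteration materializes only a (chunk, n) boolean mask
for i in range(0, len(df), chunk):
    s = slice(i, i + chunk)
    m = ((start[s, None] > start[None, :]) &   # other row starts earlier
         (end[s, None] < end[None, :]) &       # other row ends later
         (typ[s, None] == typ[None, :]))       # same type
    counts[s] = m.sum(axis=1)

print(counts)  # [1 0 0 0]
```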

Upvotes: 2
