Salih
Salih

Reputation: 381

Efficient (& perhaps shortest) way of getting a list of column names that satisfies a condition, for each row

I have a dataframe with 5 columns (first column is ID and 4 are country names)

I want a list of country names for each row, that satisfies a particular condition on that respective country column.

enter image description here

df = {'id':['i1','i2','i3','i4','i5'], 'c1':[3,2,4,1,4], 'c2':[4,2,5,5,5], 'c3':[4,5,3,3,3], 'c4':[5,1,2,2,2]}

In the above case I need a IDs for which rating is 4 and above.

I'm expecting the output to be list of companies for where rating was 4 and above for each ID. Can be a dataframe or a dict.

highest_rated_companies = { 'i1': ['c2', 'c3', 'c4'], 'i2': ['c3'], 'i3': ['c1', 'c2'], 'i4': ['c2'], 'i5': ['c1', 'c2'] }

Upvotes: 0

Views: 101

Answers (3)

Tom
Tom

Reputation: 8790

Another option is to use the to_dict method. If you set your id column as the index, you can do:

df = df[df>=4]
d = df.to_dict('index')
output = {ID: [name for name,val in row.items() if not pd.isnull(val)] for ID, row in d.items()}

The last line is to convert the dictionary into the desired format. Time test:

In[0]:
import pandas as pd

df = {'id':['i1','i2','i3','i4','i5'], 'c1':[3,2,4,1,4], 'c2':[4,2,5,5,5], 'c3':[4,5,3,3,3], 'c4':[5,1,2,2,2]}

df = pd.DataFrame(df)
df = df.set_index('id',drop=True)
df = df[df>=4]

%%timeit -n 1000
d = df.to_dict('index')
output = {ID: [name for name,val in row.items() if not pd.isnull(val)] for ID, row in d.items()}

Out[0]
243 µs ± 23.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

This isn't as fast as what @MrNobody33 answered, though: 135 µs ± 4.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Upvotes: 0

MrNobody33
MrNobody33

Reputation: 6483

You could try something like this, with a to_records, that seems to be the fastest as you can see here:

First Option

import pandas as pd
import numpy as np
data = {'id':['i1','i2','i3','i4','i5'], 'c1':[3,2,4,1,4], 'c2':[4,2,5,5,5], 'c3':[4,5,3,3,3], 'c4':[5,1,2,2,2]}

df = pd.DataFrame(data)
print(df)
highest_rated_companies={row[1]:[df.columns[idx] for idx,val in enumerate(list(row)[2:],1) if val>=4] for row in df.to_records()}

Second Option

import pandas as pd

data = {'id':['i1','i2','i3','i4','i5'], 'c1':[3,2,4,1,4], 'c2':[4,2,5,5,5], 'c3':[4,5,3,3,3], 'c4':[5,1,2,2,2]}

df = pd.DataFrame(data)
print(df)
highest_rated_companies={row[0]:[df.columns[idx] for idx,val in enumerate(row[1:],1) if val>=4] for i, row in df.iterrows()}
print(highest_rated_companies)

Both outputs:

df:

   id  c1  c2  c3  c4
0  i1   3   4   4   5
1  i2   2   2   5   1
2  i3   4   5   3   2
3  i4   1   5   3   2
4  i5   4   5   3   2


highest_rated_companies:

{'i1': ['c2', 'c3', 'c4'], 'i2': ['c3'], 'i3': ['c1', 'c2'], 'i4': ['c2'], 'i5': ['c1', 'c2']}

Timestamps:

First Option:

  • 0.0113047 seconds best case, when executed 100 times the script

  • 1.2424291999999468 seconds best case, when executed 10000 times the script

Second Option

  • 0.07292359999996734 seconds best case, when executed 100 times the script

  • 7.821904700000005 seconds best case, when executed 10000 times the script

Edit:

Using dt.to_records(), seem to be the fastest way, since I tested Ehsan's answer and I got when executed 10000 times the script, a timestamp of 50.64001639999992 seconds, and when executed 100 times the script, a timestamp of 0.5399872999998934 seconds. Even it's faster than the Second Option, the First Option keep being the fastest.

Upvotes: 1

Ehsan
Ehsan

Reputation: 12397

You can do this:

df = pd.DataFrame(df)
keys, values = np.where(df[['c1','c2','c3','c4']].ge(4))
highest_rated_companies = pd.DataFrame({'id':df.iloc[keys].id,'c':df.columns[values+1]})

output:

   id   c
0  i1  c2
0  i1  c3
0  i1  c4
1  i2  c3
2  i3  c1
2  i3  c2
3  i4  c2
4  i5  c1
4  i5  c2

You can easily convert it to a dictionary if you prefer.

Upvotes: 0

Related Questions