Reputation: 381

Efficient (& perhaps shortest) way of getting a list of column names that satisfies a condition, for each row

I have a dataframe with 5 columns (first column is ID and 4 are country names)

I want a list of country names for each row, that satisfies a particular condition on that respective country column.

df = {'id':['i1','i2','i3','i4','i5'], 'c1':[3,2,4,1,4], 'c2':[4,2,5,5,5], 'c3':[4,5,3,3,3], 'c4':[5,1,2,2,2]}

In the above case I need a IDs for which rating is 4 and above.

I'm expecting the output to be list of companies for where rating was 4 and above for each ID. Can be a dataframe or a dict.

highest_rated_companies = { 'i1': ['c2', 'c3', 'c4'], 'i2': ['c3'], 'i3': ['c1', 'c2'], 'i4': ['c2'], 'i5': ['c1', 'c2'] }

Upvotes: 0

Answers (3)

Tom

Reputation: 8790

Another option is to use the to_dict method. If you set your id column as the index, you can do:

df = df[df>=4]
d = df.to_dict('index')
output = {ID: [name for name,val in row.items() if not pd.isnull(val)] for ID, row in d.items()}

The last line is to convert the dictionary into the desired format. Time test:

In[0]:
import pandas as pd

df = {'id':['i1','i2','i3','i4','i5'], 'c1':[3,2,4,1,4], 'c2':[4,2,5,5,5], 'c3':[4,5,3,3,3], 'c4':[5,1,2,2,2]}

df = pd.DataFrame(df)
df = df.set_index('id',drop=True)
df = df[df>=4]

%%timeit -n 1000
d = df.to_dict('index')
output = {ID: [name for name,val in row.items() if not pd.isnull(val)] for ID, row in d.items()}

Out[0]
243 µs ± 23.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

This isn't as fast as what @MrNobody33 answered, though: 135 µs ± 4.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Upvotes: 0

MrNobody33

Reputation: 6483

You could try something like this, with a to_records, that seems to be the fastest as you can see here:

First Option

import pandas as pd
import numpy as np
data = {'id':['i1','i2','i3','i4','i5'], 'c1':[3,2,4,1,4], 'c2':[4,2,5,5,5], 'c3':[4,5,3,3,3], 'c4':[5,1,2,2,2]}

df = pd.DataFrame(data)
print(df)
highest_rated_companies={row[1]:[df.columns[idx] for idx,val in enumerate(list(row)[2:],1) if val>=4] for row in df.to_records()}

Second Option

import pandas as pd

data = {'id':['i1','i2','i3','i4','i5'], 'c1':[3,2,4,1,4], 'c2':[4,2,5,5,5], 'c3':[4,5,3,3,3], 'c4':[5,1,2,2,2]}

df = pd.DataFrame(data)
print(df)
highest_rated_companies={row[0]:[df.columns[idx] for idx,val in enumerate(row[1:],1) if val>=4] for i, row in df.iterrows()}
print(highest_rated_companies)

Both outputs:

df:

   id  c1  c2  c3  c4
0  i1   3   4   4   5
1  i2   2   2   5   1
2  i3   4   5   3   2
3  i4   1   5   3   2
4  i5   4   5   3   2


highest_rated_companies:

{'i1': ['c2', 'c3', 'c4'], 'i2': ['c3'], 'i3': ['c1', 'c2'], 'i4': ['c2'], 'i5': ['c1', 'c2']}

Timestamps:

First Option:

0.0113047 seconds best case, when executed 100 times the script
1.2424291999999468 seconds best case, when executed 10000 times the script

Second Option

0.07292359999996734 seconds best case, when executed 100 times the script
7.821904700000005 seconds best case, when executed 10000 times the script

Edit:

Using dt.to_records(), seem to be the fastest way, since I tested Ehsan's answer and I got when executed 10000 times the script, a timestamp of 50.64001639999992 seconds, and when executed 100 times the script, a timestamp of 0.5399872999998934 seconds. Even it's faster than the Second Option, the First Option keep being the fastest.

Upvotes: 1

Ehsan

Reputation: 12397

You can do this:

df = pd.DataFrame(df)
keys, values = np.where(df[['c1','c2','c3','c4']].ge(4))
highest_rated_companies = pd.DataFrame({'id':df.iloc[keys].id,'c':df.columns[values+1]})

output:

   id   c
0  i1  c2
0  i1  c3
0  i1  c4
1  i2  c3
2  i3  c1
2  i3  c2
3  i4  c2
4  i5  c1
4  i5  c2

You can easily convert it to a dictionary if you prefer.

Upvotes: 0

Efficient (&amp; perhaps shortest) way of getting a list of column names that satisfies a condition, for each row

Answers (3)

Related Questions

Efficient (& perhaps shortest) way of getting a list of column names that satisfies a condition, for each row