Reputation: 381
I have a dataframe with 5 columns (first column is ID and 4 are country names)
I want a list of country names for each row, that satisfies a particular condition on that respective country column.
df = {'id':['i1','i2','i3','i4','i5'], 'c1':[3,2,4,1,4], 'c2':[4,2,5,5,5], 'c3':[4,5,3,3,3], 'c4':[5,1,2,2,2]}
In the above case I need a IDs for which rating is 4 and above.
I'm expecting the output to be list of companies for where rating was 4 and above for each ID. Can be a dataframe or a dict.
highest_rated_companies = { 'i1': ['c2', 'c3', 'c4'], 'i2': ['c3'], 'i3': ['c1', 'c2'], 'i4': ['c2'], 'i5': ['c1', 'c2'] }
Upvotes: 0
Views: 101
Reputation: 8790
Another option is to use the to_dict
method. If you set your id
column as the index, you can do:
df = df[df>=4]
d = df.to_dict('index')
output = {ID: [name for name,val in row.items() if not pd.isnull(val)] for ID, row in d.items()}
The last line is to convert the dictionary into the desired format. Time test:
In[0]:
import pandas as pd
df = {'id':['i1','i2','i3','i4','i5'], 'c1':[3,2,4,1,4], 'c2':[4,2,5,5,5], 'c3':[4,5,3,3,3], 'c4':[5,1,2,2,2]}
df = pd.DataFrame(df)
df = df.set_index('id',drop=True)
df = df[df>=4]
%%timeit -n 1000
d = df.to_dict('index')
output = {ID: [name for name,val in row.items() if not pd.isnull(val)] for ID, row in d.items()}
Out[0]
243 µs ± 23.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
This isn't as fast as what @MrNobody33 answered, though:
135 µs ± 4.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Upvotes: 0
Reputation: 6483
You could try something like this, with a to_records
, that seems to be the fastest as you can see here:
First Option
import pandas as pd
import numpy as np
data = {'id':['i1','i2','i3','i4','i5'], 'c1':[3,2,4,1,4], 'c2':[4,2,5,5,5], 'c3':[4,5,3,3,3], 'c4':[5,1,2,2,2]}
df = pd.DataFrame(data)
print(df)
highest_rated_companies={row[1]:[df.columns[idx] for idx,val in enumerate(list(row)[2:],1) if val>=4] for row in df.to_records()}
Second Option
import pandas as pd
data = {'id':['i1','i2','i3','i4','i5'], 'c1':[3,2,4,1,4], 'c2':[4,2,5,5,5], 'c3':[4,5,3,3,3], 'c4':[5,1,2,2,2]}
df = pd.DataFrame(data)
print(df)
highest_rated_companies={row[0]:[df.columns[idx] for idx,val in enumerate(row[1:],1) if val>=4] for i, row in df.iterrows()}
print(highest_rated_companies)
Both outputs:
df:
id c1 c2 c3 c4
0 i1 3 4 4 5
1 i2 2 2 5 1
2 i3 4 5 3 2
3 i4 1 5 3 2
4 i5 4 5 3 2
highest_rated_companies:
{'i1': ['c2', 'c3', 'c4'], 'i2': ['c3'], 'i3': ['c1', 'c2'], 'i4': ['c2'], 'i5': ['c1', 'c2']}
Timestamps:
First Option:
0.0113047
seconds best case, when executed 100
times the script
1.2424291999999468
seconds best case, when executed 10000
times the script
Second Option
0.07292359999996734
seconds best case, when executed 100
times the script
7.821904700000005
seconds best case, when executed 10000
times the script
Edit:
Using dt.to_records()
, seem to be the fastest way, since I tested Ehsan's answer and I got when executed 10000
times the script, a timestamp of 50.64001639999992
seconds, and when executed 100
times the script, a timestamp of 0.5399872999998934
seconds. Even it's faster than the Second Option, the First Option keep being the fastest.
Upvotes: 1
Reputation: 12397
You can do this:
df = pd.DataFrame(df)
keys, values = np.where(df[['c1','c2','c3','c4']].ge(4))
highest_rated_companies = pd.DataFrame({'id':df.iloc[keys].id,'c':df.columns[values+1]})
output:
id c
0 i1 c2
0 i1 c3
0 i1 c4
1 i2 c3
2 i3 c1
2 i3 c2
3 i4 c2
4 i5 c1
4 i5 c2
You can easily convert it to a dictionary if you prefer.
Upvotes: 0