Reputation: 4292
I have a Pandas dataframe that contains a large number of variables. This can be simplified as:
import pandas as pd
tempDF = pd.DataFrame({'var1': [12,12,12,12,45,45,45,51,51,51],
                       'var2': ['a','a','b','b','b','b','b','c','c','d'],
                       'var3': ['e','f','f','f','f','g','g','g','g','g'],
                       'var4': [1,2,3,3,4,5,6,6,6,7]})
If I wanted to select a subset of the dataframe (e.g. var2='b' and var4=3), I would use:
tempDF.loc[(tempDF['var2']=='b') & (tempDF['var4']==3),:]
However, is it possible to select a subset of the dataframe if the matching criteria are stored within a dict, such as:
tempDict = {'var2': 'b','var4': 3}
It's important that the variable names are not predefined and the number of variables included in the dict is changeable.
I've been puzzling over this for a while and so any suggestions would be greatly appreciated.
Upvotes: 7
Views: 3459
Reputation: 41
I found this to be almost as fast as using query and you don't need to create a string.
df[df.apply(lambda r : all(r[t]==m for t,m in your_dict.items()), axis=1)]
Upvotes: 0
Reputation: 36
Here's a function I have in my personal utils which accepts single values or lists to subset on:
def subsetdict(df, sdict):
    subsetter_list = [df[i].isin([j]) if not isinstance(j, list) else df[i].isin(j) for i, j in sdict.items()]
    subsetter = pd.concat(subsetter_list, axis=1).all(1)
    return df.loc[subsetter, :]
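As a quick usage sketch, assuming the tempDF from the question (the function is restated in a slightly condensed form so the snippet runs on its own), a dict can mix single values and lists:

```python
import pandas as pd

tempDF = pd.DataFrame({'var1': [12, 12, 12, 12, 45, 45, 45, 51, 51, 51],
                       'var2': ['a', 'a', 'b', 'b', 'b', 'b', 'b', 'c', 'c', 'd'],
                       'var3': ['e', 'f', 'f', 'f', 'f', 'g', 'g', 'g', 'g', 'g'],
                       'var4': [1, 2, 3, 3, 4, 5, 6, 6, 6, 7]})

def subsetdict(df, sdict):
    # One boolean Series per column; a single value is wrapped in a list for isin
    masks = [df[col].isin(val if isinstance(val, list) else [val])
             for col, val in sdict.items()]
    # A row is kept only if every per-column mask is True
    return df.loc[pd.concat(masks, axis=1).all(axis=1), :]

# var2 may be 'b' or 'c', var4 must equal 6
result = subsetdict(tempDF, {'var2': ['b', 'c'], 'var4': 6})
print(result)
```

Note that an empty dict is not handled here: pd.concat raises when given an empty list, so a caller would need to guard against that case.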
Upvotes: 0
Reputation: 109666
You can evaluate a series of conditions. They don't have to be just an equality.
df = tempDF
d = tempDict
# `repr` returns the string representation of an object.
>>> df[eval(" & ".join(["(df['{0}'] == {1})".format(col, repr(cond))
        for col, cond in d.items()]))]
var1 var2 var3 var4
2 12 b f 3
3 12 b f 3
Looking at what eval does here:
conditions = " & ".join(["(df['{0}'] == {1})".format(col, repr(cond))
    for col, cond in d.items()])
>>> conditions
"(df['var2'] == 'b') & (df['var4'] == 3)"
>>> eval(conditions)
0 False
1 False
2 True
3 True
4 False
5 False
6 False
7 False
8 False
9 False
dtype: bool
Here is another example, this time including an inequality constraint. Each dict value is an (operator, value) pair, and string values carry their own quotes:
d = {'var2': ('==', "'b'"),
     'var4': ('>', 3)}
>>> df[eval(" & ".join(["(df['{0}'] {1} {2})".format(col, cond[0], cond[1])
        for col, cond in d.items()]))]
var1 var2 var3 var4
4 45 b f 4
5 45 b g 5
6 45 b g 6
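If building a string for eval is a concern (for example when the dict comes from user input), the same (operator, value) idea can be sketched without eval by mapping operator strings to functions from the standard operator module. This is an alternative sketch, not the approach used above; the names OPS and filter_by_conditions are hypothetical, and values are passed as plain Python objects rather than quoted strings:

```python
import operator
from functools import reduce

import pandas as pd

tempDF = pd.DataFrame({'var1': [12, 12, 12, 12, 45, 45, 45, 51, 51, 51],
                       'var2': ['a', 'a', 'b', 'b', 'b', 'b', 'b', 'c', 'c', 'd'],
                       'var3': ['e', 'f', 'f', 'f', 'f', 'g', 'g', 'g', 'g', 'g'],
                       'var4': [1, 2, 3, 3, 4, 5, 6, 6, 6, 7]})

# Hypothetical lookup table: operator string -> comparison function
OPS = {'==': operator.eq, '!=': operator.ne,
       '>': operator.gt, '>=': operator.ge,
       '<': operator.lt, '<=': operator.le}

def filter_by_conditions(df, conditions):
    # Build one boolean Series per (column, (operator, value)) entry
    masks = [OPS[op](df[col], value) for col, (op, value) in conditions.items()]
    if not masks:
        return df  # no conditions: nothing to filter
    # Combine all masks with element-wise AND
    return df[reduce(operator.and_, masks)]

d = {'var2': ('==', 'b'), 'var4': ('>', 3)}
result = filter_by_conditions(tempDF, d)
print(result)
```

Because operators are looked up in a fixed table, an unknown operator string raises a KeyError instead of being executed.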
Another alternative is to use query:
qry = " & ".join('{0} {1} {2}'.format(k, cond[0], cond[1]) for k, cond in d.items())
>>> qry
"var2 == 'b' & var4 > 3"
>>> df.query(qry)
var1 var2 var3 var4
4 45 b f 4
5 45 b g 5
6 45 b g 6
Upvotes: 3
Reputation: 76967
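Here's one way to build up conditions from tempDict (using NumPy directly, since the pd.np alias is no longer available in recent pandas versions):
In [24]: import numpy as np
In [25]: tempDF.loc[np.all([tempDF[k] == tempDict[k] for k in tempDict], axis=0), :]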
Out[25]:
var1 var2 var3 var4
2 12 b f 3
3 12 b f 3
Or use query for a more readable, query-like string:
In [33]: tempDF.query(' & '.join(['{0}=={1}'.format(k, repr(v)) for k, v in tempDict.items()]))
Out[33]:
var1 var2 var3 var4
2 12 b f 3
3 12 b f 3
In [34]: ' & '.join(['{0}=={1}'.format(k, repr(v)) for k, v in tempDict.items()])
Out[34]: "var2=='b' & var4==3"
Upvotes: 1
Reputation: 31682
You could create a mask for each condition with a list comprehension, then combine them by converting the list to a DataFrame and calling all:
In [23]: pd.DataFrame([tempDF[key] == val for key, val in tempDict.items()]).T.all(axis=1)
Out[23]:
0 False
1 False
2 True
3 True
4 False
5 False
6 False
7 False
8 False
9 False
dtype: bool
Then you could slice your dataframe with that mask:
mask = pd.DataFrame([tempDF[key] == val for key, val in tempDict.items()]).T.all(axis=1)
In [25]: tempDF[mask]
Out[25]:
var1 var2 var3 var4
2 12 b f 3
3 12 b f 3
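The mask-building step could also be wrapped in a small reusable function. The following is only a sketch under a few assumptions: the name select_by_dict is hypothetical, every condition is a plain equality, and keys missing from the dataframe are reported up front rather than failing midway:

```python
import pandas as pd

tempDF = pd.DataFrame({'var1': [12, 12, 12, 12, 45, 45, 45, 51, 51, 51],
                       'var2': ['a', 'a', 'b', 'b', 'b', 'b', 'b', 'c', 'c', 'd'],
                       'var3': ['e', 'f', 'f', 'f', 'f', 'g', 'g', 'g', 'g', 'g'],
                       'var4': [1, 2, 3, 3, 4, 5, 6, 6, 6, 7]})
tempDict = {'var2': 'b', 'var4': 3}

def select_by_dict(df, criteria):
    # Fail early with a clear message if a key is not a column
    missing = [key for key in criteria if key not in df.columns]
    if missing:
        raise KeyError('columns not in dataframe: {0}'.format(missing))
    # Start with an all-True mask so an empty dict returns every row
    mask = pd.Series(True, index=df.index)
    for key, val in criteria.items():
        mask &= df[key] == val
    return df[mask]

result = select_by_dict(tempDF, tempDict)
print(result)
```

Starting from an all-True mask means the number of keys in the dict can vary freely, including zero.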
Upvotes: 2