Reputation: 4511
I have a dataframe
0 1 2 3 Marketcap
0 1.707280 0.666952 0.638515 -0.061126 2.291747 1.71B
1 -1.017134 1.353627 0.618433 0.008279 0.148128 1.82B
2 -0.774057 -0.165566 -0.083345 0.741598 -0.139851 1.1M
3 -0.630724 0.250737 1.308556 -1.040799 1.064456 30.92M
4 2.029370 0.899612 0.261146 1.474148 -1.663970 476.74k
5 2.029370 0.899612 0.261146 1.474148 -1.663970 -1
Is there some sort of custom filter method, that would let Python know B > M > K?
Say I want to filter, df[df.Marketcap > 35.00M]
, is there a clever or clean way to do this?
Having the M or B makes the value very readable and easy to differentiate.
Thank you.
EDIT: Reopened the thread as Max U's answer while excellent seems to produce a pandas bug, which we opened an issue on Github.
Upvotes: 3
Views: 1037
Reputation: 210812
Source DF:
In [176]: df
Out[176]:
0 1 2 3 Market Cap
0 1.707280 0.666952 0.638515 -0.061126 2.291747 1.71B
1 -1.017134 1.353627 0.618433 0.008279 0.148128 1.82B
2 -0.774057 -0.165566 -0.083345 0.741598 -0.139851 1.1M
3 -0.630724 0.250737 1.308556 -1.040799 1.064456 30.92M
4 2.029370 0.899612 0.261146 1.474148 -1.663970 476.74k
5 2.029370 0.899612 0.261146 1.474148 -1.663970 -1
Solution:
to_replace = ['\d+\s*[Kk]','\d+\s*[Mm]','\d+\s*[Bb]', '-1', 'N/A']
value = [1000,1000000,1000000000, 1, 1]
mask = df.assign(
f=df['Market Cap'].replace(to_replace, value, regex=True),
Marketcap=pd.to_numeric(df['Market Cap'].str.replace(r'[^\d\.]', ''), errors='coerce')
).eval("Marketcap * f < 35000000")
df[mask]
Result:
In [178]: df[mask]
Out[178]:
0 1 2 3 Market Cap
2 -0.774057 -0.165566 -0.083345 0.741598 -0.139851 1.1M
3 -0.630724 0.250737 1.308556 -1.040799 1.064456 30.92M
4 2.029370 0.899612 0.261146 1.474148 -1.663970 476.74k
5 2.029370 0.899612 0.261146 1.474148 -1.663970 -1
PS if you want to leave non-numeric values (like N/A
) in the resulting data set change:
pd.to_numeric(df['Market Cap'].str.replace(r'[^\d\.]', ''), errors='coerce')
to
pd.to_numeric(df['Market Cap'].str.replace(r'[^\d\.]', ''), errors='coerce').fillna('0')
Upvotes: 2
Reputation: 210812
UPDATE:
In [44]: df
Out[44]:
0 1 2 3 4 Marketcap
0 1.707280 0.666952 0.638515 -0.061126 2.291747 1.71B
1 -1.017134 1.353627 0.618433 0.008279 0.148128 1.82B
2 -0.774057 -0.165566 -0.083345 0.741598 -0.139851 1.1M
3 -0.630724 0.250737 1.308556 -1.040799 1.064456 30.92M
4 2.029370 0.899612 0.261146 1.474148 -1.663970 476.74k
5 2.029370 0.899612 0.261146 1.474148 -1.663970 -1
In [45]: df[pd.eval(df.Marketcap.replace(['[Kk]','[Mm]','[Bb]'],
['*10**3','*10**6','*10**9'], regex=True) \
.add(' < 35*10**6'))]
Out[45]:
0 1 2 3 4 Marketcap
2 -0.774057 -0.165566 -0.083345 0.741598 -0.139851 1.1M
3 -0.630724 0.250737 1.308556 -1.040799 1.064456 30.92M
4 2.029370 0.899612 0.261146 1.474148 -1.663970 476.74k
5 2.029370 0.899612 0.261146 1.474148 -1.663970 -1
I'd do it this way:
In [13]: df[pd.eval(df.Marketcap.replace(['M','B'],['','*1000'], regex=True).add(' > 35'))]
Out[13]:
0 1 2 3 4 Marketcap
0 1.707280 0.666952 0.638515 -0.061126 2.291747 1.71B
1 -1.017134 1.353627 0.618433 0.008279 0.148128 1.82B
4 2.029370 0.899612 0.261146 1.474148 -1.663970 100.9M
Explanation:
In [14]: df.Marketcap.replace(['M','B'],['','*1000'], regex=True)
Out[14]:
0 1.71*1000
1 1.82*1000
2 1.1
3 30.92
4 100.9
Name: Marketcap, dtype: object
In [15]: df.Marketcap.replace(['M','B'],['','*1000'], regex=True).add(' > 35')
Out[15]:
0 1.71*1000 > 35
1 1.82*1000 > 35
2 1.1 > 35
3 30.92 > 35
4 100.9 > 35
Name: Marketcap, dtype: object
In [16]: pd.eval(df.Marketcap.replace(['M','B'],['','*1000'], regex=True).add(' > 35'))
Out[16]: array([True, True, False, False, True], dtype=object)
Upvotes: 2
Reputation: 8946
This isn't super clean, but it does the trick and doesn't use any python iteration:
Code:
# Create a separate column (which you can omit later) that converts 'Marketcap' strings to numbers
df['cap'] = df.loc[df['Marketcap'].str.contains('B'), 'Marketcap'].str.replace('B','').astype(float) * 1000
df['cap'].fillna(df.loc[df['Marketcap'].str.contains('M'), 'Marketcap'].str.replace('M',''), inplace = True)
# For pandas pre-0.20.0 (<May 2017)
print df.ix[df['cap'].astype(float) > 35, :-1]
# For pandas 0.20.0+ (.ix[] deprecated)
print df.iloc[df[df['cap'].astype(float) > 35].index, :-1]
# Or, alternate pandas 0.20.0+ option (thanks @Psidom)
print df[df['cap'].astype(float) > 35].iloc[:,:-1]
Output:
0 1 2 3 4 Marketcap
0 1.707280 0.666952 0.638515 -0.061126 2.291747 1.71B
1 -1.017134 1.353627 0.618433 0.008279 0.148128 1.82B
4 2.029370 0.899612 0.261146 1.474148 -1.663970 100.9M
Upvotes: 3