Reputation: 5498
import pandas as pd
import numpy as np

rng = pd.date_range('1/1/2011', periods=6, freq='H')
df = pd.DataFrame({'A': [0, 1, 2, 3, 4, 5],
                   'B': [0, 1, 2, 3, 4, 5],
                   'C': [0, 1, 2, 3, 4, 5],
                   'D': [0, 1, 2, 3, 4, 5],
                   'E': [1, 2, 3, 3, 7, 6],
                   'F': [1, 1, 3, 3, 7, 6],
                   'G': [0, 0, 1, 0, 0, 0]},
                  index=rng)
A simple dataframe to help me explain:
df
A B C D E F G
2011-01-01 00:00:00 0 0 0 0 1 1 0
2011-01-01 01:00:00 1 1 1 1 2 1 0
2011-01-01 02:00:00 2 2 2 2 3 3 1
2011-01-01 03:00:00 3 3 3 3 3 3 0
2011-01-01 04:00:00 4 4 4 4 7 7 0
2011-01-01 05:00:00 5 5 5 5 6 6 0
When I filter for values greater than or equal to 2 I get the following output:
df[df >= 2]
A B C D E F G
2011-01-01 00:00:00 NaN NaN NaN NaN NaN NaN NaN
2011-01-01 01:00:00 NaN NaN NaN NaN 2.0 NaN NaN
2011-01-01 02:00:00 2.0 2.0 2.0 2.0 3.0 3.0 NaN
2011-01-01 03:00:00 3.0 3.0 3.0 3.0 3.0 3.0 NaN
2011-01-01 04:00:00 4.0 4.0 4.0 4.0 7.0 7.0 NaN
2011-01-01 05:00:00 5.0 5.0 5.0 5.0 6.0 6.0 NaN
For each row I want to know which column has the first matching value (working from left to right). So on the row for 2011-01-01 01:00:00
it was column E and the value was 2.0.
Desired output:
What I would like is a new dataframe with the first matched value in a column named 'Value' and another column called "From Col" which captures the name of the column it came from.
If no match is seen, use the name of the last column (G in this case). Thanks for any help.
"Value" "From Col"
2011-01-01 00:00:00 NaN G
2011-01-01 01:00:00 2 E
2011-01-01 02:00:00 2 A
2011-01-01 03:00:00 3 A
2011-01-01 04:00:00 4 A
2011-01-01 05:00:00 5 A
Upvotes: 3
Views: 744
Reputation: 502
I don't work with pandas, so this can be considered just a footnote, but in pure Python it is also possible to find the first non-None index using reduce
(note that in Python 3 it must be imported from functools).
>>> a
[None, None, None, None, 6, None, None, None, 3, None]
>>> print( reduce(lambda x, y: (x or y[1] and y[0]), enumerate(a), None))
4
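Two caveats with the one-liner above: it relies on `reduce` being a builtin (Python 2), and because it tests truthiness of the index, a match at index 0 would be silently lost. A sketch of a corrected reduce form, plus the more idiomatic `next()`-over-generator equivalent:

```python
from functools import reduce

a = [None, None, None, None, 6, None, None, None, 3, None]

# reduce form, adjusted so that an index of 0 is not lost:
# keep the accumulator once set, otherwise take the index of a non-None value
first = reduce(
    lambda x, y: x if x is not None else (y[0] if y[1] is not None else None),
    enumerate(a),
    None,
)

# the more direct equivalent: first index whose value is not None, else None
first2 = next((i for i, v in enumerate(a) if v is not None), None)
```

Both return 4 for the list above, and both correctly return 0 when the first element is non-None.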
Upvotes: 0
Reputation: 210922
Try this:
def get_first_valid(ser):
    if len(ser) == 0:
        return pd.Series([np.nan, np.nan])
    mask = pd.isnull(ser.values)
    i = mask.argmin()
    if mask[i]:
        return pd.Series([np.nan, ser.index[-1]])
    else:
        return pd.Series([ser[i], ser.index[i]])
In [113]: df[df >= 2].apply(get_first_valid, axis=1)
Out[113]:
0 1
2011-01-01 00:00:00 NaN G
2011-01-01 01:00:00 2.0 E
2011-01-01 02:00:00 2.0 A
2011-01-01 03:00:00 3.0 A
2011-01-01 04:00:00 4.0 A
2011-01-01 05:00:00 5.0 A
or:
In [114]: df[df >= 2].T.apply(get_first_valid).T
Out[114]:
0 1
2011-01-01 00:00:00 NaN G
2011-01-01 01:00:00 2 E
2011-01-01 02:00:00 2 A
2011-01-01 03:00:00 3 A
2011-01-01 04:00:00 4 A
2011-01-01 05:00:00 5 A
PS: I took the source code of the Series.first_valid_index()
function and made a dirty hack out of it...
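For comparison, pandas exposes this lookup directly: Series.first_valid_index() returns the label of the first non-NaN entry (or None if the series is all-NaN):

```python
import numpy as np
import pandas as pd

ser = pd.Series([np.nan, np.nan, 5.0, 7.0, np.nan])

first_idx = ser.first_valid_index()  # label of the first non-NaN entry -> 2
first_val = ser[first_idx]           # the value at that label -> 5.0
```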
Explanation:
In [221]: ser = pd.Series([np.nan, np.nan, 5, 7, np.nan])
In [222]: ser
Out[222]:
0 NaN
1 NaN
2 5.0
3 7.0
4 NaN
dtype: float64
In [223]: mask = pd.isnull(ser.values)
In [224]: mask
Out[224]: array([ True, True, False, False, True], dtype=bool)
In [225]: i = mask.argmin()
In [226]: i
Out[226]: 2
In [227]: ser.index[i]
Out[227]: 2
In [228]: ser[i]
Out[228]: 5.0
Upvotes: 1
Reputation: 29719
Firstly, filter values according to the criterion and drop the rows containing all NaNs
. Then, use idxmax
to return the first occurrence of a True
condition per row. This gives our first series.
To create the second series, iterate over the (index, column name) tuple pairs of the first series and look up the corresponding values in the original DF
.
ser1 = (df[df.ge(2)].dropna(how='all').ge(2)).idxmax(1)
# Series.iteritems was removed in pandas 2.0; use .items() instead
ser2 = pd.concat([pd.Series(df.loc[i, c], pd.Index([i])) for i, c in ser1.items()])
Create a new DF
whose index is that of the original DF
and fill the missing values in From Col with the name of its last column.
req_df = pd.DataFrame({"From Col": ser1, "Value": ser2}, index=df.index)
req_df['From Col'] = req_df['From Col'].fillna(df.columns[-1])
req_df
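A self-contained version of the above that runs on current pandas (using `.items()`, since `Series.iteritems` was removed in pandas 2.0), reproducing the asker's frame end to end:

```python
import numpy as np
import pandas as pd

rng = pd.date_range('1/1/2011', periods=6, freq='h')
df = pd.DataFrame({'A': [0, 1, 2, 3, 4, 5], 'B': [0, 1, 2, 3, 4, 5],
                   'C': [0, 1, 2, 3, 4, 5], 'D': [0, 1, 2, 3, 4, 5],
                   'E': [1, 2, 3, 3, 7, 6], 'F': [1, 1, 3, 3, 7, 6],
                   'G': [0, 0, 1, 0, 0, 0]}, index=rng)

# first column label per row (left to right) where the value is >= 2
ser1 = df[df.ge(2)].dropna(how='all').ge(2).idxmax(axis=1)
# the matching value itself, looked up in the original frame
ser2 = pd.concat([pd.Series(df.loc[i, c], pd.Index([i])) for i, c in ser1.items()])

# reindex to the full frame; rows with no match get NaN / the last column's name
req_df = pd.DataFrame({'From Col': ser1, 'Value': ser2}, index=df.index)
req_df['From Col'] = req_df['From Col'].fillna(df.columns[-1])
```

The first row (all values below 2) ends up with a NaN Value and 'G' in From Col, matching the desired output.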
Upvotes: 1