nipy

Reputation: 5498

Return first matching value/column name in new dataframe

import pandas as pd
import numpy as np
rng = pd.date_range('1/1/2011', periods=6, freq='H')
df = pd.DataFrame({'A': [0, 1, 2, 3, 4, 5],
                   'B': [0, 1, 2, 3, 4, 5],
                   'C': [0, 1, 2, 3, 4, 5],
                   'D': [0, 1, 2, 3, 4, 5],
                   'E': [1, 2, 3, 3, 7, 6],
                   'F': [1, 1, 3, 3, 7, 6],
                   'G': [0, 0, 1, 0, 0, 0]},
                  index=rng)

A simple dataframe to help me explain:

df


                    A   B   C   D   E   F   G
2011-01-01 00:00:00 0   0   0   0   1   1   0
2011-01-01 01:00:00 1   1   1   1   2   1   0
2011-01-01 02:00:00 2   2   2   2   3   3   1
2011-01-01 03:00:00 3   3   3   3   3   3   0
2011-01-01 04:00:00 4   4   4   4   7   7   0
2011-01-01 05:00:00 5   5   5   5   6   6   0

When I filter for values greater than or equal to 2 I get the following output:

df[df >= 2]

                     A  B   C   D   E   F   G
2011-01-01 00:00:00 NaN NaN NaN NaN NaN NaN NaN
2011-01-01 01:00:00 NaN NaN NaN NaN 2.0 NaN NaN
2011-01-01 02:00:00 2.0 2.0 2.0 2.0 3.0 3.0 NaN
2011-01-01 03:00:00 3.0 3.0 3.0 3.0 3.0 3.0 NaN
2011-01-01 04:00:00 4.0 4.0 4.0 4.0 7.0 7.0 NaN
2011-01-01 05:00:00 5.0 5.0 5.0 5.0 6.0 6.0 NaN

For each row I want to know which column has the first matching value (working from left to right). So in the row for 2011-01-01 01:00:00 it was column E and the value was 2.0.


Desired output:

What I would like to get is a new dataframe with the first matching value in a column named 'Value' and another column called 'From Col' which captures the column name the value came from.

If no match is seen, take the output from the last column (G in this case). Thanks for any help.

                       "Value" "From Col"   
    2011-01-01 00:00:00    NaN  G
    2011-01-01 01:00:00    2    E
    2011-01-01 02:00:00    2    A
    2011-01-01 03:00:00    3    A
    2011-01-01 04:00:00    4    A
    2011-01-01 05:00:00    5    A

Upvotes: 3

Views: 744

Answers (3)

Juraj Bezručka

Reputation: 502

I don't work with pandas, so consider this just a footnote, but in pure Python it is also possible to find the first non-None index using reduce (note that in Python 3, reduce lives in functools).

>>> a
[None, None, None, None, 6, None, None, None, 3, None]

>>> print( reduce(lambda x, y: (x or y[1] and y[0]), enumerate(a), None))
4
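As a side note, the reduce trick misfires when the first non-None value sits at index 0: `y[1] and y[0]` then yields 0, which is falsy, so the accumulator never latches. A sketch of a safer pure-Python variant using next() with the same list `a` (names here are illustrative):

```python
a = [None, None, None, None, 6, None, None, None, 3, None]

# next() stops at the first non-None entry and handles a match at
# index 0 correctly, because we test "v is not None" rather than
# relying on truthiness.
idx, val = next(((i, v) for i, v in enumerate(a) if v is not None),
                (None, None))
print(idx, val)  # 4 6
```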

Upvotes: 0

MaxU - stand with Ukraine

Reputation: 210922

Try this:

def get_first_valid(ser):
    # empty row -> nothing to report
    if len(ser) == 0:
        return pd.Series([np.nan, np.nan])

    mask = pd.isnull(ser.values)
    i = mask.argmin()           # position of the first non-NaN value
    if mask[i]:                 # all values were NaN
        return pd.Series([np.nan, ser.index[-1]])
    else:
        return pd.Series([ser.iloc[i], ser.index[i]])


In [113]: df[df >= 2].apply(get_first_valid, axis=1)
Out[113]:
                       0  1
2011-01-01 00:00:00  NaN  G
2011-01-01 01:00:00  2.0  E
2011-01-01 02:00:00  2.0  A
2011-01-01 03:00:00  3.0  A
2011-01-01 04:00:00  4.0  A
2011-01-01 05:00:00  5.0  A

or:

In [114]: df[df >= 2].T.apply(get_first_valid).T
Out[114]:
                       0  1
2011-01-01 00:00:00  NaN  G
2011-01-01 01:00:00    2  E
2011-01-01 02:00:00    2  A
2011-01-01 03:00:00    3  A
2011-01-01 04:00:00    4  A
2011-01-01 05:00:00    5  A

P.S. I took the source code of the Series.first_valid_index() function and made a dirty hack out of it...

Explanation:

In [221]: ser = pd.Series([np.nan, np.nan, 5, 7, np.nan])

In [222]: ser
Out[222]:
0    NaN
1    NaN
2    5.0
3    7.0
4    NaN
dtype: float64

In [223]: mask = pd.isnull(ser.values)

In [224]: mask
Out[224]: array([ True,  True, False, False,  True], dtype=bool)

In [225]: i = mask.argmin()

In [226]: i
Out[226]: 2

In [227]: ser.index[i]
Out[227]: 2

In [228]: ser[i]
Out[228]: 5.0
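The same per-row idea can also be written without the manual mask, leaning on Series.first_valid_index() directly; a sketch assuming the df and the >= 2 threshold from the question:

```python
import pandas as pd
import numpy as np

rng = pd.date_range('1/1/2011', periods=6, freq='H')
df = pd.DataFrame({'A': [0, 1, 2, 3, 4, 5], 'B': [0, 1, 2, 3, 4, 5],
                   'C': [0, 1, 2, 3, 4, 5], 'D': [0, 1, 2, 3, 4, 5],
                   'E': [1, 2, 3, 3, 7, 6], 'F': [1, 1, 3, 3, 7, 6],
                   'G': [0, 0, 1, 0, 0, 0]}, index=rng)

masked = df[df >= 2]

# first_valid_index gives the label of the first non-NaN entry per row
# (None for an all-NaN row, which fillna maps to the last column).
from_col = masked.apply(pd.Series.first_valid_index, axis=1).fillna(df.columns[-1])
value = masked.apply(lambda r: r.dropna().iloc[0] if r.notna().any() else np.nan,
                     axis=1)
res = pd.DataFrame({'Value': value, 'From Col': from_col})
```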

Upvotes: 1

Nickil Maveli

Reputation: 29719

Firstly, filter values according to the criterion and drop rows containing all NaNs. Then, use idxmax to return the first occurrence of a True condition along each row. This gives our first series.

To create the second series, iterate over (index, value) tuple pairs of the first series and simultaneously look those locations up in the original DF.

ser1 = (df[df.ge(2)].dropna(how='all').ge(2)).idxmax(axis=1)
ser2 = pd.concat([pd.Series(df.loc[i, r], pd.Index([i])) for i, r in ser1.items()])

Create a new DF whose index pertains to that of the original DF and fill the missing values in From Col with its last column name.

req_df = pd.DataFrame({"From Col": ser1, "Value": ser2}, index=df.index)
req_df['From Col'].fillna(df.columns[-1], inplace=True)
req_df

                        From Col  Value
    2011-01-01 00:00:00        G    NaN
    2011-01-01 01:00:00        E    2.0
    2011-01-01 02:00:00        A    2.0
    2011-01-01 03:00:00        A    3.0
    2011-01-01 04:00:00        A    4.0
    2011-01-01 05:00:00        A    5.0
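For larger frames the row-wise apply/concat work can be avoided entirely; a vectorized sketch with NumPy's argmax, assuming the df and the >= 2 threshold from the question:

```python
import pandas as pd
import numpy as np

rng = pd.date_range('1/1/2011', periods=6, freq='H')
df = pd.DataFrame({'A': [0, 1, 2, 3, 4, 5], 'B': [0, 1, 2, 3, 4, 5],
                   'C': [0, 1, 2, 3, 4, 5], 'D': [0, 1, 2, 3, 4, 5],
                   'E': [1, 2, 3, 3, 7, 6], 'F': [1, 1, 3, 3, 7, 6],
                   'G': [0, 0, 1, 0, 0, 0]}, index=rng)

mask = (df >= 2).to_numpy()
has_match = mask.any(axis=1)
first = mask.argmax(axis=1)   # column position of first True (0 if no True)

res = pd.DataFrame({
    # gate on has_match so the all-False rows get NaN / last column
    'Value': np.where(has_match,
                      df.to_numpy()[np.arange(len(df)), first], np.nan),
    'From Col': np.where(has_match, df.columns.to_numpy()[first],
                         df.columns[-1]),
}, index=df.index)
```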

Upvotes: 1
