jonboy

Reputation: 368

Conditional generation of new column - Pandas

I am trying to create a new column based on conditional logic on pre-existing columns. I understand there may be more efficient ways to achieve this but I have a few conditions that need to be included. This is just the first step.

The overall goal is to create two new columns mapped from columns 1 and 2. These are referenced against the Object column, as I can have multiple rows for each time point.

Object2 and Value determine how to map the new columns. If Value == 'X', I want to match both Object columns and return the corresponding 1 and 2 for that time point to the new columns. The same process should occur if Value == 'Y'. If Value == 'Z', I want to insert 0, 0. Everything else should be NaN.

import numpy as np
import pandas as pd

df = pd.DataFrame({
        'Time' : ['2019-08-02 09:50:10.1','2019-08-02 09:50:10.1','2019-08-02 09:50:10.2','2019-08-02 09:50:10.3','2019-08-02 09:50:10.3','2019-08-02 09:50:10.4','2019-08-02 09:50:10.5','2019-08-02 09:50:10.6','2019-08-02 09:50:10.6'],
        'Object' : ['B','A','A','A','C','C','C','B','B'],
        '1' : [1,3,5,7,9,11,13,15,17],
        '2' : [0,1,4,6,8,10,12,14,16],
        'Object2' : ['A','A',np.nan,'C','C','C','C','B','A'],
        'Value' : ['X','X',np.nan,'Y','Y','Y','Y','Z',np.nan],
        })

def map_12(df):
    for i in df['Value']:
        if i == 'X':
            df['A1'] = df['1']
            df['A2'] = df['2']
        elif i == 'Y':
            df['A1'] = df['1']
            df['A2'] = df['2']
        elif i == 'Z':
            df['A1'] = 0
            df['A2'] = 0
        else:
            df['A1'] = np.nan
            df['A2'] = np.nan
    return df
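For reference, here is a minimal illustration (using a hypothetical two-row frame) of why this loop cannot produce per-row results: each branch assigns the entire A1/A2 columns, so every iteration overwrites the previous one and only the last Value in the column matters.

```python
import pandas as pd

# Hypothetical two-row frame just to demonstrate the overwrite behaviour.
small = pd.DataFrame({'1': [1, 2], 'Value': ['X', 'Z']})

for i in small['Value']:
    if i == 'X':
        small['A1'] = small['1']   # assigns the WHOLE column, not one row
    elif i == 'Z':
        small['A1'] = 0            # second pass overwrites the first entirely

# After the loop, A1 is 0 everywhere -- only the final Value ('Z') had any effect.
print(small['A1'].tolist())  # [0, 0]
```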

Intended Output:

                    Time Object   1   2 Object2 Value    A1    A2
0  2019-08-02 09:50:10.1      A   1   0       A     X   1.0   0.0 # Match A-A at this time point, so output is 1,0
1  2019-08-02 09:50:10.1      B   3   1       A     X   1.0   0.0 # Still at same time point so use 1,0 
2  2019-08-02 09:50:10.2      A   5   4     NaN   NaN   NaN   NaN # No Value so NaN
3  2019-08-02 09:50:10.3      C   7   6       C     Y   7.0   6.0 # Match C-C at this time point, so output is 7,6
4  2019-08-02 09:50:10.3      A   9   8       C     Y   7.0   6.0 # Still at same time point so use 7,6 
5  2019-08-02 09:50:10.4      C  11  10       C     Y  11.0  10.0 # Match C-C at this time point, so output is 11,10
6  2019-08-02 09:50:10.5      C  13  12       C     Y  13.0  12.0 # Match C-C at this time point, so output is 13,12
7  2019-08-02 09:50:10.6      B  15  14       B     Z   0.0   0.0 # Z so 0,0
8  2019-08-02 09:50:10.6      B  17  16       A   NaN   NaN   NaN # No Value so NaN

New sample df:

 df = pd.DataFrame({   
        'Time' : ['2019-08-02 09:50:10.1','2019-08-02 09:50:10.1','2019-08-02 09:50:10.2','2019-08-02 09:50:10.3','2019-08-02 09:50:10.3','2019-08-02 09:50:10.4','2019-08-02 09:50:10.5','2019-08-02 09:50:10.6','2019-08-02 09:50:10.6'],
        'Object' : ['B','A','A','A','C','C','C','B','B'],
        '1' : [1,3,5,7,9,11,13,15,17],  
        '2' : [0,1,4,6,8,10,12,14,16],     
        'Object2' : ['A','A',np.nan,'C','C','C','C','B','A'],                 
        'Value' : ['X','X',np.nan,'Y','Y','Y','Y','Z',np.nan],                
        })

Intended Output:

                    Time Object   1   2 Object2 Value    A1    A2
0  2019-08-02 09:50:10.1      B   1   0       A     X   3.0   1.0 # Match A-A at this time point, so output is 3,1
1  2019-08-02 09:50:10.1      A   3   1       A     X   3.0   1.0 # Still at same time point so use 3,1 
2  2019-08-02 09:50:10.2      A   5   4     NaN   NaN   NaN   NaN # No Value so NaN
3  2019-08-02 09:50:10.3      A   7   6       C     Y   9.0   8.0 # Match C-C at this time point, so output is 9,8
4  2019-08-02 09:50:10.3      C   9   8       C     Y   9.0   8.0 # Still at same time point so use 9,8 
5  2019-08-02 09:50:10.4      C  11  10       C     Y  11.0  10.0 # Match C-C at this time point, so output is 11,10
6  2019-08-02 09:50:10.5      C  13  12       C     Y  13.0  12.0 # Match C-C at this time point, so output is 13,12
7  2019-08-02 09:50:10.6      B  15  14       B     Z   0.0   0.0 # Z so 0,0
8  2019-08-02 09:50:10.6      B  17  16       A   NaN   NaN   NaN # No Value so NaN

Upvotes: 4

Views: 362

Answers (4)

ansev

Reputation: 30920

Use DataFrame.where + Series.eq to create a DataFrame similar to df[['1','2']] but with NaN in every row where matches is False. Then group by time points using DataFrame.groupby and fill in the missing data of each group with the existing values where Object and Object2 coincide (matches == True). Use DataFrame.where to discard values where df['Value'] is NaN. Finally, use DataFrame.mask to set 0 where the Value column is Z.

# matches
matches = df.Object.eq(df.Object2)
# creating conditions
condition_z = df['Value'] == 'Z'
not_null = df['Value'].notnull()
# creating DataFrame to fill
df12 = (df[['1', '2']].where(matches)
                      .groupby(df['Time'], sort=False)
                      .apply(lambda x: x.ffill().bfill()))
# fill 0 where Value is Z and discard NaN
df[['A1', 'A2']] = df12.where(not_null).mask(condition_z, 0)
print(df)

Output

                    Time Object   1   2 Object2 Value    A1    A2
0  2019-08-02 09:50:10.1      B   1   0       A     X   3.0   1.0
1  2019-08-02 09:50:10.1      A   3   1       A     X   3.0   1.0
2  2019-08-02 09:50:10.2      A   5   4     NaN   NaN   NaN   NaN
3  2019-08-02 09:50:10.3      A   7   6       C     Y   9.0   8.0
4  2019-08-02 09:50:10.3      C   9   8       C     Y   9.0   8.0
5  2019-08-02 09:50:10.4      C  11  10       C     Y  11.0  10.0
6  2019-08-02 09:50:10.5      C  13  12       C     Y  13.0  12.0
7  2019-08-02 09:50:10.6      B  15  14       B     Z   0.0   0.0
8  2019-08-02 09:50:10.6      B  17  16       A   NaN   NaN   NaN

We can also use GroupBy.transform:

# matches
matches = df.Object.eq(df.Object2)
# creating conditions
condition_z = df['Value'] == 'Z'
not_null = df['Value'].notnull()
# creating DataFrame to fill
df12 = (df[['1', '2']].where(matches)
                      .groupby(df['Time'], sort=False)
                      .transform('first'))
# fill 0 where Value is Z and discard NaN
df[['A1', 'A2']] = df12.where(not_null).mask(condition_z, 0)
print(df)

Upvotes: 2

jezrael

Reputation: 862751

If there are only a few conditions, use DataFrame.loc to assign values by condition:

m1 = df['Value'].isin(['X','Y'])
m2 = df['Value'] == 'Z'

df[['A1','A2']] = df.loc[m1, ['1','2']]
df.loc[m2, ['A1','A2']] = 0
print(df)
                    Time Object   1   2 Object2 Value   A1   A2
0  2019-08-02 09:50:10.1      A   1   0       A     X  1.0  0.0
1  2019-08-02 09:50:10.1      B   1   1       A     X  1.0  1.0
2  2019-08-02 09:50:10.2      A   5   4     NaN   NaN  NaN  NaN
3  2019-08-02 09:50:10.3      C   7   6       C     Y  7.0  6.0
4  2019-08-02 09:50:10.3      A   9   8       C     Y  9.0  8.0
5  2019-08-02 09:50:10.4      C  11  10     NaN   NaN  NaN  NaN
6  2019-08-02 09:50:10.5      C  13  12       B   NaN  NaN  NaN
7  2019-08-02 09:50:10.6      B  15  14       B     Z  0.0  0.0
8  2019-08-02 09:50:10.6      B  17  16       B   NaN  NaN  NaN

Another solution with numpy.select and broadcasting of masks:

m1 = df['Value'].isin(['X','Y'])
m2 = df['Value'] == 'Z'

masks = [m1.values[:, None], m2.values[:, None]]
values = [df[['1','2']].values, 0]

df[['A1','A2']] = pd.DataFrame(np.select(masks,values, default=np.nan), index=df.index)
print(df)
                    Time Object   1   2 Object2 Value   A1   A2
0  2019-08-02 09:50:10.1      A   1   0       A     X  1.0  0.0
1  2019-08-02 09:50:10.1      B   1   1       A     X  1.0  1.0
2  2019-08-02 09:50:10.2      A   5   4     NaN   NaN  NaN  NaN
3  2019-08-02 09:50:10.3      C   7   6       C     Y  7.0  6.0
4  2019-08-02 09:50:10.3      A   9   8       C     Y  9.0  8.0
5  2019-08-02 09:50:10.4      C  11  10     NaN   NaN  NaN  NaN
6  2019-08-02 09:50:10.5      C  13  12       B   NaN  NaN  NaN
7  2019-08-02 09:50:10.6      B  15  14       B     Z  0.0  0.0
8  2019-08-02 09:50:10.6      B  17  16       B   NaN  NaN  NaN

Upvotes: 1

run-out

Reputation: 3184

I had to make a few adjustments to your dataframe, as it didn't match the desired result in your question.

df = pd.DataFrame(
    {
        "Time": [
            "2019-08-02 09:50:10.1",
            "2019-08-02 09:50:10.1",
            "2019-08-02 09:50:10.2",
            "2019-08-02 09:50:10.3",
            "2019-08-02 09:50:10.3",
            "2019-08-02 09:50:10.4",
            "2019-08-02 09:50:10.5",
            "2019-08-02 09:50:10.6",
            "2019-08-02 09:50:10.6",
        ],
        "Object": ["A", "B", "A", "C", "A", "C", "C", "B", "B"],
        "1": [1, 1, 5, 7, 9, 11, 13, 15, 17],
        "2": [0, 1, 4, 6, 8, 10, 12, 14, 16],
        "Object2": ["A", "A", np.nan, "C", "C", "C", "C", "B", "A"],
        "Value": ["X", "X", np.nan, "Y", "Y", "Y", "Y", "Z", np.nan],
    }
)

This is a vectorized solution that should perform well over large data.

First step is to make sure the dataframe is sorted by time.

df = df.sort_values("Time")

Copy columns 1 and 2

df["A1"] = df["1"]
df["A2"] = df["2"]

We are going to use the index values to obtain the first row of each time group.

df = df.reset_index()

I'm not that happy with the list/isin solution. Curious if anyone knows a less hacky way to do this?

li = df.groupby("Time")["index"].first().tolist()

print(li)
[0, 2, 3, 5, 6, 7]

print(df)
   index                   Time Object   1   2 Object2 Value  A1  A2
0      0  2019-08-02 09:50:10.1      A   1   0       A     X   1   0
1      1  2019-08-02 09:50:10.1      B   1   1       A     X   1   1
2      2  2019-08-02 09:50:10.2      A   5   4     NaN   NaN   5   4
3      3  2019-08-02 09:50:10.3      C   7   6       C     Y   7   6
4      4  2019-08-02 09:50:10.3      A   9   8       C     Y   9   8
5      5  2019-08-02 09:50:10.4      C  11  10       C     Y  11  10
6      6  2019-08-02 09:50:10.5      C  13  12       C     Y  13  12
7      7  2019-08-02 09:50:10.6      B  15  14       B     Z  15  14
8      8  2019-08-02 09:50:10.6      B  17  16       A   NaN  17  16

Filter the dataframe to get all rows except the ones in the list, then set those rows to np.NaN.

df.loc[~df.index.isin(li), ["A1", "A2"]] = np.NaN

Fill forward the first row values.

df[["A1", "A2"]] = df[["A1", "A2"]].ffill(axis=0)

Set rows where Value is Z to 0, and rows where Value is missing back to np.NaN.

df.loc[df["Value"] == "Z", ["A1", "A2"]] = 0
df.loc[df["Value"].isnull(), ["A1", "A2"]] = np.NaN

Remove index column

df = df.drop("index", axis=1)

print(df)
                    Time Object   1   2 Object2 Value    A1    A2
0  2019-08-02 09:50:10.1      A   1   0       A     X   1.0   0.0
1  2019-08-02 09:50:10.1      B   1   1       A     X   1.0   0.0
2  2019-08-02 09:50:10.2      A   5   4     NaN   NaN   NaN   NaN
3  2019-08-02 09:50:10.3      C   7   6       C     Y   7.0   6.0
4  2019-08-02 09:50:10.3      A   9   8       C     Y   7.0   6.0
5  2019-08-02 09:50:10.4      C  11  10       C     Y  11.0  10.0
6  2019-08-02 09:50:10.5      C  13  12       C     Y  13.0  12.0
7  2019-08-02 09:50:10.6      B  15  14       B     Z   0.0   0.0
8  2019-08-02 09:50:10.6      B  17  16       A   NaN   NaN   NaN

Upvotes: 0

anshulk

Reputation: 458

Have a look at DataFrame.apply:

df['A1'] = df.apply(lambda row: row['1'] if row['Value'] == 'X' else np.nan, axis=1)
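A sketch of how this could be extended to cover both columns and all four cases (the helper name map_row is my own). Note that a plain per-row apply cannot do the within-time-point Object/Object2 matching the question asks for, so this only handles the direct row-wise mapping, and it will be slower than the vectorized solutions on large frames:

```python
import numpy as np
import pandas as pd

def map_row(row):
    """Map one row to its (A1, A2) pair based on Value alone."""
    if row['Value'] in ('X', 'Y'):
        return pd.Series([row['1'], row['2']], index=['A1', 'A2'])
    if row['Value'] == 'Z':
        return pd.Series([0, 0], index=['A1', 'A2'])
    return pd.Series([np.nan, np.nan], index=['A1', 'A2'])

df = pd.DataFrame({'1': [1, 7, 15, 5],
                   '2': [0, 6, 14, 4],
                   'Value': ['X', 'Y', 'Z', np.nan]})
df[['A1', 'A2']] = df.apply(map_row, axis=1)
print(df)
```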

Upvotes: 0
