Dfeld

Reputation: 197

Pulling column values based on conditions

I have the following dataframe

import pandas as pd

df = pd.DataFrame({
    'Column_1': ['Position', 'Start', 'End', 'Position'], 
    'Original_1': ['Open', 'Barn', 'Grass', 'Bubble'], 
    'Latest_1': ['Shut', 'Horn', 'Date', 'Dinner'], 
    'Column_2': ['Start', 'Position', 'End', 'During'], 
    'Original_2': ['Sky', 'Hold', 'Car', 'House'], 
    'Latest_2': ['Pedal', 'Lap', 'Two', 'Force'], 
    'Column_3': ['Start', 'End', 'Position', 'During'], 
    'Original_3': ['Leave', 'Dog', 'Block', 'Hope'], 
    'Latest_3': ['Sear', 'Crawl', 'Enter', 'Night']
})

For every instance where the word Position is in 'Column_1', 'Column_2', or 'Column_3', I want to capture the associated values in 'Original_1', 'Original_2', 'Original_3' and assign them to the new column named 'Original_Values'.

The following code can accomplish that, but only on a column-by-column basis.

df['Original_Value1'] = df.loc[df['Column_1'] == 'Position', 'Original_1']
df['Original_Value2'] = df.loc[df['Column_2'] == 'Position', 'Original_2']
df['Original_Value3'] = df.loc[df['Column_3'] == 'Position', 'Original_3']

Is there a way to recreate the above code so that it iterates over the entire data frame (not by specified columns)?

I'm hoping to create one column ('Original_Values') with the following result:

0      Open
1      Hold
2     Block
3    Bubble
Name: Original_Values, dtype: object

Upvotes: 1

Views: 83

Answers (3)

Anton vBR

Reputation: 18916

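How about creating a mask from the three 'Column_*' columns and multiplying it by the values in the three 'Original_*' columns? Selecting the columns by name is safer than selecting them by position, because positions depend on the column order of the dataframe. Multiplying a string by False gives an empty string, so taking max() across each row keeps only the matched value.

cols = ['Column_1', 'Column_2', 'Column_3']
vals = ['Original_1', 'Original_2', 'Original_3']

df['Original_Values'] = ((df[cols] == 'Position') * df[vals].values).max(axis=1)

print(df['Original_Values'])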

Returns:

0      Open
1      Hold
2     Block
3    Bubble
Name: Original_Values, dtype: object
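
A variant of the same masking idea, one that leaves NaN rather than an empty string when no column matches, can be sketched with DataFrame.where() and bfill(). This is only a sketch, assuming the column names from the question; mask_cols, val_cols and picked are names introduced here for illustration, and the 'Latest_*' columns are left out for brevity.

```python
import pandas as pd

df = pd.DataFrame({
    'Column_1': ['Position', 'Start', 'End', 'Position'],
    'Original_1': ['Open', 'Barn', 'Grass', 'Bubble'],
    'Column_2': ['Start', 'Position', 'End', 'During'],
    'Original_2': ['Sky', 'Hold', 'Car', 'House'],
    'Column_3': ['Start', 'End', 'Position', 'During'],
    'Original_3': ['Leave', 'Dog', 'Block', 'Hope'],
})

# Hypothetical helper names: the paired mask and value columns
mask_cols = ['Column_1', 'Column_2', 'Column_3']
val_cols = ['Original_1', 'Original_2', 'Original_3']

# Boolean array: True where the paired 'Column_n' cell equals 'Position'
mask = (df[mask_cols] == 'Position').to_numpy()

# Keep only the matched 'Original_n' cells; everything else becomes NaN
picked = df[val_cols].where(mask)

# Back-fill along each row so the first column holds the first matched value
df['Original_Values'] = picked.bfill(axis=1).iloc[:, 0]

print(df['Original_Values'])
```

Because the mask is converted to a plain array, it is applied by position, so mask_cols and val_cols need to list the pairs in the same order.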

Upvotes: 2

ALollz

Reputation: 59549

Here's a kinda silly way to do it with some stacking, which might perform better if you have a very large df and need to avoid axis=1.

  • Stack the columns in mask_list and keep the (row index, column name) pairs where the value is 'Position'. Rename each column name to its 'Original' counterpart.
  • Stack the columns in val_list, which hold the values you want. Reindex the result with the pairs from the previous step, so you get back the appropriate value for each row.
  • Bring those values back to the original df, aligned on the original row index.

Here's the code:

import re

mask_list = ['Column_1', 'Column_2', 'Column_3']
val_list = ['Original_1', 'Original_2', 'Original_3']

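# Stack the mask columns once and keep the (row, column) pairs equal to 'Position'
stacked = df[mask_list].stack()
idx = stacked[stacked == 'Position'].index.tolist()
# Map each 'Column_n' label to its matching 'Original_n' label
idx = [(x, re.sub('(.*_)', 'Original_', y)) for x, y in idx]

# Look up those pairs, then drop the column level so the result aligns on the row index
df['Original_Values'] = df[val_list].stack().reindex(idx).droplevel(1)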

df is now:

   Column_1  Column_2  Column_3 ... Original_Values
0  Position     Start     Start ...            Open
1     Start  Position       End ...            Hold
2       End       End  Position ...           Block
3  Position    During    During ...          Bubble

If 'Position' is not found in any of the columns in mask_list, Original_Values becomes NaN for that row. If you need to scale it to more columns, simply add them to mask_list and val_list.
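
To check the NaN behaviour described above, here is a small sketch on a made-up two-row frame (not the data from the question) where the second row has no 'Position' in any mask column. It uses the same stacking steps; the frame contents and the use of droplevel() to get a Series are assumptions for illustration.

```python
import re

import pandas as pd

# Made-up data: row 0 has a match in Column_1, row 1 has no match at all
df = pd.DataFrame({
    'Column_1': ['Position', 'Start'],
    'Original_1': ['Open', 'Barn'],
    'Column_2': ['Start', 'End'],
    'Original_2': ['Sky', 'Hold'],
})

mask_list = ['Column_1', 'Column_2']
val_list = ['Original_1', 'Original_2']

# (row, column) pairs where the mask columns equal 'Position'
stacked = df[mask_list].stack()
idx = stacked[stacked == 'Position'].index.tolist()

# Rename 'Column_n' to 'Original_n' in each pair
idx = [(row, re.sub('(.*_)', 'Original_', col)) for row, col in idx]

# Only row 0 is present in the reindexed result; assignment aligns on the
# row index, so row 1 is filled with NaN
df['Original_Values'] = df[val_list].stack().reindex(idx).droplevel(1)

print(df['Original_Values'])
```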

Upvotes: 0

gbrener

Reputation: 5835

One way to do it, with df.apply():

def choose_orig(row):
    if row['Column_1'] == 'Position':
        return row['Original_1']
    elif row['Column_2'] == 'Position':
        return row['Original_2']
    elif row['Column_3'] == 'Position':
        return row['Original_3']
    return ''

df['Original_Values'] = df.apply(choose_orig, axis=1)

The axis=1 argument to df.apply() causes the choose_orig() function to be called once for each row of the dataframe.

Note that this uses a default value of the empty string, '', when none of the columns match the word 'Position'.
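
If there are more than three column pairs, the if/elif chain can be replaced with a loop. The following is a rough sketch of that idea, not a drop-in replacement: the pairs list and the target and default parameters are names introduced here for illustration, and the 'Latest_*' columns are left out for brevity.

```python
import pandas as pd

df = pd.DataFrame({
    'Column_1': ['Position', 'Start', 'End', 'Position'],
    'Original_1': ['Open', 'Barn', 'Grass', 'Bubble'],
    'Column_2': ['Start', 'Position', 'End', 'During'],
    'Original_2': ['Sky', 'Hold', 'Car', 'House'],
    'Column_3': ['Start', 'End', 'Position', 'During'],
    'Original_3': ['Leave', 'Dog', 'Block', 'Hope'],
})

# Hypothetical list of (mask column, value column) pairs; extend as needed
pairs = [
    ('Column_1', 'Original_1'),
    ('Column_2', 'Original_2'),
    ('Column_3', 'Original_3'),
]

def choose_orig(row, target='Position', default=''):
    # Return the value paired with the first column that equals the target
    for mask_col, val_col in pairs:
        if row[mask_col] == target:
            return row[val_col]
    # No column matched: fall back to the default, as in the version above
    return default

df['Original_Values'] = df.apply(choose_orig, axis=1)

print(df['Original_Values'])
```

As with the if/elif version, the first matching pair wins when a row has 'Position' in more than one column.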

Upvotes: 2
