Reputation: 21
I have the following table:
Option 1 | Option 2 | Option 3 |
---|---|---|
A | X | 1 |
B | X | 1 |
A | Y | 1 |
C | X | 1 |
B | Y | 1 |
I need to split the values of each option on different columns based on the translation as follows: Option 1 values A or B becomes opt1000 with values opt1000-A and opt1000-B Option 1 value C becomes opt1001 with value opt1001-C Etc. as shown below:
Opt1000 | Opt1001 | Opt2000 | Opt2001 | Opt3000 |
---|---|---|---|---|
opt1000-A | NO-opt1001 | opt2000-X | NO-opt2001 | 1 |
opt1000-B | NO-opt1001 | opt2000-X | NO-opt2001 | 1 |
opt1000-A | NO-opt1001 | NO-opt2000 | opt-2001-y | 1 |
NO-opt1000 | opt1001-C | opt2000-X | NO-opt2001 | 1 |
opt-1000-B | NO-opt1001 | NO-opt2000 | opt2001-Y | 1 |
I have a translation file as follows:
Ancient Option name | Ancient Value | New option name | New value |
---|---|---|---|
Option 1 | A | Opt1000 | opt1000-A |
Option 1 | B | Opt1000 | opt1000-B |
Option 1 | C | Opt1001 | opt1001-C |
Option 2 | X | opt2000 | opt2000-X |
Option 2 | Y | opt2001 | opt2001-Y |
Does anyone have an idea of how I could automate this to be performant? (My table has 20 columns and a 1000 lines).
I was thinking of first extracting unique option values to create a new dataframe with the new column names. (By checking the new option name in the translation table and adding the name to a set). Then filling the data frame with 0 (for example, or NaN), then placing the new values one by one in the new dataframe by scanning each value in the original dataframe. When all values are translated, the remaining 0 or NaN will bere placed by NO_new-option-name_.
Let me know if it is clear or need more details.
Upvotes: 2
Views: 49
Reputation: 294258
The gist of this answer is that I'm going to go column by column and create dummy variables. I'll get a little fancy in how I fill out the text with Numpy array slicing.
from functools import reduce
import pandas as pd
import numpy as np
translator = {'A': '000', 'B': '000', 'C': '001',
'X': '000', 'Y': '001', 1: '000', '1': '000'}
optifier = lambda name: name.replace('Option ', 'opt')
df_ = df.rename(columns=optifier)
def f(s):
col = s.name
translated = s.map(translator)
mask = pd.get_dummies(translated).eq(1)
a = np.empty(mask.shape, object)
i0, j0 = np.where(mask) # Indices where to drop option text
i1, j1 = np.where(~mask) # Indices where to drop absence of option
a[i0, j0] = col+translated+'-'+s.astype(str)
cols = col + mask.columns
a[i1, j1] = 'NO-'+cols[j1]
return pd.DataFrame(a, mask.index, cols.str.title())
reduce(pd.DataFrame.join, map(f, map(df_.get, df_)))
Opt1000 Opt1001 Opt2000 Opt2001 Opt3000
0 opt1000-A NO-opt1001 opt2000-X NO-opt2001 opt3000-1
1 opt1000-B NO-opt1001 opt2000-X NO-opt2001 opt3000-1
2 opt1000-A NO-opt1001 NO-opt2000 opt2001-Y opt3000-1
3 NO-opt1000 opt1001-C opt2000-X NO-opt2001 opt3000-1
4 opt1000-B NO-opt1001 NO-opt2000 opt2001-Y opt3000-1
Upvotes: 1