Hobbit36
Hobbit36

Reputation: 275

Creating new columns in a DataFrame

I have a DataFrame with a few columns:

   'a'  'b'  'c'  'd'
0  'x'   3    3    5
1  'y'   2    3    6
2  'z'   1    4    1

I want to create a few new columns which depend on the data. For each possible value in column 'a' I want two new columns (I have a list of all the different values in column 'a'. There are only a few). There are two conditions for each column: For the first new column, column 'a' needs to equal the desired value (such as 'x') and column 'b' equals column 'c'. For the second new one, column 'a' still needs to equal the desired value but column 'b' needs to equals column 'd' (column 'b' will either equal 'c' or 'd'). If both conditions are met the new column would get a 1, if not it'll get a 0.

Here's how it'll look with the above example DataFrame, given that:

a. The desired value for columns 'e' & 'f' are 'x'

b. The desired value for columns 'g' & 'h' are 'y'

c. The desired value for columns 'j' & 'k' are 'z'

d. Columns 'e', 'g', 'h' are when columns 'b' and 'c' are equal

e. Columns 'f', 'h', 'k' are when columns 'b' and 'd' are equal

   'a'  'b'  'c'  'd'  'e'  'f'  'g'  'h'  'j'  'k'
0  'x'   3    3    5    1    0    0    0    0    0
1  'y'   2    3    6    0    0    0    0    0    0 
2  'z'   1    4    1    0    0    0    0    0    1

I tried using the apply function for each example. Here's for when we want to test for 'x' and that column 'b' and 'c' are equal:

data['d']= data.apply(lambda row: assignEvent(row, 'x', row['c']), axis=1 )

With the assignEvent function here:

def assignEvent(row, event, venue):
    """
    :param event: the desired event we're looking for
    :param venue: Either column 'c' or 'd' 
    """

    if (str(row['a'])==event) & (str(venue)==str(row['b'])):
            return 1
    else:
            return 0

It's not working though as when I'm done all the values in the new columns are 0. I'm not sure why though because I tested it and I know that I'm getting into the if statement in my function.

Upvotes: 3

Views: 569

Answers (2)

EFT
EFT

Reputation: 2369

Method:

I'm going to present an approach that avoids using apply, for better speed & scaling. It appears that you're essentially aiming to add columns giving two different sets of indicator variables for the entries in data['a'] depending on the condition you outline in your question. If this is incorrect, and only a subset of the values of column a should get indicators, see addendum.

Getting indicator variables is straightforward:

dummies = pd.get_dummies(data['a'])
dummies
Out[335]: 
   'x'  'y'  'z'
0    1    0    0
1    0    1    0
2    0    0    1

Identifying rows where the conditions are true is also easy, shown here using numpy.where:

np.where(data['b'] == data['c'], 1, 0)

To combine these, we can use matrix multiplication, after playing with the formatting of the np.where output a bit:

np.array([np.where(data['b'] == data['c'], 1, 0)]).T*dummies
Out[338]: 
   'x'  'y'  'z'
0    1    0    0
1    0    0    0
2    0    0    0

To do this for both conditions, join it with the original data, and format it how you specified, I'll skip ahead to the following:

def col_a_dummies(data):
    dummies = pd.get_dummies(data['a'])
    b_c = np.array([np.where(data['b'] == data['c'], 1, 0)]).T*dummies
    b_d = np.array([np.where(data['b'] == data['d'], 1, 0)]).T*dummies
    return pd.concat([data[['a', 'b', 'c', 'd']], b_c, b_d], axis=1)

def format_dummies(dummies):
    dummies.columns = ['a', 'b', 'c', 'd', 'e', 'g', 'j', 'f', 'h', 'k']
    return dummies.sort_index(axis=1)

data = format_dummies(col_a_dummies(data))
data
Out[362]: 
     a  b  c  d  e  f  g  h  j  k
0  'x'  3  3  5  1  0  0  0  0  0
1  'y'  2  3  6  0  0  0  0  0  0
2  'z'  1  4  1  0  0  0  0  0  1

Addendum: This method still largely works if the dataframe is first filtered before being fed to get_dummies. This does introduce the additional constraint of needing a unique index on data.

def filtered_col_a_dummies(data, values):
    filtered = data[data['a'].isin(values)]
    dummies = pd.get_dummies(filtered['a'])
    b_c = np.array([np.where(filtered['b'] == filtered['c'], 1, 0)]).T*dummies
    b_d = np.array([np.where(filtered['b'] == filtered['d'], 1, 0)]).T*dummies
    return pd.concat([data[['a', 'b', 'c', 'd']], b_c, b_d], axis=1).fillna(0)

%timeit Results

On three rows this is already faster:

def assignEvent(row, event, venue):
    """
    :param event: the desired event we're looking for
    :param venue: Either column 'c' or 'd' 
    """

    if (row['a']==event) & (row[venue]==row['b']):
            return 1
    else:
            return 0

def no_sort_format_dummies(dummies):
    dummies.columns = ['a', 'b', 'c', 'd', 'e', 'g', 'j', 'f', 'h', 'k']
    return dummies

%timeit data.apply(lambda row: assignEvent(row, "'x'", 'c'), axis=1)
1000 loops, best of 3: 467 µs per loop
# needs to be repeated six times in total, total time 2.80 ms, ignoring assignment

%timeit format_dummies(col_a_dummies(data))
100 loops, best of 3: 2.58 ms per loop

or

%timeit no_sort_format_dummies(col_a_dummies(data))
100 loops, best of 3: 2.07 ms per loop

if not sorting the columns.

If filtered:

%timeit format_dummies(filtered_col_a_dummies(data, ("'x'", "'y'", "'z'")))
100 loops, best of 3: 3.92 ms per loop

On 300 rows it becomes more pronounced:

%timeit data.apply(lambda row: assignEvent(row, "'x'", 'c'), axis=1)
100 loops, best of 3: 10.9 ms per loop

%timeit format_dummies(col_a_dummies(data))
100 loops, best of 3: 2.73 ms per loop

%timeit no_sort_format_dummies(col_a_dummies(data))
100 loops, best of 3: 2.14 ms per loop

%timeit format_dummies(filtered_col_a_dummies(data, ("'x'", "'y'", "'z'")))
100 loops, best of 3: 4.04 ms per loop

Upvotes: 1

Scott Boston
Scott Boston

Reputation: 153460

I changed a couple of things. First off, your data for column a has quotes, so I strip those with replace in the assignEvent function. Secondly, I pass only the column name as a parameter for venue instead, let's access that column in the function.

def assignEvent(row, event, venue):
    """
    :param event: the desired event we're looking for
    :param venue: Either column 'c' or 'd' 
    """

    if (row['a'].replace("'","")==event) & (row[venue]==row['b']):
            return 1
    else:
            return 0

df['dd']= df.apply(lambda row: assignEvent(row, 'x', 'c'), axis=1 )

Output:

     a  b  c  d  dd
0  'x'  3  3  5   1
1  'y'  2  3  6   0
2  'z'  1  4  1   0

Upvotes: 1

Related Questions