Murchak
Murchak

Reputation: 195

Grouping Pandas dataframe based on conditions?

I am following the suggestions here pandas create new column based on values from other columns but still getting an error. Basically, my Pandas dataframe has many columns and I want to group the dataframe based on a new categorical column whose value depends on two existing columns (AMP, Time).

df
df['Time'] = pd.to_datetime(df['Time']) 
#making sure Time column read from the csv file is time object

import datetime as dt
day_1 = dt.date.today()
day_2 = dt.date.today() - dt.timedelta(days = 1)

def f(row):

    if (row['AMP'] > 100) & (row['Time'] > day_1):
        val = 'new_positives'

    elif (row['AMP'] > 100) & (day_2 <= row['Time'] <= day_1):
        val = 'rec_positives'

    elif (row['AMP'] > 100 & row['Time'] < day_2):
        val = 'old_positives'

    else:
        val = 'old_negatives'

    return val

df['GRP'] = df.apply(f, axis=1) #this gives the following error:
TypeError: ("Cannot compare type 'Timestamp' with type 'date'", 'occurred at index 0')

df[(df['AMP'] > 100) & (df['Time'] > day_1)]  #this works fine

df[(df['AMP'] > 100) & (day_2 <= df['Time'] <= day_1)]  #this works fine

df[(df['AMP'] > 100) & (df['Time'] < day_2)]  #this works fine


#df = df.groupby('GRP')  

I am able to select the proper sub-dataframes based on the conditions specified above, but when I apply the above function on each row, I get the error. What is the correct approach to group the dataframe based on the conditions listed?

EDIT:

Unforunately, I cannot provide a sample of my dataframe. However, here is simple dataframe that gives an error of the same type:

import numpy as np
import pandas as pd
mydf = pd.DataFrame({'a':np.arange(10),
   'b':np.random.rand(10)})

def f1(row):
    if row['a'] < 5 & row['b'] < 0.5:
        value = 'less'
    elif row['a'] < 5 & row['b'] > 0.5:
        value = 'more'
    else:
        value = 'same'
    return value

mydf['GRP'] = mydf.apply(f1, axis=1)

ypeError: ("unsupported operand type(s) for &: 'int' and 'float'", 'occurred at index 0')

EDIT 2: As suggested below, enclosing the comparison operator with parentheses did the trick for the cooked up example. This problem is solved.

However, I am still getting the same error in my my real example. By the way, if I were to use the column 'AMP' with perhaps another column in my table, then everything works and I am able to create df['GRP'] by applying the function f to each row. This shows the problem is related to using df['Time']. But then why am I able to select df[(df['AMP'] > 100) & (df['Time'] > day_1)]? Why would this work in this context, but not when the condition appears in a function?

Upvotes: 3

Views: 1683

Answers (4)

user9799449
user9799449

Reputation:

You have a excelent example here, it is very useful and you could apply filters after groupby. It is a way without using mask.

def get_letter_type(letter):
   if letter.lower() in 'aeiou':
       return 'vowel'
   else:
       return 'consonant'


In [6]: grouped = df.groupby(get_letter_type, axis=1)

https://pandas.pydata.org/pandas-docs/version/0.22/groupby.html

Upvotes: 0

Peter Leimbigler
Peter Leimbigler

Reputation: 11105

Based on your error message and example, there are two things to fix. One is to adjust parentheses for operator precedence in your final elif statement. The other is to avoid mixing datetime.date and Timestamp objects.

Fix 1: change this:

elif (row['AMP'] > 100 & row['Time'] < day_2):

to this:

elif (row['AMP'] > 100) & (row['Time'] < day_2):

These two lines are different because the bitwise & operator takes precedence over the < and > comparison operators, so python attempts to evaluate 100 & row['Time']. A full list of Python operator precedence is here: https://docs.python.org/3/reference/expressions.html#operator-precedence

Fix 2: Change these 3 lines:

import datetime as dt
day_1 = dt.date.today()
day_2 = dt.date.today() - dt.timedelta(days = 1)

to these 2 lines:

day1 = pd.to_datetime('today')
day_2 = day_1 - pd.DateOffset(days=1)

Upvotes: 2

edesz
edesz

Reputation: 12406

If you don't need to use a custom function, then you can use multiple masks (somewhat similar to this SO post)

For the Time column, I used this code. It may be that you were trying to compare Time column values that did not have the required dtype (??? this is my guess)

import datetime as dt
mydf['Time'] = pd.date_range(start='10/14/2018', end=dt.date.today())
day_1 = pd.to_datetime(dt.date.today())
day_2 = day_1 - pd.DateOffset(days = 1)

Here is the raw data

mydf

   a         b       Time
0  0  0.550149 2018-10-14
1  1  0.889209 2018-10-15
2  2  0.845740 2018-10-16
3  3  0.340310 2018-10-17
4  4  0.613575 2018-10-18
5  5  0.229802 2018-10-19
6  6  0.013724 2018-10-20
7  7  0.810413 2018-10-21
8  8  0.897373 2018-10-22
9  9  0.175050 2018-10-23

One approach involves using masks for columns

# Append new column
mydf['GRP'] = 'same'
# Use masks to change values in new column
mydf.loc[(mydf['a'] < 5) & (mydf['b'] < 0.5) & (mydf['Time'] < day_2), 'GRP'] = 'less'
mydf.loc[(mydf['a'] < 5) & (mydf['b'] > 0.5) & (mydf['Time'] > day_1), 'GRP'] = 'more'
mydf

   a         b       Time   GRP
0  0  0.550149 2018-10-14  same
1  1  0.889209 2018-10-15  same
2  2  0.845740 2018-10-16  same
3  3  0.340310 2018-10-17  less
4  4  0.613575 2018-10-18  same
5  5  0.229802 2018-10-19  same
6  6  0.013724 2018-10-20  same
7  7  0.810413 2018-10-21  same
8  8  0.897373 2018-10-22  same
9  9  0.175050 2018-10-23  same

Another approach is to set a, b and Time as a multi-index and use index-based masks to set values

mydf.set_index(['a','b','Time'], inplace=True)

# Get Index level values
a = mydf.index.get_level_values('a')
b = mydf.index.get_level_values('b')
t = mydf.index.get_level_values('Time')

# Apply index-based masks
mydf['GRP'] = 'same'
mydf.loc[(a < 5) & (b < 0.5) & (t < day_2), 'GRP'] = 'less'
mydf.loc[(a < 5) & (b > 0.5) & (t > day_1), 'GRP'] = 'more'
mydf.reset_index(drop=False, inplace=True)
mydf

   a         b       Time   GRP
0  0  0.550149 2018-10-14  same
1  1  0.889209 2018-10-15  same
2  2  0.845740 2018-10-16  same
3  3  0.340310 2018-10-17  less
4  4  0.613575 2018-10-18  same
5  5  0.229802 2018-10-19  same
6  6  0.013724 2018-10-20  same
7  7  0.810413 2018-10-21  same
8  8  0.897373 2018-10-22  same
9  9  0.175050 2018-10-23  same

Source to filter by datetime and create a range of dates.

Upvotes: 1

Xukrao
Xukrao

Reputation: 8634

Some parentheses need to be added in the if-statements:

import numpy as np
import pandas as pd

mydf = pd.DataFrame({'a':np.arange(10),
   'b':np.random.rand(10)})

def f1(row):
    if (row['a'] < 5) & (row['b'] < 0.5):
        value = 'less'
    elif (row['a'] < 5) & (row['b'] > 0.5):
        value = 'more'
    else:
        value = 'same'
    return value

mydf['GRP'] = mydf.apply(f1, axis=1)

Upvotes: 1

Related Questions