Reputation: 195
I am following the suggestions here pandas create new column based on values from other columns but still getting an error. Basically, my Pandas dataframe has many columns and I want to group the dataframe based on a new categorical column whose value depends on two existing columns (AMP, Time).
df
df['Time'] = pd.to_datetime(df['Time'])
#making sure Time column read from the csv file is time object
import datetime as dt
day_1 = dt.date.today()
day_2 = dt.date.today() - dt.timedelta(days = 1)
def f(row):
if (row['AMP'] > 100) & (row['Time'] > day_1):
val = 'new_positives'
elif (row['AMP'] > 100) & (day_2 <= row['Time'] <= day_1):
val = 'rec_positives'
elif (row['AMP'] > 100 & row['Time'] < day_2):
val = 'old_positives'
else:
val = 'old_negatives'
return val
df['GRP'] = df.apply(f, axis=1) #this gives the following error:
TypeError: ("Cannot compare type 'Timestamp' with type 'date'", 'occurred at index 0')
df[(df['AMP'] > 100) & (df['Time'] > day_1)] #this works fine
df[(df['AMP'] > 100) & (day_2 <= df['Time'] <= day_1)] #this works fine
df[(df['AMP'] > 100) & (df['Time'] < day_2)] #this works fine
#df = df.groupby('GRP')
I am able to select the proper sub-dataframes based on the conditions specified above, but when I apply the above function on each row, I get the error. What is the correct approach to group the dataframe based on the conditions listed?
EDIT:
Unforunately, I cannot provide a sample of my dataframe. However, here is simple dataframe that gives an error of the same type:
import numpy as np
import pandas as pd
mydf = pd.DataFrame({'a':np.arange(10),
'b':np.random.rand(10)})
def f1(row):
if row['a'] < 5 & row['b'] < 0.5:
value = 'less'
elif row['a'] < 5 & row['b'] > 0.5:
value = 'more'
else:
value = 'same'
return value
mydf['GRP'] = mydf.apply(f1, axis=1)
ypeError: ("unsupported operand type(s) for &: 'int' and 'float'", 'occurred at index 0')
EDIT 2: As suggested below, enclosing the comparison operator with parentheses did the trick for the cooked up example. This problem is solved.
However, I am still getting the same error in my my real example. By the way, if I were to use the column 'AMP' with perhaps another column in my table, then everything works and I am able to create df['GRP'] by applying the function f to each row. This shows the problem is related to using df['Time']. But then why am I able to select df[(df['AMP'] > 100) & (df['Time'] > day_1)]? Why would this work in this context, but not when the condition appears in a function?
Upvotes: 3
Views: 1683
Reputation:
You have a excelent example here, it is very useful and you could apply filters after groupby. It is a way without using mask.
def get_letter_type(letter):
if letter.lower() in 'aeiou':
return 'vowel'
else:
return 'consonant'
In [6]: grouped = df.groupby(get_letter_type, axis=1)
https://pandas.pydata.org/pandas-docs/version/0.22/groupby.html
Upvotes: 0
Reputation: 11105
Based on your error message and example, there are two things to fix. One is to adjust parentheses for operator precedence in your final elif
statement. The other is to avoid mixing datetime.date
and Timestamp
objects.
Fix 1: change this:
elif (row['AMP'] > 100 & row['Time'] < day_2):
to this:
elif (row['AMP'] > 100) & (row['Time'] < day_2):
These two lines are different because the bitwise &
operator takes precedence over the <
and >
comparison operators, so python attempts to evaluate 100 & row['Time']
. A full list of Python operator precedence is here: https://docs.python.org/3/reference/expressions.html#operator-precedence
Fix 2: Change these 3 lines:
import datetime as dt
day_1 = dt.date.today()
day_2 = dt.date.today() - dt.timedelta(days = 1)
to these 2 lines:
day1 = pd.to_datetime('today')
day_2 = day_1 - pd.DateOffset(days=1)
Upvotes: 2
Reputation: 12406
If you don't need to use a custom function, then you can use multiple masks (somewhat similar to this SO post)
For the Time column
, I used this code. It may be that you were trying to compare Time
column values that did not have the required dtype
(??? this is my guess)
import datetime as dt
mydf['Time'] = pd.date_range(start='10/14/2018', end=dt.date.today())
day_1 = pd.to_datetime(dt.date.today())
day_2 = day_1 - pd.DateOffset(days = 1)
Here is the raw data
mydf
a b Time
0 0 0.550149 2018-10-14
1 1 0.889209 2018-10-15
2 2 0.845740 2018-10-16
3 3 0.340310 2018-10-17
4 4 0.613575 2018-10-18
5 5 0.229802 2018-10-19
6 6 0.013724 2018-10-20
7 7 0.810413 2018-10-21
8 8 0.897373 2018-10-22
9 9 0.175050 2018-10-23
One approach involves using masks for columns
# Append new column
mydf['GRP'] = 'same'
# Use masks to change values in new column
mydf.loc[(mydf['a'] < 5) & (mydf['b'] < 0.5) & (mydf['Time'] < day_2), 'GRP'] = 'less'
mydf.loc[(mydf['a'] < 5) & (mydf['b'] > 0.5) & (mydf['Time'] > day_1), 'GRP'] = 'more'
mydf
a b Time GRP
0 0 0.550149 2018-10-14 same
1 1 0.889209 2018-10-15 same
2 2 0.845740 2018-10-16 same
3 3 0.340310 2018-10-17 less
4 4 0.613575 2018-10-18 same
5 5 0.229802 2018-10-19 same
6 6 0.013724 2018-10-20 same
7 7 0.810413 2018-10-21 same
8 8 0.897373 2018-10-22 same
9 9 0.175050 2018-10-23 same
Another approach is to set a
, b
and Time
as a multi-index and use index-based masks to set values
mydf.set_index(['a','b','Time'], inplace=True)
# Get Index level values
a = mydf.index.get_level_values('a')
b = mydf.index.get_level_values('b')
t = mydf.index.get_level_values('Time')
# Apply index-based masks
mydf['GRP'] = 'same'
mydf.loc[(a < 5) & (b < 0.5) & (t < day_2), 'GRP'] = 'less'
mydf.loc[(a < 5) & (b > 0.5) & (t > day_1), 'GRP'] = 'more'
mydf.reset_index(drop=False, inplace=True)
mydf
a b Time GRP
0 0 0.550149 2018-10-14 same
1 1 0.889209 2018-10-15 same
2 2 0.845740 2018-10-16 same
3 3 0.340310 2018-10-17 less
4 4 0.613575 2018-10-18 same
5 5 0.229802 2018-10-19 same
6 6 0.013724 2018-10-20 same
7 7 0.810413 2018-10-21 same
8 8 0.897373 2018-10-22 same
9 9 0.175050 2018-10-23 same
Source to filter by datetime and create a range of dates.
Upvotes: 1
Reputation: 8634
Some parentheses need to be added in the if-statements:
import numpy as np
import pandas as pd
mydf = pd.DataFrame({'a':np.arange(10),
'b':np.random.rand(10)})
def f1(row):
if (row['a'] < 5) & (row['b'] < 0.5):
value = 'less'
elif (row['a'] < 5) & (row['b'] > 0.5):
value = 'more'
else:
value = 'same'
return value
mydf['GRP'] = mydf.apply(f1, axis=1)
Upvotes: 1