Reputation: 57
I have a column of a dataframe with values for customer 'Length of Relationship'. I want to convert these values to numbers from 1-4 based on if they are below average for terminated relationship length, above average, above 1 standard deviation, and above 2 standard deviations. Without looping through using a for loop, is there an easier/faster way to do this?
Here is my code so far:
average = terminatedDf['Relationship Length'].mean()
standardDeviation = terminatedDf['Relationship Length'].std()
lorScores = {np.arange(0, average): 1, np.arange(average, standardDeviation): 2, np.arange(standardDeviation, standardDeviation*2): 3, np.arange(standardDeviation*2, 150): 4}
reportDf['Length of Relationship Score'] = reportDf['Relationship Length'].map(lorScores)
My problem is that numpy arrays are not hashable, but using the regular range function only allows for integers.
I suppose I could loop through the dataframe given that it is only ~ 1500 rows like this:
for row in reportDf:
if row[5] < average:
row[15] = 1
else:
....
I'm not sure even if I got the dictionary to work that .map would be any more efficient than the for loop. Is there a better way to do this? My intuition tells me this may just be an inefficient task to begin with. Here's a sample of what the Dataframe looks like, but it is actually downloaded from a salesforce API.
reportDf = ({'Owner': ['Bob', 'Jane', 'Alice', 'Fred'],
'Name': ['John', 'Johnny', 'Suzie', 'Larry']
'Relationship Length': [0.78, 0.73, 19.36, 7.36]
})
Average length is about 6.3 and standard deviation is about 3.4
Upvotes: 3
Views: 1014
Reputation: 35676
Try pd.cut
:
import pandas as pd
reportDf = pd.DataFrame({
'Owner': ['Bob', 'Jane', 'Alice', 'Fred'],
'Name': ['John', 'Johnny', 'Suzie', 'Larry'],
'Relationship Length': [0.78, 0.73, 19.36, 7.36]
})
average = reportDf['Relationship Length'].mean()
standardDeviation = reportDf['Relationship Length'].std()
bins = [0, average, average + standardDeviation,
average + (standardDeviation * 2), 150]
labels = [1, 2, 3, 4]
reportDf['Length of Relationship Score'] = pd.cut(
reportDf['Relationship Length'],
bins=bins,
labels=labels,
# right=False means upperbound non-inclusive
# In the question you have row[5] < average
# So this is set to use strictly less than
# Remove if this is not the desired behaviour
right=False
)
print(reportDf)
reportDF
:
Owner Name Relationship Length Length of Relationship Score
0 Bob John 0.78 1
1 Jane Johnny 0.73 1
2 Alice Suzie 19.36 3
3 Fred Larry 7.36 2
Upvotes: 1
Reputation: 36590
I would create function for that and use pandas.Series.apply
following way
def get_score(x):
if x <= average:
return 1
if average < x <= standardDeviation:
return 2
if standardDeviation < x <= standardDeviation*2:
return 3
if standardDeviation*2 < x <= 150:
return 4
reportDf['Length of Relationship Score'] = reportDf['Relationship Length'].apply(get_score)
I do not know if it will be faster, but if your have 1500 rows or so, ultimate optimization might be not worth effort needed to implement it.
Edit: I found that there is pandas.cut
function made exactly for this task, lets say I want to divide numbers so 0...9 are "few", 10...99 are "tens" and rest undefined:
import pandas as pd
df = pd.DataFrame({'x':[1,9,10,17,45,99,100,121]})
df['category'] = pd.cut(df['x'], bins=[0,9,99], labels=['few','tens'])
print(df)
output
x category
0 1 few
1 9 few
2 10 tens
3 17 tens
4 45 tens
5 99 tens
6 100 NaN
7 121 NaN
Upvotes: 2
Reputation: 196
Idk if you can do this with no loop at all, tbh.
But you should basically never go for raw loops like for row in reportDf:
! They end up being super slow and there is always a better solution. E.g. iteritems() and list comprehensions. As this have multiple 'if' conditions I don't think it can be done in LC.
scores = []
for index, duration in terminatedDf['Relationship Length'].iteritems():
if duration < average:
scores.append(1)
elif average < duration < average + standardDeviation:
scores.append(2)
elif average + standardDeviation < duration < average + 2*standardDeviation:
scores.append(3)
else:
scores.append(4)
terminatedDf["Scores"] = scores
Upvotes: 0