Cole1998
Cole1998

Reputation: 57

Is there an efficient/best way to assign scores to values in a pandas dataframe column?

I have a column of a dataframe with values for customer 'Length of Relationship'. I want to convert these values to numbers from 1-4 based on if they are below average for terminated relationship length, above average, above 1 standard deviation, and above 2 standard deviations. Without looping through using a for loop, is there an easier/faster way to do this?

Here is my code so far:

average = terminatedDf['Relationship Length'].mean()

standardDeviation = terminatedDf['Relationship Length'].std()

lorScores = {np.arange(0, average): 1, np.arange(average, standardDeviation): 2, np.arange(standardDeviation, standardDeviation*2): 3, np.arange(standardDeviation*2, 150): 4}

reportDf['Length of Relationship Score'] = reportDf['Relationship Length'].map(lorScores)

My problem is that numpy arrays are not hashable, but using the regular range function only allows for integers.

I suppose I could loop through the dataframe given that it is only ~ 1500 rows like this:

for row in reportDf:
    if row[5] < average: 
        row[15] = 1
    else:
     ....

I'm not sure even if I got the dictionary to work that .map would be any more efficient than the for loop. Is there a better way to do this? My intuition tells me this may just be an inefficient task to begin with. Here's a sample of what the Dataframe looks like, but it is actually downloaded from a salesforce API.

reportDf = ({'Owner': ['Bob', 'Jane', 'Alice', 'Fred'], 
             'Name': ['John', 'Johnny', 'Suzie', 'Larry']
             'Relationship Length': [0.78, 0.73, 19.36, 7.36]
             })

Average length is about 6.3 and standard deviation is about 3.4

Upvotes: 3

Views: 1014

Answers (3)

Henry Ecker
Henry Ecker

Reputation: 35676

Try pd.cut:

import pandas as pd

reportDf = pd.DataFrame({
    'Owner': ['Bob', 'Jane', 'Alice', 'Fred'],
    'Name': ['John', 'Johnny', 'Suzie', 'Larry'],
    'Relationship Length': [0.78, 0.73, 19.36, 7.36]
})

average = reportDf['Relationship Length'].mean()

standardDeviation = reportDf['Relationship Length'].std()

bins = [0, average, average + standardDeviation,
        average + (standardDeviation * 2), 150]
labels = [1, 2, 3, 4]

reportDf['Length of Relationship Score'] = pd.cut(
    reportDf['Relationship Length'],
    bins=bins,
    labels=labels,
    # right=False means upperbound non-inclusive
    # In the question you have row[5] < average
    # So this is set to use strictly less than
    # Remove if this is not the desired behaviour
    right=False
)

print(reportDf)

reportDF:

   Owner    Name  Relationship Length Length of Relationship Score
0    Bob    John                 0.78                            1
1   Jane  Johnny                 0.73                            1
2  Alice   Suzie                19.36                            3
3   Fred   Larry                 7.36                            2

Upvotes: 1

Daweo
Daweo

Reputation: 36590

I would create function for that and use pandas.Series.apply following way

def get_score(x):
    if x <= average:
         return 1
    if average < x <= standardDeviation:
         return 2
    if standardDeviation < x <= standardDeviation*2:
         return 3
    if standardDeviation*2 < x <= 150:
         return 4

reportDf['Length of Relationship Score'] = reportDf['Relationship Length'].apply(get_score)

I do not know if it will be faster, but if your have 1500 rows or so, ultimate optimization might be not worth effort needed to implement it.

Edit: I found that there is pandas.cut function made exactly for this task, lets say I want to divide numbers so 0...9 are "few", 10...99 are "tens" and rest undefined:

import pandas as pd
df = pd.DataFrame({'x':[1,9,10,17,45,99,100,121]})
df['category'] = pd.cut(df['x'], bins=[0,9,99], labels=['few','tens'])
print(df)

output

     x category
0    1      few
1    9      few
2   10     tens
3   17     tens
4   45     tens
5   99     tens
6  100      NaN
7  121      NaN

Upvotes: 2

Pedro Kaneto Suzuki
Pedro Kaneto Suzuki

Reputation: 196

Idk if you can do this with no loop at all, tbh.

But you should basically never go for raw loops like for row in reportDf:! They end up being super slow and there is always a better solution. E.g. iteritems() and list comprehensions. As this have multiple 'if' conditions I don't think it can be done in LC.

scores = []
for index, duration in terminatedDf['Relationship Length'].iteritems():
   if duration < average:
      scores.append(1)
   elif average < duration < average + standardDeviation:
      scores.append(2)
   elif average + standardDeviation < duration < average + 2*standardDeviation:
      scores.append(3)
   else:
      scores.append(4)

terminatedDf["Scores"] = scores

Upvotes: 0

Related Questions