Reputation: 29

Creating a function to iterate through DataFrame

I am running into an issue creating a function that will recognize if a particular value in a column is between two values.

def bid(x):
    if df['tla'] < 85000:
        return 1
    elif (df['tla'] >= 85000) & (df['tla'] < 110000):
        return 2
    elif (df['tla'] >= 111000) & (df['tla'] < 126000):
        return 3
    elif (df['tla'] >= 126000) & (df['tla'] < 150000):
        return 4
    elif (df['tla'] >= 150000) & (df['tla'] < 175000):
        return 5
    elif (df['tla'] >= 175000) & (df['tla'] < 200000):
        return 6
    elif (df['tla'] >= 200000) & (df['tla'] < 250000):
        return 7
    elif (df['tla'] >= 250000) & (df['tla'] < 300000):
        return 8
    elif (df['tla'] >= 300000) & (df['tla'] < 375000):
        return 9
    elif (df['tla'] >= 375000) & (df['tla'] < 453100):
        return 10
    elif df['tla'] >= 453100:
        return 11

I apply that to my new column:

df['bid_bucket'] = df['bid_bucket'].apply(bid)

And I am getting this error back:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Anyone have any ideas?

Upvotes: 0

Answers (5)

Naga kiran

Reputation: 4607

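You can use the np.digitize function to assign each value to its range. Pass the tla column and your bin edges, then add 1 so the buckets start at 1:

df['bid_bucket'] = np.digitize(df['tla'], [85000, 110000, 126000, 150000, 175000, 200000, 250000, 300000, 375000, 453100]) + 1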

Example

import numpy as np

a = np.random.randint(85000, 400000, 10)
# e.g. array([305628, 134122, 371486, 119856, 321423, 346906, 319321, 165714, 360896, 206404])
bins = [-np.inf, 85000, 110000, 126000, 150000, 175000,
        200000, 250000, 300000, 375000, 453100, np.inf]
np.digitize(a, bins)

Out:

array([9, 4, 9, 3, 9, 9, 9, 5, 9, 7])
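To see how values that sit exactly on a boundary are handled, here is a minimal sketch on a DataFrame. It assumes the input column is tla, takes the edges from the thresholds in the question (treating 110000 as the intended edge, which is an assumption), and uses made-up sample values.

```python
import numpy as np
import pandas as pd

# Made-up sample values, several of them exactly on a bin edge.
df = pd.DataFrame({'tla': [50000, 85000, 109999, 110000, 453099, 453100]})

# Interior edges only; anything below the first edge gets index 0.
edges = [85000, 110000, 126000, 150000, 175000,
         200000, 250000, 300000, 375000, 453100]

# With the default right=False, edges[i-1] <= x < edges[i] maps to i,
# so adding 1 makes the buckets run from 1 to 11.
df['bid_bucket'] = np.digitize(df['tla'].to_numpy(), edges) + 1
print(df)
```

Each lower edge is inclusive and each upper edge is exclusive, which matches the >= and < comparisons in the question.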

Upvotes: 2

ALollz

Reputation: 59579

This can already be accomplished with pd.cut, defining the bin edges, and adding +1 to the labels to get your numbering to start at 1.

import pandas as pd
import numpy as np
df = pd.DataFrame({'tla': [7, 85000, 111000, 88888, 126000, 51515151]})

df['bid_bucket'] = pd.cut(df.tla, right=False,
                          bins=[-np.inf, 85000, 110000, 126000, 150000, 175000,
                                200000, 250000, 300000, 375000, 453100, np.inf], 
                          labels=False)+1

Output: df

        tla  bid_bucket
0         7           1
1     85000           2
2    111000           3
3     88888           2
4    126000           4
5  51515151          11
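As a variation, the labels can be passed explicitly instead of using labels=False and adding 1. This is only a sketch: the names edges, labels and buckets are mine, and the final cast assumes tla has no missing values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'tla': [7, 85000, 111000, 88888, 126000, 51515151]})

edges = [-np.inf, 85000, 110000, 126000, 150000, 175000,
         200000, 250000, 300000, 375000, 453100, np.inf]

# One label per interval: 11 intervals, labelled 1 to 11.
labels = list(range(1, len(edges)))

# right=False makes each interval closed on the left: [low, high).
buckets = pd.cut(df['tla'], bins=edges, right=False, labels=labels)

# With explicit labels pd.cut returns a categorical column;
# cast to int for plain numbers (assumes no NaN in tla).
df['bid_bucket'] = buckets.astype(int)
print(df)
```

Explicit labels are handy if the bucket names ever stop being consecutive integers.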

Upvotes: 2

Paul-Darius

Reputation: 126

You have two options. Either apply a function that takes a whole row to the DataFrame, row by row:

def function_on_a_row(row):
  if row.tla > ...
    ...

df.apply(function_on_a_row, axis=1)

In that case, keep bid the way you defined it, but rename the parameter x to something like row and replace df with row inside the function so that the parameter name stays meaningful. Then use:

df['bid_bucket'] = df.apply(bid, axis=1)

Or apply a function that takes a single element to a pandas Series:

def function_on_an_elt(element_of_series):
  if element_of_series > ...
    ...

df['new_column'] = df['my_column_of_interest'].apply(function_on_an_elt)

In your case redefine bid accordingly.

Here you tried to mix both approaches, which does not work.
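The two approaches can be sketched side by side as follows. The function names and the shortened three-bucket logic are hypothetical, chosen only to keep the example small; the sample values are made up.

```python
import pandas as pd

df = pd.DataFrame({'tla': [10, 90000, 123456]})  # made-up values

def bucket_from_value(value):
    """Element-wise: receives one number from the column."""
    if value < 85000:
        return 1
    elif value < 110000:
        return 2
    return 3  # hypothetical catch-all; the real function has more branches

def bucket_from_row(row):
    """Row-wise: receives one row, so the value is looked up by column name."""
    return bucket_from_value(row['tla'])

# Series.apply passes each element; DataFrame.apply with axis=1 passes each row.
by_element = df['tla'].apply(bucket_from_value)
by_row = df.apply(bucket_from_row, axis=1)

print(by_element.tolist(), by_row.tolist())
```

Both calls give the same buckets. The element-wise form is usually the simpler choice when only one column is involved.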

Upvotes: 1

David

Reputation: 220

To keep it in pandas: referencing df['tla'] inside your function refers to the whole Series instead of a single value, and that is what leads to the ambiguity. You should pass the specific value instead. You could use a lambda x, so your code could look something like this:

df = pd.DataFrame({'tla':[10,123456,999999]})

def bid(x):
    if x < 85000:
        return 1
    elif (x >= 85000 and x < 110000):
        return 2
    elif (x >= 110000 and x < 126000):
        return 3
    elif (x >= 126000 and x < 150000):
        return 4
    elif (x >= 150000 and x < 175000):
        return 5
    elif (x >= 175000 and x < 200000):
        return 6
    elif (x >= 200000 and x < 250000):
        return 7
    elif (x >= 250000 and x < 300000):
        return 8
    elif (x >= 300000 and x < 375000):
        return 9
    elif (x >= 375000 and x < 453100):
        return 10
    elif x >= 453100:
        return 11

df['bid_bucket'] = df['tla'].apply(lambda x: bid(x))
df
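For reference, the original error can be reproduced in a few lines. This is a small sketch with made-up data; the variable names mask and message are mine.

```python
import pandas as pd

df = pd.DataFrame({'tla': [10, 123456, 999999]})

# Comparing a whole column gives one boolean per row, not a single True/False.
mask = df['tla'] < 85000

message = None
try:
    if mask:  # Python asks the Series for a single truth value here
        pass
except ValueError as err:
    message = str(err)

print(message)

# Reducing the mask explicitly is unambiguous.
print(mask.any(), mask.all())
```

Passing a single value into the function, as above, avoids the problem because each comparison then yields a plain boolean.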

Upvotes: 1

gyx-hh

Reputation: 1431

Try the following, using numpy.select:

import numpy as np

values = [1,2,3,4,5,6,7,8,9,10,11]
cond = [df['tla'] < 85000, (df['tla'] >= 85000) & (df['tla'] < 110000), .... ]

df['bid_bucket'] = np.select(cond, values)
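A complete version might look like the sketch below. It builds the condition list from a list of edges rather than typing each condition by hand. The edges are my reading of the thresholds in the question (with 110000 assumed as the intended edge), and the sample values and default=0 are illustrative choices.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'tla': [7, 85000, 111000, 126000, 51515151]})  # made-up

edges = [85000, 110000, 126000, 150000, 175000,
         200000, 250000, 300000, 375000, 453100]
tla = df['tla']

# First bucket is open below, last bucket is open above.
cond = [tla < edges[0]]
cond += [(tla >= low) & (tla < high) for low, high in zip(edges[:-1], edges[1:])]
cond.append(tla >= edges[-1])

values = list(range(1, len(cond) + 1))  # 1 to 11

# default=0 marks rows that match no condition (for example NaN).
df['bid_bucket'] = np.select(cond, values, default=0)
print(df)
```

np.select picks the value of the first condition that is True for each row, so the conditions do not have to be mutually exclusive, although here they are.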

Upvotes: 3
