Kedar Joshi

Reputation: 1462

Python pandas: Comparison of elements in a DataFrame/Series

I have a DataFrame in a variable called "myDataFrame" that looks like this:

+------+-------+--------+
| Type | Count | Status |
+------+-------+--------+
| a    |    70 |      0 |
| a    |    70 |      0 |
| b    |    70 |      0 |
| c    |    74 |      3 |
| c    |    74 |      2 |
| c    |    74 |      0 |
+------+-------+--------+

I am using a vectorized approach to process the rows in this DataFrame, since it has about 116 million rows.

So I wrote something like this:

myDataFrame['result'] = processDataFrame(myDataFrame['status'], myDataFrame['Count'])

In my function, I am trying to do this:

def processDataFrame(status, count):
    resultsList = list()
    if status == 0:
        resultsList.append(count + 10000)
    else:
        resultsList.append(count - 10000)

    return resultsList

But I get this error when the status values are compared:

Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()

What am I missing?

Upvotes: 0

Views: 52

Answers (2)

Tom

Reputation: 8790

I don't think your function is actually doing the vectorized part.

When it is called, you pass status = myDataFrame['status'], so the first if evaluates myDataFrame['status'] == 0. That expression is a boolean Series (one True/False per row, indicating whether each element of the status column equals 0), so it has no single truth value, hence the error. Likewise, even if the condition could be evaluated, resultsList would just get the whole "Count" column appended as a single element, either all plus 10000 or all minus 10000.
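
A minimal sketch of what happens inside that if, using the status values from the question's table. The names mask and error_message are only for illustration:

```python
import pandas as pd

# Status values copied from the sample table in the question
status = pd.Series([0, 0, 0, 3, 2, 0])

# Comparing a Series with a scalar gives one boolean per row
mask = status == 0
print(mask.tolist())

# `if` needs a single True/False, so pandas raises instead of guessing
try:
    if mask:
        pass
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

The ValueError raised here is the same "truth value of a Series is ambiguous" error reported in the question.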


Edit:

I suppose something like this would work. It keeps your function, but uses pandas' built-in boolean indexing inside it:

def processDataFrame(status, count):
    status_0 = (status == 0)      # boolean mask: True where status is 0
    output = count.copy()         # copy if you don't want to modify in place
    output[status_0] += 10000     # rows with status 0
    output[~status_0] -= 10000    # all other rows
    return output
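
A quick check of this masking approach on the sample data, as a sketch only. The function is restated so the snippet runs on its own, the ±10000 offsets are taken from the question, and the column is assumed to be named "Status" as in the table header:

```python
import pandas as pd

def processDataFrame(status, count):
    status_0 = (status == 0)      # boolean mask, one entry per row
    output = count.copy()         # leave the original column untouched
    output[status_0] += 10000
    output[~status_0] -= 10000
    return output

# Sample data copied from the question's table
myDataFrame = pd.DataFrame({
    "Type": ["a", "a", "b", "c", "c", "c"],
    "Count": [70, 70, 70, 74, 74, 74],
    "Status": [0, 0, 0, 3, 2, 0],
})

myDataFrame["result"] = processDataFrame(myDataFrame["Status"], myDataFrame["Count"])
result_list = myDataFrame["result"].tolist()
print(result_list)
```

The whole columns are passed in once and the result is a Series of the same length, so no Python-level loop over the rows is involved.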

Upvotes: 0

BENY

Reputation: 323236

We can do this without a user-defined function:

import numpy as np

myDataFrame['result'] = np.where(myDataFrame['status'] == 0,
                                 myDataFrame['Count'] + 10000,
                                 myDataFrame['Count'] - 10000)
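
For reference, a self-contained sketch of the same np.where call on the sample data from the question. The column is assumed to be named "Status" as in the table header, and result_list is just an illustrative name:

```python
import numpy as np
import pandas as pd

# Sample data copied from the question's table
myDataFrame = pd.DataFrame({
    "Type": ["a", "a", "b", "c", "c", "c"],
    "Count": [70, 70, 70, 74, 74, 74],
    "Status": [0, 0, 0, 3, 2, 0],
})

# Element-wise choice: Count + 10000 where Status is 0, Count - 10000 elsewhere
myDataFrame["result"] = np.where(myDataFrame["Status"] == 0,
                                 myDataFrame["Count"] + 10000,
                                 myDataFrame["Count"] - 10000)

result_list = myDataFrame["result"].tolist()
print(result_list)
```

Here the result column holds plain integers rather than one-element lists.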

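Update: if you want to keep your original function, you can apply it row by row, so that it receives scalar values:

df.apply(lambda x: processDataFrame(x['Status'], x['Count']), axis=1)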
0    [10070]
1    [10070]
2    [10070]
3    [-9926]
4    [-9926]
5    [10074]
dtype: object

Upvotes: 5
