yanadm
yanadm

Reputation: 707

How to distribute column values to another columns based on condition in pandas

I have a small dataframe. An example is given below.

+-------+---------+---------+------+----------+  
| code  | surplus | deficit | gone | has come |  
+-------+---------+---------+------+----------+
| 0100  |    1000 |       0 |      |          |
| 0103  |       0 |     100 |      |          |
| 0104  |       0 |     600 |      |          |
| 0190  |       0 |       0 |      |          |
| 0191  |       0 |     800 |      |          |
| 0192  |     500 |       0 |      |          |
| 0193  |     700 |       0 |      |          |
| 0194  |       0 |     300 |      |          |
| 0195  |       0 |       0 |      |          |
+-------+---------+---------+------+----------+

I need to distribute the data from the column surplus so as to cover the data in the column deficit. In the column gone write down how much I took from the column surplus, and in the column has come write down how much I added to fill the deficit. From the column surplus I take the data as the values decrease. And always start with the largest deficit.

In my example it will look like this:

To cover the biggest deficit (In the example this is 800) i take part of value 1000.

To cover the next deficit (In the example this is 600) i take the remainder of the value 1000 and I take 400 from the value 700.

To cover the deficit 300 i take the remainder of the value 700.

And, finally, to cover the deficit 100 i take part of value 500.

The result should be the next dataframe:

+------+---------+---------+------+----------+
| code | surplus | deficit | gone | has come |
+------+---------+---------+------+----------+
| 0100 |    1000 |       0 | 1000 |        0 |
| 0103 |       0 |     100 |    0 |      100 |
| 0104 |       0 |     600 |    0 |      600 |
| 0190 |       0 |       0 |    0 |        0 |
| 0191 |       0 |     800 |    0 |      800 |
| 0192 |     500 |       0 |  100 |        0 |
| 0193 |     700 |       0 |  700 |        0 |
| 0194 |       0 |     300 |    0 |      300 |
| 0195 |       0 |       0 |    0 |        0 |
+------+---------+---------+------+----------+

The values in the columns surplus and deficit can be different.

I can't come up an algorithm for this task. I would be grateful for any ideas.

Upvotes: 1

Views: 1786

Answers (2)

Arthur Tacca
Arthur Tacca

Reputation: 9998

I'm afraid I don't know pandas so I can't give the detail of the answer, but here is a general alogrithm that I think would work fine. It's up to you to match this against pandas API.

For the "has come" column:

  • Sort the entries by deficit (descending)
  • Calculate the cumulative sum of deficit entries in this sorted list
  • Bound this by sum(surplus) i.e. create a column max(cumsum(deficit), sum(surplus))
  • Now do the difference of each item with the next one (I think this is the pandas.Series.diff method?), using 0 as the "-1"th entry (maybe you'll have to add a dummy row?). This is your "has come" value

For the "has come" column (if sum(surplus) >= sum(deficit)):

  • In this case you just set "has come" = "deficit" for all rows, which will be faster than the above computation
  • But if you don't check for this case explicitly, the above calculation will still work

For the "gone" column: Just do exactly the same stuff as above, reversing "deficit" and "surplus".

Edit: In your example the gone column is the tricky one, because in that case sum(deficit) < sum(surplus). Here's the above procedure on surplus.

sum(surplus) = 2200
sum(deficit) = 1800

+------+---------+---------+-----------------+-----------------------------------+----------------+
| code | surplus | deficit | cumsum(surplus) | max(cumsum(surplus),sum(deficit)) | diff(prev row) |
+------+---------+---------+-----------------+-----------------------------------+----------------+
| NaN  |       0 |       0 |               0 |                                 0 |           NaN  |
| 0100 |    1000 |       0 |            1000 |                              1000 |           1000 |
| 0193 |     700 |       0 |            1700 |                              1700 |           700  |
| 0192 |     500 |       0 |            2200 |                              1800 |           100  |
| 0191 |       0 |     800 |            2200 |                              1800 |           0    |
| 0103 |       0 |     100 |            2200 |                              1800 |           0    |
| 0104 |       0 |     600 |            2200 |                              1800 |           0    |
| 0190 |       0 |       0 |            2200 |                              1800 |           0    |
| 0194 |       0 |     300 |            2200 |                              1800 |           0    |
| 0195 |       0 |       0 |            2200 |                              1800 |           0    |
+------+---------+---------+-----------------+-----------------------------------+----------------+

The final column is the result you want. Note that I added a dummy row to the start so that I could calculate pairwise differences. It turns out the shift() is the key method you need for calculating that column; see this question

Edit 2: I thought it might be worth adding an alternative solution. It's a bit harder to understand, but it might be a bit easier to implement because you don't need to fiddle with an extra dummy row.

  • As before: Sort the entries by deficit (descending)
  • As before: Calculate the cumulative sum of deficit entries in this sorted list
  • New: Find the index of the first row where the cumulative sum is greater than the sum of the surplus (I don't how easy it is to get this in pandas). Let's call this i (with i=Inf if no such row exists).
  • For all rows before this index (i.e. df[:i]), set "has come" = "deficit"
  • For all rows after this index (i.e. df[i+1:]), set "has come" = 0
  • For that row (i.e. df[i], if i exists), set "has come" to:
    • has come = sum(surplus) - (cumsum(deficit) - deficit)
    • (BTW, (cumsum(deficit) - deficit) is actually equal to cumsum(deficit) of the previous row, or 0 if this is the first row.)

Upvotes: 2

AZJB
AZJB

Reputation: 76

For the question "How to distribute column values to another columns based on condition in pandas?" maybe the function np.where is what you are looking for:

import numpy as np
import pandas as pd 

# df[Column title] = np.where ( condition on this line, if condition true then value to assign, else value to assign)
df["gone"] = np.where((df["surplus"] - df["deficit"]) > 0 , df["surplus"] - df["deficit"] , 0)
df["has come"] = np.where((df["surplus"] - df["deficit"]) < 0 , 0, df["deficit"] - df["surplus"] )

Upvotes: 1

Related Questions