steppermotor

Reputation: 701

Use cumulative sum to assign a value in python/pyspark

Using Python, I'd like to write some code that classifies all rows where the running cumulative sum of the Miles column is <= 2.5 as "IN" and the rest as "OUT". Are there any suggestions on where to start?

Example Data set

Rank  Name  Miles
  1   A     0.5  
  2   A     1
  3   B     1
  4   B     1
  5   C     2

Desired Output

Rank  Name  Miles  Assign
  1   A     0.5     IN
  2   A     1       IN
  3   B     1       IN
  4   B     1       OUT
  5   C     2       OUT

Upvotes: 2

Views: 128

Answers (1)

wjandrea

Reputation: 33179

It looks like you're using Pandas, though I'm not an expert.

If you have a dataframe like this:

   Rank Name  Miles
0     1    A    0.5
1     2    A    1.0
2     3    B    1.0
3     4    B    1.0
4     5    C    2.0

Then you can simply create a new column where the values are based on the cumulative sum of the Miles column:

df['Assign'] = ['IN' if i <= 2.5 else 'OUT' for i in df['Miles'].cumsum()]

Or, I think this is more idiomatic:

df['Assign'] = ['IN' if i else 'OUT' for i in df['Miles'].cumsum() <= 2.5]
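If NumPy is available alongside Pandas (a reasonable assumption, since Pandas depends on it), a fully vectorized alternative is `numpy.where`, which maps the boolean mask directly to the two labels and produces the same `Assign` column:

```python
import numpy as np
import pandas as pd

# Rebuild the example dataframe from the question
df = pd.DataFrame({
    "Rank": [1, 2, 3, 4, 5],
    "Name": ["A", "A", "B", "B", "C"],
    "Miles": [0.5, 1.0, 1.0, 1.0, 2.0],
})

# Cumulative sums are 0.5, 1.5, 2.5, 3.5, 5.5 -> first three rows are "IN"
df["Assign"] = np.where(df["Miles"].cumsum() <= 2.5, "IN", "OUT")
```

This avoids the Python-level loop of the list comprehension, which matters mostly on large frames.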

Which becomes:

   Rank Name  Miles Assign
0     1    A    0.5     IN
1     2    A    1.0     IN
2     3    B    1.0     IN
3     4    B    1.0    OUT
4     5    C    2.0    OUT

Upvotes: 1
