Logan Brooks
Logan Brooks

Reputation: 63

Find longest run of consecutive zeros for each user in dataframe

I'm looking to find the max run of consecutive zeros in a DataFrame with the result grouped by user. I'm interested in running the RLE on usage.

sample input:

user--day--usage
A-----1------0
A-----2------0
A-----3------1
B-----1------0
B-----2------1
B-----3------0

Desired output

user---longest_run
a - - - - 2
b - - - - 1

mydata <- mydata[order(mydata$user, mydata$day),]
user <- unique(mydata$user)
d2 <- data.frame(matrix(NA, ncol = 2, nrow = length(user)))
names(d2) <- c("user", "longest_no_usage")
d2$user <- user
for (i in user) {
  if (0 %in% mydata$usage[mydata$user == i]) {
    run <- rle(mydata$usage[mydata$user == i]) #Run Length Encoding
    d2$longest_no_usage[d2$user == i] <- max(run$length[run$values == 0])
  } else {
    d2$longest_no_usage[d2$user == i] <- 0 #some users did not have no-usage days
  }
}
d2 <- d2[order(-d2$longest_no_usage),]

this works in R but I want to do the same thing in python, I'm totally stumped

Upvotes: 4

Views: 2406

Answers (5)

Chris Coffee
Chris Coffee

Reputation: 89

Here's another, NumPy-centric way:

def max0(x):
    z = np.argwhere(x == 0).flatten()
    if not z.size:
        return 0
    z -= np.arange(len(z))
    return np.unique(z, return_counts=True)[-1].max()

Upvotes: 0

RELW
RELW

Reputation: 189

get max number of consecutive zeros on series:

def max0(sr):
     return (sr != 0).cumsum().value_counts().max() - (0 if (sr != 0).cumsum().value_counts().idxmax()==0 else 1)


max0(pd.Series([1,0,0,0,0,2,3]))

4

Upvotes: 2

The Unfun Cat
The Unfun Cat

Reputation: 32008

If you have a large dataset and speed is essential, you might want to try the high-performance pyrle library.

Setup:

# pip install pyrle
# or 
# conda install -c bioconda pyrle
import numpy as np
np.random.seed(0)
import pandas as pd
from pyrle import Rle
size = int(1e7)
number = np.random.randint(2, size=size)
user = np.random.randint(5, size=size)
df = pd.DataFrame({"User": np.sort(user), "Number": number})
df
#          User  Number
# 0           0       0
# 1           0       1
# 2           0       1
# 3           0       0
# 4           0       1
# ...       ...     ...
# 9999995     4       1
# 9999996     4       1
# 9999997     4       0
# 9999998     4       0
# 9999999     4       1
# 
# [10000000 rows x 2 columns]

Execution:

for u, udf in df.groupby("User"):
    r = Rle(udf.Number)
    is_0 = r.values == 0
    print("User", u, "Max", np.max(r.runs[is_0]))
# (Wall time: 1.41 s)


# User 0 Max 20
# User 1 Max 23
# User 2 Max 20
# User 3 Max 22
# User 4 Max 23

Upvotes: 0

jezrael
jezrael

Reputation: 863631

Use groupby with size by columns user, usage and helper Series for consecutive values first:

print (df)
  user  day  usage
0    A    1      0
1    A    2      0
2    A    3      1
3    B    1      0
4    B    2      1
5    B    3      0
6    C    1      1


df1 = (df.groupby([df['user'], 
                   df['usage'].rename('val'), 
                   df['usage'].ne(df['usage'].shift()).cumsum()])
        .size()
        .to_frame(name='longest_run'))

print (df1)
                longest_run
user val usage             
A    0   1                2
     1   2                1
B    0   3                1
         5                1
     1   4                1
C    1   6                1

Then filter only zero rows, get max and add reindex for append non 0 groups:

df2 = (df1.query('val == 0')
          .max(level=0)
          .reindex(df['user'].unique(), fill_value=0)
          .reset_index())
print (df2)
  user  longest_run
0    A            2
1    B            1
2    C            0

Detail:

print (df['usage'].ne(df['usage'].shift()).cumsum())
0    1
1    1
2    2
3    3
4    4
5    5
6    6
Name: usage, dtype: int32

Upvotes: 4

Florian
Florian

Reputation: 25435

I think the following does what you are looking for, where the consecutive_zero function is an adaptation of the top answer here.

Hope this helps!

import pandas as pd
from itertools import groupby

df = pd.DataFrame([['A', 1], ['A', 0], ['A', 0], ['B', 0],['B',1],['C',2]], 
                  columns=["user", "usage"])

def len_iter(items):
    return sum(1 for _ in items)

def consecutive_zero(data):
    x = list((len_iter(run) for val, run in groupby(data) if val==0))
    if len(x)==0: return 0 
    else: return max(x)

df.groupby('user').apply(lambda x: consecutive_zero(x['usage']))

Output:

user
A    2
B    1
C    0
dtype: int64

Upvotes: 0

Related Questions