How to filter and drop rows based on groups when condition is specific

Question

So I struggled to even come up with a title for this question. Not sure I can edit the question title, but I would be happy to do so once there is clarity.

I have a data set from an experiment where each row is a point in time for a specific group. [Edited based on better approach to generate data by Daniela Vera below]

df = pd.DataFrame({'x1': np.random.randn(30),'time': [1,2,3,4,5,6,7,8,9,10] * 3,'grp': ['c', 'c', 'c','a','a','b','b','c','c','c'] * 3})


df.head(10)

    x1        time  grp
0   0.533131    1   c
1   1.486672    2   c
2   1.560158    3   c
3   -1.076457   4   a
4   -1.835047   5   a
5   -0.374595   6   b
6   -1.301875   7   b
7   -0.533907   8   c
8   0.052951    9   c
9   -0.257982   10  c
10  -0.442044   1   c

In the dataset some people/group only start to have values after time 5. In this case group b. However, in the dataset I am working with there are up to 5,000 groups rather than just the 3 groups in this example.

I would like to be able to identify everyone that only have values that appear after time 5, and drop them from the overall dataframe. I have come up with a solution that works, but I feel like it is very clunky, and wondered if there was something cleaner.

# First I split the data into before and after the time of interest
after = df[df['time'] > 5].copy()    
before = df[df['time'] < 5].copy() 

#Then I merge the two dataframes and use indicator to find out which ones only appear after time 5. 
missing = pd.merge(after,before, on='grp', how='outer', indicator = True)

#Then I use groupby and nunique to identify the groups that only appear after time 5 and save it as 
 an array
something = missing[missing['_merge'] == 'left_only'].groupby('ent_id').nunique()

#I extract the list of group ids from the array
something = something.index

# I go back to my main dataframe and make group id the index
df = df.set_index('grp')

#I then apply .drop on the array of group ids 
    df = df.drop(something)
    df = df.reset_index()

Like I said, super clunky. But I just couldn't figure out an alternative. Please let me know if anything isn't clear and I'll happily edit with more details.

How to filter and drop rows based on groups when condition is specific

Answers (1)

Related Questions