Jameson

Reputation: 187

How to filter and drop rows based on groups when condition is specific

So I struggled to even come up with a title for this question. Not sure I can edit the question title, but I would be happy to do so once there is clarity.

I have a data set from an experiment where each row is a point in time for a specific group. [Edited to generate the data with the better approach suggested by Daniela Varela below.]

import pandas as pd
import numpy as np

df = pd.DataFrame({'x1': np.random.randn(30),
                   'time': [1,2,3,4,5,6,7,8,9,10] * 3,
                   'grp': ['c', 'c', 'c','a','a','b','b','c','c','c'] * 3})


df.head(10)

    x1        time  grp
0   0.533131    1   c
1   1.486672    2   c
2   1.560158    3   c
3   -1.076457   4   a
4   -1.835047   5   a
5   -0.374595   6   b
6   -1.301875   7   b
7   -0.533907   8   c
8   0.052951    9   c
9   -0.257982   10  c

In the dataset some people/groups only start to have values after time 5; in this case, group b. However, in the dataset I am working with there are up to 5,000 groups rather than just the 3 groups in this example.

I would like to identify every group that only has values after time 5 and drop those groups from the overall dataframe. I have come up with a solution that works, but it feels very clunky, and I wondered if there was something cleaner.

# First I split the data into before and after the time of interest
after = df[df['time'] > 5].copy()
before = df[df['time'] < 5].copy()

# Then I merge the two dataframes and use indicator to find out which ones only appear after time 5.
missing = pd.merge(after, before, on='grp', how='outer', indicator=True)

# Then I use groupby and nunique to identify the groups that only appear after time 5 and save it as an array
something = missing[missing['_merge'] == 'left_only'].groupby('grp').nunique()

# I extract the list of group ids from the array
something = something.index

# I go back to my main dataframe and make group id the index
df = df.set_index('grp')

# I then apply .drop on the array of group ids
df = df.drop(something)
df = df.reset_index()

Like I said, super clunky. But I just couldn't figure out an alternative. Please let me know if anything isn't clear and I'll happily edit with more details.

Upvotes: 1

Views: 204

Answers (1)

Daniela Varela

Reputation: 46

I am not sure if I get it, but let's say you have this data:

df = pd.DataFrame({'x1': np.random.randn(30),'time': [1,2,3,4,5,6,7,8,9,10] * 3,'grp': ['c', 'c', 'c','a','a','b','b','c','c','c'] * 3})

In this case, group "b" only has data for times 6 and 7, which are after time 5. You can use this process to get a dictionary with the times at which each group has at least one data point, and also a list called "keep" with the groups that have at least one data point before time 5.

list_groups = ["a", "b", "c"]
times_per_group = {}
keep = []
for group in list_groups:
    # record the times at which this group has data
    times_per_group[group] = list(df[df.grp == group].time.unique())
    # keep the group if it has at least one data point before time 5
    condition = any(i < 5 for i in times_per_group[group])
    if condition:
        keep.append(group)

Finally, you just keep the groups present in the list "keep":

df = df[df.grp.isin(keep)]
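
With the sample data above, group "b" is the only group without a data point before time 5, so (purely as an illustrative check; the x1 values are random but the grouping is fixed) the remaining groups should be "a" and "c":

# sanity check: group "b" should have been dropped
sorted(df.grp.unique())   # ['a', 'c']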

Let me know if I understood your question! Of course you can simplify the process; the dictionary is just there to check, so you don't actually need the whole code. If this result is what you're looking for, you can just do:

keep = [group for group in list_groups if any(i < 5 for i in df[df.grp == group].time.unique())]
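
Since the question mentions up to 5,000 groups, hard-coding list_groups will not scale. A minimal sketch (assuming the same df and pandas import as above) that derives the group list from the data, plus an equivalent vectorized variant using groupby/transform:

# build the group list from the data instead of hard-coding it
list_groups = df['grp'].unique()
keep = [group for group in list_groups if any(i < 5 for i in df[df.grp == group].time.unique())]
df = df[df.grp.isin(keep)]

# alternatively, a vectorized one-liner: keep rows whose group's earliest time is before 5
# df = df[df.groupby('grp')['time'].transform('min') < 5]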

Upvotes: 2
