yatu
yatu

Reputation: 88236

Groupby search first and last True values

I have a pd.Series with duplicated indices, and each index containing a set of booleans:

FA155    False
FA155    False
FA155    False
FA155    True
FA155    True
FA155    True
FA155    True
FA155    True
FA155    False

What I'm trying to do for each different index in an efficient way, is to keep only as True the first and last True values of the sequence, and set the rest to False. There can also be False values between those that are True.

So for this sample the result would be:

FA155    False
FA155    False
FA155    False
FA155    True
FA155    False
FA155    False
FA155    False
FA155    True
FA155    False

Any help would be very appreciated.

Upvotes: 9

Views: 1381

Answers (3)

BENY
BENY

Reputation: 323236

This is based on the diff to get the group starting point , I am using iloc twice since you need keep the head and tail True

df1=df.copy()
df.loc[df]=df.astype(int).diff().ne(0)[df]
df=df.iloc[::-1]
df1=df1.iloc[::-1]
df.loc[df1]+=df1.astype(int).diff().ne(0)[df1]
df=df.iloc[::-1]

Upvotes: 1

Mohamed AL ANI
Mohamed AL ANI

Reputation: 2062

You filter True values and then you aggregate to find the first and last values. Then you can use loc to replace those values in df. df is your dataframe. col is the name of your column with True and False values

df["nb"] = range(df.shape[0])
df.reset_index(inplace=True)

elem = (df[df[col]==True].groupby("index")["nb"].agg({ "first_True": 'first', "last_True":"last"})).values

indexes_to_False = sum(elem.tolist(), [])

df.loc[indexes_to_False, col] = False

Then you can drop the column nb and reindex if you wish

Upvotes: 1

rafaelc
rafaelc

Reputation: 59274

You can use loc with idxmax with both your original df and your inverted df.

This will yield the index of your first and last True values. Just set the different indexes to False afterwards.

For example:

Setup

z = sio("""i    v
FA154    False
FA155    False
FA155    True
FA155    True
FA155    True
FA155    True
FA155    True
FA155    False
FA156    False
FA156    True
FA156    False
FA156    False
FA156    True""")

df = pd.read_table(z, delim_whitespace=True)

    i       v
0   FA154   False
1   FA155   False
2   FA155   True
3   FA155   True
4   FA155   True
5   FA155   True
6   FA155   True
7   FA155   False
8   FA156   False
9   FA156   True
10  FA156   False
11  FA156   False
12  FA156   True

idxmax()

Which is the same thing as getting your df and using reset_index. Then, get list of indexes for you first (v1) and last (v2) True values:

v1 = df.groupby("i").v.idxmax().values
v2 = df[::-1].groupby("i").v.idxmax().values

And use your logic:

df.loc[v1, "v"] = True & df.loc[v1, "v"]
df.loc[v2, "v"] = True & df.loc[v2, "v"]
df.loc[~df.index.isin(np.concatenate([v1,v2])), "v"] = False

The idea behind using & is not to accidentally set any False values to True.

Result:

>>> df.set_index("i")

        v
i   
FA154   False
FA155   False
FA155   True
FA155   False
FA155   False
FA155   False
FA155   True
FA155   False
FA156   False
FA156   True
FA156   False
FA156   False
FA156   True

Upvotes: 3

Related Questions