Remove rows from a pandas df

I am trying to delete rows from a pandas df based on the values in Col A. Specifically, when the row directly beneath an X in Col A is empty, I want to delete those rows, and keep deleting until there is a string underneath the value X.

import pandas as pd

d = ({
    'A' : ['X','','','X','Foo','','X','Fou','','X','Bar'],           
    'B' : ['Val',1,3,'Val',1,3,'Val',1,3,'Val',1],
    'C' : ['Val',2,4,'Val',2,4,'Val',2,4,'Val',2],
    })

df = pd.DataFrame(data=d)

Output:

      A    B    C
0     X  Val  Val
1          1    2
2          3    4
3     X  Val  Val
4   Foo    1    2
5          3    4
6     X  Val  Val
7   Fou    1    2
8          3    4
9     X  Val  Val
10  Bar    1    2

I have tried:

df = df[~(df['A'] == 'X').shift().fillna(False)]

But this deletes every row that directly follows an X. I only want rows to be deleted if the row below X is empty.
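
To see why the attempt removes too much, it can help to list which rows the shifted mask flags. This is only a minimal sketch on the sample data above: `shift(fill_value=False)` stands in for `.shift().fillna(False)` to keep the mask boolean, and the name `follows_x` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["X", "", "", "X", "Foo", "", "X", "Fou", "", "X", "Bar"],
    "B": ["Val", 1, 3, "Val", 1, 3, "Val", 1, 3, "Val", 1],
    "C": ["Val", 2, 4, "Val", 2, 4, "Val", 2, 4, "Val", 2],
})

# True for every row that sits directly below an X, whatever that row contains
follows_x = (df["A"] == "X").shift(fill_value=False)

# Values in A on the rows the attempt drops; Foo, Fou and Bar are among them
print(df.loc[follows_x, "A"].tolist())

# Index labels of rows that are both empty and directly below an X
print(df.index[follows_x & (df["A"] == "")].tolist())
```

Row 2 is never flagged because it follows an empty row rather than an X, so a single shift cannot express the rule on its own.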

Intended:

     A    B    C
0    X  Val  Val
1  Foo    1    2
2         3    4
3    X  Val  Val
4  Fou    1    2
5         3    4
6    X  Val  Val
7  Bar    1    2

Upvotes: 4

Answers (3)

jezrael

Reputation: 863166

Use:

m1 = df['A'] == 'X'
g =  m1.cumsum()
m = (df['A'] == '') | m1

df = df[~m.groupby(g).transform('all')]
print (df)
      A    B    C
3     X  Val  Val
4   Foo    1    2
5          3    4
6     X  Val  Val
7   Fou    1    2
8          3    4
9     X  Val  Val
10  Bar    1    2

Details:

m1 = df['A'] == 'X'
g =  m1.cumsum()
m = (df['A'] == '') | m1

print (pd.concat([df,
                  df['A'] == 'X',
                  m1.cumsum(),
                  (df['A'] == ''), 
                  m,
                  m.groupby(g).transform('all'),
                  ~m.groupby(g).transform('all')], axis=1,
       keys=['orig','==X','g','==space','m', 'all', 'inverted all']))

   orig              ==X  g ==space      m    all inverted all
      A    B    C      A  A       A      A      A            A
0     X  Val  Val   True  1   False   True   True        False
1          1    2  False  1    True   True   True        False
2          3    4  False  1    True   True   True        False
3     X  Val  Val   True  2   False   True  False         True
4   Foo    1    2  False  2   False  False  False         True
5          3    4  False  2    True   True  False         True
6     X  Val  Val   True  3   False   True  False         True
7   Fou    1    2  False  3   False  False  False         True
8          3    4  False  3    True   True  False         True
9     X  Val  Val   True  4   False   True  False         True
10  Bar    1    2  False  4   False  False  False         True

Explanation:

Compare column A with X and take the cumulative sum, so that each group starting with X gets its own number in g
Chain the 2 boolean masks (equal to X, equal to empty string) with | into m
Use groupby with transform and SeriesGroupBy.all to return True for every row of a group that contains only True values
Finally, invert the mask and filter by boolean indexing
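
The four steps above can be wrapped in a small helper. This is only a sketch: the function name `drop_empty_x_groups` and its parameters are made up for illustration, and the `reset_index(drop=True)` call is an addition to match the renumbered index in the intended output.

```python
import pandas as pd

def drop_empty_x_groups(df, col="A", marker="X"):
    """Drop each marker-led group whose values in `col` are only marker or ''."""
    is_marker = df[col] == marker
    # Each marker starts a new group: 1, 1, 1, 2, 2, 2, ...
    group_id = is_marker.cumsum()
    # True where the row is either the marker itself or an empty string
    blank_or_marker = is_marker | (df[col] == "")
    # True for every row of a group that has nothing but marker/empty values
    only_blanks = blank_or_marker.groupby(group_id).transform("all")
    return df[~only_blanks].reset_index(drop=True)

df = pd.DataFrame({
    "A": ["X", "", "", "X", "Foo", "", "X", "Fou", "", "X", "Bar"],
    "B": ["Val", 1, 3, "Val", 1, 3, "Val", 1, 3, "Val", 1],
    "C": ["Val", 2, 4, "Val", 2, 4, "Val", 2, 4, "Val", 2],
})

result = drop_empty_x_groups(df)
print(result)
```

Note that a group consisting of a lone X with nothing after it also counts as "only marker or empty" and is dropped, as are empty rows that appear before the first X.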

Upvotes: 1

CezarySzulc

Reputation: 2009

This is your solution:

(df['A'] == 'X').shift()
0       NaN
1      True
2     False
3     False
4      True
5     False
6     False
7      True
8     False
9     False
10     True
Name: A, dtype: object

In [15]: (df['A'] == '')
Out[15]:
0     False
1      True
2      True
3     False
4     False
5      True
6     False
7     False
8      True
9     False
10    False
Name: A, dtype: bool

In [14]: ((df['A'] == '') & (df['A'] == 'X').shift())
Out[14]:
0     False
1      True
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
Name: A, dtype: bool

And the result is:

df[~((df['A'] == '') & (df['A'] == 'X').shift())]
Out[16]:
    A   B   C
0   X   Val Val
2       3   4
3   X   Val Val
4   Foo 1   2
5       3   4
6   X   Val Val
7   Fou 1   2
8       3   4
9   X   Val Val
10  Bar 1   2

EDIT: If you need to, you can do it in a while loop:

old_size_df = df.size
new_size_df = 0

while old_size_df != new_size_df:
    old_size_df = df.size
    df = df[~((df['A'] == '') & (df['A'] == 'X').shift())]
    new_size_df = df.size

    A   B   C
0   X   Val Val
3   X   Val Val
4   Foo 1   2
5       3   4
6   X   Val Val
7   Fou 1   2
8       3   4
9   X   Val Val
10  Bar 1   2
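
The same loop can be written as a function that stops as soon as nothing is left to drop. This is a sketch under assumptions: the name `drop_blanks_after_marker` is hypothetical, and `shift(fill_value=False)` replaces the plain `shift()` so the mask stays boolean instead of becoming object dtype with a NaN.

```python
import pandas as pd

def drop_blanks_after_marker(df, col="A", marker="X"):
    """Repeatedly drop empty rows that sit directly below a marker row."""
    while True:
        # shift works on row order, so it sees the rows left after each pass
        after_marker = (df[col] == marker).shift(fill_value=False)
        to_drop = (df[col] == "") & after_marker
        if not to_drop.any():
            return df
        df = df[~to_drop]

df = pd.DataFrame({
    "A": ["X", "", "", "X", "Foo", "", "X", "Fou", "", "X", "Bar"],
    "B": ["Val", 1, 3, "Val", 1, 3, "Val", 1, 3, "Val", 1],
    "C": ["Val", 2, 4, "Val", 2, 4, "Val", 2, 4, "Val", 2],
})

result = drop_blanks_after_marker(df)
print(result)
```

As in the output above, this keeps the X in row 0 and removes only the empty rows beneath it, so it does not fully match the intended output in the question.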

Upvotes: 0

Grigoriy Mikhalkin

Reputation: 5573

Here is a solution with a custom apply function:

import pandas as pd

d = ({
    'A' : ['X','','','X','Foo','','X','Fou','','X','Bar'],           
    'B' : ['Val',1,3,'Val',1,3,'Val',1,3,'Val',1],
    'C' : ['Val',2,4,'Val',2,4,'Val',2,4,'Val',2],
})
df = pd.DataFrame(data=d)

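# True while the previous kept row was an X (stays True across the blanks that follow it)
is_x = False
def fill_empty_a(row):
    global is_x
    if row['A'] == '' and is_x:
        # Mark the row for removal by dropna()
        row['A'] = None
    else:
        is_x = row['A'] == 'X'
    return row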

(df.apply(fill_empty_a, axis=1)
   .dropna()
   .reset_index(drop=True))
#      A    B    C
# 0    X  Val  Val
# 1    X  Val  Val
# 2  Foo    1    2
# 3         3    4
# 4    X  Val  Val
# 5  Fou    1    2
# 6         3    4
# 7    X  Val  Val
# 8  Bar    1    2
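
The same state machine can be written without a global variable by building the mask in a plain loop over column A. This is a sketch only: the helper name `blanks_after_marker_mask` is made up for illustration, and it is a swapped-in variant, not the code from the answer above.

```python
import pandas as pd

def blanks_after_marker_mask(values, marker="X"):
    """Return a list of booleans: True for blanks in a run right below a marker."""
    drop = []
    after_marker = False
    for value in values:
        if value == "" and after_marker:
            drop.append(True)   # still inside the blank run under the marker
        else:
            drop.append(False)
            after_marker = value == marker
    return drop

df = pd.DataFrame({
    "A": ["X", "", "", "X", "Foo", "", "X", "Fou", "", "X", "Bar"],
    "B": ["Val", 1, 3, "Val", 1, 3, "Val", 1, 3, "Val", 1],
    "C": ["Val", 2, 4, "Val", 2, 4, "Val", 2, 4, "Val", 2],
})

drop_mask = pd.Series(blanks_after_marker_mask(df["A"]), index=df.index)
result = df[~drop_mask].reset_index(drop=True)
print(result)
```

Filtering with a mask instead of `dropna()` also avoids removing rows that happen to contain missing values in B or C.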

Upvotes: 0
