Bokone Mahal

Reputation: 11

Filtering rows in a pandas DataFrame with a specific condition

I have a pandas DataFrame loaded from a tab-separated file. I would like to get the rows that contain a specific string in the first part of the third column. See this example:

col1    col2    col3
a       a1      good | bad
b       b1      bad | bad
c       c1      foogood | foobad
d       d1      bad | good
e       e1      good | good
f       f1      bad | foobad

Here, I want to split the third column on the | character and keep only the rows whose first part (the text before the |) contains the string good. So the desired output would be

col1    col2    col3
a       a1      good | bad
c       c1      foogood | foobad
e       e1      good | good

I know that this can be done with apply, but I don't know how to write the function to pass to it. I also tried filtering with [] but could not get it to work. I appreciate your help, brothers and sisters.

Update: I also have the underlying TSV file and know this can be done using grep and awk as well. So either solution is fine.

Upvotes: 1

Views: 295

Answers (3)

oguz ismail

Reputation: 50750

This is very easy with awk.

$ cat file
col1    col2    col3
a       a1      good | bad
b       b1      bad | bad
c       c1      foogood | foobad
d       d1      bad | good
e       e1      good | good
f       f1      bad | foobad
$
$ awk -F $'\t' '(NR == 1 || $3 ~ /^[^|]*good/)' file
col1    col2    col3
a       a1      good | bad
c       c1      foogood | foobad
e       e1      good | good

You can also give sed a try (this relies on GNU sed, which interprets \t as a tab):

sed -n '1p; /^[^\t]*\t[^\t]*\t[^|]*good/p' file

which produces the desired result as well.
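
For anyone who wants to stay in pandas, the same regular expression can probably be reused with .str.contains(). This is only a sketch: the sample data is retyped from the question, and the names tsv, mask and result are made up for the illustration.

```python
import io

import pandas as pd

# Sample data retyped from the question (tab-separated); illustrative only.
tsv = (
    "col1\tcol2\tcol3\n"
    "a\ta1\tgood | bad\n"
    "b\tb1\tbad | bad\n"
    "c\tc1\tfoogood | foobad\n"
    "d\td1\tbad | good\n"
    "e\te1\tgood | good\n"
    "f\tf1\tbad | foobad\n"
)
df = pd.read_csv(io.StringIO(tsv), sep="\t")

# Same idea as the awk test on $3: starting at the beginning of the field,
# allow any run of non-pipe characters, then require "good".
mask = df["col3"].str.contains(r"^[^|]*good")
result = df[mask]
print(result)
```

Because [^|]* cannot cross the first |, a good that only appears in the second part (row d) is not matched.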

Upvotes: 1

edesz

Reputation: 12406

If the first part always ends with good and is immediately followed by " |", you can combine these into a single literal substring ("good |") and pass it to .str.contains(). Pass regex=False, because by default the pattern is treated as a regular expression, in which | means alternation. Pandas then keeps only the matching rows.

Input DF

df
  col1 col2              col3
0    a   a1        good | bad
1    b   b1         bad | bad
2    c   c1  foogood | foobad
3    d   d1        bad | good
4    e   e1       good | good
5    f   f1      bad | foobad

Filtered DF

df[df['col3'].str.contains('good |', regex=False)]
  col1 col2              col3
0    a   a1        good | bad
2    c   c1  foogood | foobad
4    e   e1       good | good

You could do the same with bad (instead of good)

df[df['col3'].str.contains('bad |', regex=False)]
  col1 col2          col3
1    b   b1     bad | bad
3    d   d1    bad | good
5    f   f1  bad | foobad
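
One thing to watch with this approach: .str.contains() interprets its pattern as a regular expression unless regex=False is passed, so an unescaped | splits the pattern into alternatives. The small sketch below shows the difference on a hypothetical Series in which one value has a trailing space, as the raw rows in the question appear to have.

```python
import pandas as pd

# Hypothetical values; the second one has a trailing space on purpose.
col3 = pd.Series(["good | bad", "bad | good ", "bad | bad"])

# As a regex, "good | \n" means: "good " OR " \n".
# Any value containing "good " matches, wherever it appears.
as_regex = col3.str.contains("good | \n")

# As a literal, "good |" must appear verbatim, pipe included.
as_literal = col3.str.contains("good |", regex=False)

print(as_regex.tolist())
print(as_literal.tolist())
```

With the regex form, the second value matches even though good is only in the second part; with the literal form it does not.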

Upvotes: 1

BENY

Reputation: 323226

You can use str.split together with str.contains:

yourdf=df[df.col3.str.split('|').str[0].str.contains('good')]
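
A self-contained version of this, for anyone who wants to run it, might look like the sketch below. The DataFrame is retyped from the question, and first_part_has_good is a made-up helper name that shows the apply variant the question asked about.

```python
import pandas as pd

# Sample data retyped from the question; illustrative only.
df = pd.DataFrame({
    "col1": ["a", "b", "c", "d", "e", "f"],
    "col2": ["a1", "b1", "c1", "d1", "e1", "f1"],
    "col3": [
        "good | bad",
        "bad | bad",
        "foogood | foobad",
        "bad | good",
        "good | good",
        "bad | foobad",
    ],
})

# Vectorised: take the text before the first "|" and test it for "good".
first_part = df["col3"].str.split("|").str[0]
filtered = df[first_part.str.contains("good")]

# The same check written as a plain function for use with apply.
def first_part_has_good(value):
    return "good" in value.split("|")[0]

filtered_apply = df[df["col3"].apply(first_part_has_good)]

print(filtered)
```

Both versions should select the same rows; the vectorised form is usually the more idiomatic choice in pandas.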

Upvotes: 2
