Reputation: 11
I have a pandas DataFrame loaded from a tab-separated file. I would like to get the rows that contain a specific string in the first part of the third column. See this example:
col1 col2 col3
a a1 good | bad
b b1 bad | bad
c c1 foogood | foobad
d d1 bad | good
e e1 good | good
f f1 bad | foobad
Here, I want to split the third column on the | character and then keep the rows that have the string good in the first part. So the desired output would be:
col1 col2 col3
a a1 good | bad
c c1 foogood | foobad
e e1 good | good
I know that this can be done using the apply function, but I don't know how to define such a function. I also tried filtering using [] but could not get it to work. Appreciate your help, brothers and sisters.
Update: I also have the underlying TSV file and know this can be done using grep and awk as well. So either solution is fine.
Upvotes: 1
Views: 295
Reputation: 50750
This is very easy with awk.
$ cat file
col1 col2 col3
a a1 good | bad
b b1 bad | bad
c c1 foogood | foobad
d d1 bad | good
e e1 good | good
f f1 bad | foobad
$
$ awk -F $'\t' '(NR == 1 || $3 ~ /^[^|]*good/)' file
col1 col2 col3
a a1 good | bad
c c1 foogood | foobad
e e1 good | good
Also, you can give sed a try (this relies on \t being read as a tab inside the regex, which GNU sed does):
sed -n '1p; /^[^\t]*\t[^\t]*\t[^|]*good/p' file
which produces the desired result as well.
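If you would rather stay in Python, the same regular expression can be applied to the TSV without awk. This is only a rough sketch using the standard library: the SAMPLE_TSV string stands in for your file, and the names filter_rows and FIRST_PART_HAS_GOOD are made up for illustration.

```python
import csv
import io
import re

# Sample data standing in for the TSV file from the question (an assumption;
# in practice you would open the real file instead).
SAMPLE_TSV = (
    "col1\tcol2\tcol3\n"
    "a\ta1\tgood | bad\n"
    "b\tb1\tbad | bad\n"
    "c\tc1\tfoogood | foobad\n"
    "d\td1\tbad | good\n"
    "e\te1\tgood | good\n"
    "f\tf1\tbad | foobad\n"
)

# Same pattern as in the awk command: "good" must appear before the first "|".
FIRST_PART_HAS_GOOD = re.compile(r"^[^|]*good")


def filter_rows(lines):
    """Return the header and the rows whose third field matches the pattern."""
    reader = csv.reader(lines, delimiter="\t")
    header = next(reader)
    kept = [
        row for row in reader
        if len(row) > 2 and FIRST_PART_HAS_GOOD.search(row[2])
    ]
    return header, kept


header, kept = filter_rows(io.StringIO(SAMPLE_TSV))
print("\t".join(header))
for row in kept:
    print("\t".join(row))
```

To run it on a real file, pass an open file handle to filter_rows instead of the io.StringIO wrapper.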
Upvotes: 1
Reputation: 12406
If the required text good is always immediately followed by " |" in the first part, then you can combine the two into a single substring (in that order) and pass it directly to .str.contains(). Pandas can then select these rows.
Input DF
df
col1 col2 col3
0 a a1 good | bad
1 b b1 bad | bad
2 c c1 foogood | foobad
3 d d1 bad | good
4 e e1 good | good
5 f f1 bad | foobad
Filtered DF
df[df['col3'].str.contains('good |', regex=False)]
col1 col2 col3
0 a a1 good | bad
2 c c1 foogood | foobad
4 e e1 good | good
You could do the same with bad (instead of good):
df[df['col3'].str.contains('bad |', regex=False)]
col1 col2 col3
1 b b1 bad | bad
3 d d1 bad | good
5 f f1 bad | foobad
Upvotes: 1
Reputation: 323226
You can use str.split with contains:
yourdf=df[df.col3.str.split('|').str[0].str.contains('good')]
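Here is a self-contained sketch of that one-liner, plus an apply version since the question asked about it. The sample data is rebuilt from the question, and first_part_has_good is just a helper name I made up, so treat this as one possible way to write it rather than the only one.

```python
import io

import pandas as pd

# Sample data rebuilt from the question (an assumption; normally you would
# call pd.read_csv on the real TSV path).
SAMPLE_TSV = (
    "col1\tcol2\tcol3\n"
    "a\ta1\tgood | bad\n"
    "b\tb1\tbad | bad\n"
    "c\tc1\tfoogood | foobad\n"
    "d\td1\tbad | good\n"
    "e\te1\tgood | good\n"
    "f\tf1\tbad | foobad\n"
)
df = pd.read_csv(io.StringIO(SAMPLE_TSV), sep="\t")

# Vectorised version: take the text before the first "|" and test it.
first_part = df["col3"].str.split("|").str[0]
by_split = df[first_part.str.contains("good", regex=False)]


def first_part_has_good(value):
    """Hypothetical helper: True if 'good' occurs before the first '|'."""
    return "good" in value.split("|")[0]


# apply version: build a boolean mask row by row, then index with it.
by_apply = df[df["col3"].apply(first_part_has_good)]

print(by_split)
```

Both versions should select the same rows. The str-based one is usually the better pick on large frames, because apply calls the Python function once per row.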
Upvotes: 2