sbajew
sbajew

Reputation: 35

pandas df filtering on partial string values across a number of columns

I've recently started working with pandas and I stumbled across (likely trivial) problem I can't solve. Tho it's easier to do this in pure Python script, I really want it in pandas. Here's my rookie question.

Having the dataframe below:

 ID Sample1 quality1    Sample2 quality2    Sample3 quality3
ID1 val str1,str2,str3@num  val str1,str2,str3@num  val str1,str2,str3@num
ID2 val str4,str5,str63@num val str4,str5,st63@num  val str4,str5,str63@num
ID3 val str1,str2,str3@num  val str1,str1,str3@num  val str4,str2,str3@num
ID4 val str1,str2,str3@num  val str2,str2,str3@num  val str1,str2,str3@num
ID5 val str4,str5,str63@num val str4,str5,st63@num  val str4,str5,str63@num

I want to write a function to keep only rows that have a minimum required number of quality scores across n columns. Only the first part of the string really matters, so first select only first part of the string:

 ID Sample1 quality1    Sample2 quality2    Sample3 quality3
ID1 val str1    val str1    val str1
ID2 val str4    val str4    val str4
ID3 val str1    val str1    val str4
ID4 val str1    val str2    val str1
ID5 val str4    val str3    val str4

Let's say I only want to keep rows with minimum score of "str4" across two columns I'd probably calculate the percentage across columns:

 ID Sample1 quality1    Sample2 quality2    Sample3 quality3
ID2 val str4    val str4    val str4
ID5 val str4    val str3    val str4

This is how I started playing with it just to know where things are but I still can't put things back together:

for i,rows in enumerate(table_test.values):
    min_val = "str4"
    scores = rows[2::2]
    lists = np.ndarray.tolist(scores)
    for list in lists:
        first_str = list.split(",")
        print(i, first_str[0])

Thanks for the thoughts or/and help!

Upvotes: 1

Views: 136

Answers (1)

jezrael
jezrael

Reputation: 863031

Use boolean indexing for filtering with boolean mask:

min_val = "str4"
df = df[df.filter(like='quality').apply(lambda x: x.str.startswith(min_val)).sum(axis=1) >= 2]
print (df)
    ID Sample1             quality1 Sample2            quality2 Sample3  \
1  ID2     val  str4,str5,str63@num     val  str4,str5,st63@num     val   
4  ID5     val  str4,str5,str63@num     val  str4,str5,st63@num     val   

              quality3  
1  str4,str5,str63@num  
4  str4,str5,str63@num  

Or:

min_val = "str4"
df = df[df.filter(like='quality').applymap(lambda x: x.startswith(min_val)).sum(axis=1) >= 2]
print (df)
    ID Sample1             quality1 Sample2            quality2 Sample3  \
1  ID2     val  str4,str5,str63@num     val  str4,str5,st63@num     val   
4  ID5     val  str4,str5,str63@num     val  str4,str5,st63@num     val   

              quality3  
1  str4,str5,str63@num  
4  str4,str5,str63@num  

Explanation:

First filter all columns with quality string:

print (df.filter(like='quality'))
              quality1            quality2             quality3
0   str1,str2,str3@num  str1,str2,str3@num   str1,str2,str3@num
1  str4,str5,str63@num  str4,str5,st63@num  str4,str5,str63@num
2   str1,str2,str3@num  str1,str1,str3@num   str4,str2,str3@num
3   str1,str2,str3@num  str2,str2,str3@num   str1,str2,str3@num
4  str4,str5,str63@num  str4,str5,st63@num  str4,str5,str63@num

Compare all columns by startswith for boolean DataFrame:

print (df.filter(like='quality').apply(lambda x: x.str.startswith(min_val)))
   quality1  quality2  quality3
0     False     False     False
1      True      True      True
2     False     False      True
3     False     False     False
4      True      True      True

Count Trues values by sum - Trues are processes like 1s:

print (df.filter(like='quality').apply(lambda x: x.str.startswith(min_val)).sum(axis=1))
0    0
1    3
2    1
3    0
4    3
dtype: int64

Compare by treshold:

print (df.filter(like='quality').apply(lambda x: x.str.startswith(min_val)).sum(axis=1) >=2)
0    False
1     True
2    False
3    False
4     True
dtype: bool

If want also split quality columns first split all quality columns and assign back:

min_val = "str4"
cols = df.filter(like='quality').columns

df[cols] = df[cols].apply(lambda x: x.str.split(',').str[0])
#another solution
#df[cols] = df[cols].applymap(lambda x: x.split(',')[0])
print (df)
    ID Sample1 quality1 Sample2 quality2 Sample3 quality3
0  ID1     val     str1     val     str1     val     str1
1  ID2     val     str4     val     str4     val     str4
2  ID3     val     str1     val     str1     val     str4
3  ID4     val     str1     val     str2     val     str1
4  ID5     val     str4     val     str4     val     str4

Then compare by min_val for boolean DataFrame and filter same way as before:

df = df[(df[cols] == min_val).sum(axis=1) >=2]
print (df)
    ID Sample1 quality1 Sample2 quality2 Sample3 quality3
1  ID2     val     str4     val     str4     val     str4
4  ID5     val     str4     val     str4     val     str4

Upvotes: 1

Related Questions