Reputation: 35
I've recently started working with pandas and I stumbled across (likely trivial) problem I can't solve. Tho it's easier to do this in pure Python script, I really want it in pandas. Here's my rookie question.
Having the dataframe below:
ID Sample1 quality1 Sample2 quality2 Sample3 quality3
ID1 val str1,str2,str3@num val str1,str2,str3@num val str1,str2,str3@num
ID2 val str4,str5,str63@num val str4,str5,st63@num val str4,str5,str63@num
ID3 val str1,str2,str3@num val str1,str1,str3@num val str4,str2,str3@num
ID4 val str1,str2,str3@num val str2,str2,str3@num val str1,str2,str3@num
ID5 val str4,str5,str63@num val str4,str5,st63@num val str4,str5,str63@num
I want to write a function to keep only rows that have a minimum required number of quality scores across n columns. Only the first part of the string really matters, so first select only first part of the string:
ID Sample1 quality1 Sample2 quality2 Sample3 quality3
ID1 val str1 val str1 val str1
ID2 val str4 val str4 val str4
ID3 val str1 val str1 val str4
ID4 val str1 val str2 val str1
ID5 val str4 val str3 val str4
Let's say I only want to keep rows with minimum score of "str4" across two columns I'd probably calculate the percentage across columns:
ID Sample1 quality1 Sample2 quality2 Sample3 quality3
ID2 val str4 val str4 val str4
ID5 val str4 val str3 val str4
This is how I started playing with it just to know where things are but I still can't put things back together:
for i,rows in enumerate(table_test.values):
min_val = "str4"
scores = rows[2::2]
lists = np.ndarray.tolist(scores)
for list in lists:
first_str = list.split(",")
print(i, first_str[0])
Thanks for the thoughts or/and help!
Upvotes: 1
Views: 136
Reputation: 863031
Use boolean indexing
for filtering with boolean mask:
min_val = "str4"
df = df[df.filter(like='quality').apply(lambda x: x.str.startswith(min_val)).sum(axis=1) >= 2]
print (df)
ID Sample1 quality1 Sample2 quality2 Sample3 \
1 ID2 val str4,str5,str63@num val str4,str5,st63@num val
4 ID5 val str4,str5,str63@num val str4,str5,st63@num val
quality3
1 str4,str5,str63@num
4 str4,str5,str63@num
Or:
min_val = "str4"
df = df[df.filter(like='quality').applymap(lambda x: x.startswith(min_val)).sum(axis=1) >= 2]
print (df)
ID Sample1 quality1 Sample2 quality2 Sample3 \
1 ID2 val str4,str5,str63@num val str4,str5,st63@num val
4 ID5 val str4,str5,str63@num val str4,str5,st63@num val
quality3
1 str4,str5,str63@num
4 str4,str5,str63@num
Explanation:
First filter
all columns with quality
string:
print (df.filter(like='quality'))
quality1 quality2 quality3
0 str1,str2,str3@num str1,str2,str3@num str1,str2,str3@num
1 str4,str5,str63@num str4,str5,st63@num str4,str5,str63@num
2 str1,str2,str3@num str1,str1,str3@num str4,str2,str3@num
3 str1,str2,str3@num str2,str2,str3@num str1,str2,str3@num
4 str4,str5,str63@num str4,str5,st63@num str4,str5,str63@num
Compare all columns by startswith
for boolean DataFrame
:
print (df.filter(like='quality').apply(lambda x: x.str.startswith(min_val)))
quality1 quality2 quality3
0 False False False
1 True True True
2 False False True
3 False False False
4 True True True
Count True
s values by sum
- True
s are processes like 1
s:
print (df.filter(like='quality').apply(lambda x: x.str.startswith(min_val)).sum(axis=1))
0 0
1 3
2 1
3 0
4 3
dtype: int64
Compare by treshold:
print (df.filter(like='quality').apply(lambda x: x.str.startswith(min_val)).sum(axis=1) >=2)
0 False
1 True
2 False
3 False
4 True
dtype: bool
If want also split quality
columns first split
all quality
columns and assign back:
min_val = "str4"
cols = df.filter(like='quality').columns
df[cols] = df[cols].apply(lambda x: x.str.split(',').str[0])
#another solution
#df[cols] = df[cols].applymap(lambda x: x.split(',')[0])
print (df)
ID Sample1 quality1 Sample2 quality2 Sample3 quality3
0 ID1 val str1 val str1 val str1
1 ID2 val str4 val str4 val str4
2 ID3 val str1 val str1 val str4
3 ID4 val str1 val str2 val str1
4 ID5 val str4 val str4 val str4
Then compare by min_val
for boolean DataFrame and filter same way as before:
df = df[(df[cols] == min_val).sum(axis=1) >=2]
print (df)
ID Sample1 quality1 Sample2 quality2 Sample3 quality3
1 ID2 val str4 val str4 val str4
4 ID5 val str4 val str4 val str4
Upvotes: 1