Reputation: 25
I am currently adapting my data analysis pipeline from a wide to the tidy/long format and have run into a filtering problem that I just cannot wrap my head around.
My (simplified) data are microscopy intensities: in each measurement of a group I have several regions of interest (= roi) for which I track the intensity (= value) over several timepoints.
A roi is basically a single cell in the microscope image. I follow the change of intensity (= value) over time (= timepoint) and repeat this experiment a couple of times (= measurement), looking at several cells (= roi) each time.
My aim is to filter out, for all timepoints, those ROIs of a measurement that have an intensity value higher than my threshold at timepoint 0 (I consider these ROIs pre-activated).
import pandas as pd

data = {
    "timepoint": [0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3],
    "measurement": [1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,3,3,3,3,3,3,3,3],
    "roi": [1,1,1,1,2,2,2,2,3,3,3,3,1,1,1,1,1,1,1,1,2,2,2,2],
    "value": [0.1,0.2,0.3,0.4,0.1,0.2,0.3,0.4,0.5,0.6,0.8,0.9,0.1,0.2,0.3,0.4,0.5,0.6,0.8,0.9,0.1,0.2,0.3,0.4],
    "group": "control",
}
df = pd.DataFrame(data)
df
returns
timepoint measurement roi value group
0 0 1 1 0.1 control
1 1 1 1 0.2 control
2 2 1 1 0.3 control
3 3 1 1 0.4 control
4 0 1 2 0.1 control
5 1 1 2 0.2 control
6 2 1 2 0.3 control
7 3 1 2 0.4 control
8 0 1 3 0.5 control
9 1 1 3 0.6 control
10 2 1 3 0.8 control
11 3 1 3 0.9 control
12 0 2 1 0.1 control
13 1 2 1 0.2 control
14 2 2 1 0.3 control
15 3 2 1 0.4 control
16 0 3 1 0.5 control
17 1 3 1 0.6 control
18 2 3 1 0.8 control
19 3 3 1 0.9 control
20 0 3 2 0.1 control
21 1 3 2 0.2 control
22 2 3 2 0.3 control
23 3 3 2 0.4 control
Now I can select the rows containing ROIs with a value at timepoint 0 higher than my threshold with
threshold = 0.4
pre_activated = df.loc[(df['timepoint'] == 0) & (df['value'] > threshold)]
pre_activated
which returns
timepoint measurement roi value group
8 0 1 3 0.5 control
16 0 3 1 0.5 control
Now I want to filter out those cells (e.g. measurement 1, roi 3) for all timepoints 0 to 3 from the original dataframe df - this is the point where I am stuck right now.
If I use .isin
df.loc[~(df['measurement'].isin(pre_activated["measurement"]) & df['roi'].isin(pre_activated["roi"]))]
I get close, but everything with the measurement 1 / roi 1 pair is missing as well (so I assume it is a problem with the condition expression; see the check below the output):
timepoint measurement roi value group
4 0 1 2 0.1 control
5 1 1 2 0.2 control
6 2 1 2 0.3 control
7 3 1 2 0.4 control
12 0 2 1 0.1 control
13 1 2 1 0.2 control
14 2 2 1 0.3 control
15 3 2 1 0.4 control
20 0 3 2 0.1 control
21 1 3 2 0.2 control
22 2 3 2 0.3 control
23 3 3 2 0.4 control
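Presumably the problem is that the two .isin checks test each column independently, so the pair information is lost. A minimal check of that assumption:
# each column is tested on its own, so measurement 1 / roi 1 is also flagged,
# even though that exact (measurement, roi) pair is never pre-activated
print(1 in pre_activated["measurement"].values)  # True (from the pair measurement=1, roi=3)
print(1 in pre_activated["roi"].values)          # True (from the pair measurement=3, roi=1)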
I know I can use .query for at least a single measurement & roi pair
df[~df.isin(df.query('measurement == 1 & roi == 3'))]
which gets me somewhat close, although all integers are converted to float. In addition, the "group" column of the matched rows is now NaN, which would make it difficult when there are multiple groups with multiple measurements and ROIs per dataframe:
timepoint measurement roi value group
0 0.0 1.0 1.0 0.1 control
1 1.0 1.0 1.0 0.2 control
2 2.0 1.0 1.0 0.3 control
3 3.0 1.0 1.0 0.4 control
4 0.0 1.0 2.0 0.1 control
5 1.0 1.0 2.0 0.2 control
6 2.0 1.0 2.0 0.3 control
7 3.0 1.0 2.0 0.4 control
8 NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN NaN
10 NaN NaN NaN NaN NaN
11 NaN NaN NaN NaN NaN
12 0.0 2.0 1.0 0.1 control
13 1.0 2.0 1.0 0.2 control
14 2.0 2.0 1.0 0.3 control
15 3.0 2.0 1.0 0.4 control
16 0.0 3.0 1.0 0.5 control
17 1.0 3.0 1.0 0.6 control
18 2.0 3.0 1.0 0.8 control
19 3.0 3.0 1.0 0.9 control
20 0.0 3.0 2.0 0.1 control
21 1.0 3.0 2.0 0.2 control
22 2.0 3.0 2.0 0.3 control
23 3.0 3.0 2.0 0.4 control
I tried to use a dict that stores the measurement:roi pairs to avoid any mix-up, but I do not really know whether this would be useful (a possible use is sketched below the output):
msmt_list = pre_activated["measurement"].values
roi_list = pre_activated["roi"].values
mydict = {}
for i in range(len(msmt_list)):
    mydict[msmt_list[i]] = roi_list[i]
mydict
which returns
{1: 3, 3: 1}
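One way I could imagine using these pairs (just a sketch, I am not sure it is the right approach) is a set of tuples instead of a dict, since a dict can hold only one roi per measurement, followed by a row-wise pair test:
# set of (measurement, roi) pairs that are pre-activated
bad_pairs = set(zip(pre_activated["measurement"], pre_activated["roi"]))
# keep only rows whose (measurement, roi) pair is not in that set
pair_flagged = pd.Series(list(zip(df["measurement"], df["roi"])), index=df.index).isin(bad_pairs)
df_filtered = df[~pair_flagged]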
What is the best way to achieve this? I would appreciate any input, also in terms of efficiency, as I usually deal with 3-4 groups with 4-8 measurements each, up to 200 ROIs per measurement, and usually 360 timepoints.
Thanks!
Edit: Just to clarify what my desired output dataframes should look like
df_pre_activated (these are the "roi" that have a value higher than my threshold at timepoint 0):
timepoint measurement roi value group
8 0 1 3 0.5 control
9 1 1 3 0.6 control
10 2 1 3 0.8 control
11 3 1 3 0.9 control
16 0 3 1 0.5 control
17 1 3 1 0.6 control
18 2 3 1 0.8 control
19 3 3 1 0.9 control
df_filtered (this is basically the initial df without the data in df_pre_activated shown above):
timepoint measurement roi value group
0 0 1 1 0.1 control
1 1 1 1 0.2 control
2 2 1 1 0.3 control
3 3 1 1 0.4 control
4 0 1 2 0.1 control
5 1 1 2 0.2 control
6 2 1 2 0.3 control
7 3 1 2 0.4 control
12 0 2 1 0.1 control
13 1 2 1 0.2 control
14 2 2 1 0.3 control
15 3 2 1 0.4 control
20 0 3 2 0.1 control
21 1 3 2 0.2 control
22 2 3 2 0.3 control
23 3 3 2 0.4 control
Upvotes: 1
Views: 815
Reputation: 25
Thanks to @Jose A. Jimenez and @Vioxini for their answers. I went with Jose's suggestion and it gives me the output I wanted. I further improved performance using dask. My real input dataframe has the following shape:
inputdf.shape
(73124, 5)
Using only pandas:
import pandas as pd

threshold = 0.4
# ROIs whose value at timepoint 0 exceeds the threshold
pre_activated_t0 = inputdf[(inputdf['timepoint'] == 0) & (inputdf['value'] > threshold)]
# all timepoints of those ROIs (inner merge on the measurement/roi pairs)
pre_activated = inputdf.merge(pre_activated_t0[["measurement", "roi"]], how="inner", on=["measurement", "roi"])
# left merge, then keep only the rows that did not match (anti-join via the duplicated group column)
filtereddf = inputdf.merge(
    pre_activated,
    how="left",
    on=["timepoint", "measurement", "roi", "value"],
)
filtereddf = filtereddf[pd.isna(filtereddf["group_y"])]
filtereddf.drop("group_y", axis=1, inplace=True)
filtereddf.columns = list(inputdf.columns)
This takes 2 min 9 s.
Now with dask:
import dask.dataframe as dd

threshold = 0.4
pre_activated_t0 = inputdf[(inputdf['timepoint'] == 0) & (inputdf['value'] > threshold)]
pre_activated = inputdf.merge(pre_activated_t0[["measurement", "roi"]], how="inner", on=["measurement", "roi"])

# do the expensive left merge in dask instead of pandas
input_dd = dd.from_pandas(inputdf, npartitions=3)
pre_dd = dd.from_pandas(pre_activated, npartitions=3)
merger = dd.merge(input_dd, pre_dd, how="left", on=["timepoint", "measurement", "roi", "value"])
filtereddf = merger.compute()

filtereddf = filtereddf[pd.isna(filtereddf["group_y"])]
filtereddf.drop("group_y", axis=1, inplace=True)
filtereddf.columns = list(inputdf.columns)
This now takes only 42.6 s :-)
It is my first time using dask, so there might be options I am not aware of that could further improve speed, but for now that is OK.
Thanks again for the help!
Edit: I played around with the npartitions option when converting the pandas dataframe to a dask dataframe, and increasing it from 3 to npartitions=30 further improves performance: it now takes only 9.87 seconds.
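For reference, that change only affects the conversion step shown above; the rest of the pipeline stays the same:
# only the partition count is changed, everything else as in the dask snippet above
input_dd = dd.from_pandas(inputdf, npartitions=30)
pre_dd = dd.from_pandas(pre_activated, npartitions=30)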
Upvotes: 0
Reputation: 102
As simple as this:
In:
df[(df["measurement"] != 1) | (df["roi"] != 3)]
Out:
timepoint measurement roi value group
0 0 1 1 0.1 control
1 1 1 1 0.2 control
2 2 1 1 0.3 control
3 3 1 1 0.4 control
4 0 1 2 0.1 control
5 1 1 2 0.2 control
6 2 1 2 0.3 control
7 3 1 2 0.4 control
12 0 2 1 0.1 control
13 1 2 1 0.2 control
14 2 2 1 0.3 control
15 3 2 1 0.4 control
16 0 3 1 0.5 control
17 1 3 1 0.6 control
18 2 3 1 0.8 control
19 3 3 1 0.9 control
20 0 3 2 0.1 control
21 1 3 2 0.2 control
22 2 3 2 0.3 control
23 3 3 2 0.4 control
This happens because of boolean logic. You are thinking: "show me the dataframe where a is not 1 and b is not 3", which is the same as "show me the dataframe where not (a is 1 or b is 3)", and that removes every row with a 1 in a or a 3 in b.
You must use "a is not 1 or b is not 3", which is the same as "not (a is 1 and b is 3)".
Hope this helped. In one line.
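As an illustration (a minimal check on the example df, not part of the original one-liner), the two formulations are equivalent by De Morgan's law:
keep_a = (df["measurement"] != 1) | (df["roi"] != 3)     # "a is not 1 OR b is not 3"
keep_b = ~((df["measurement"] == 1) & (df["roi"] == 3))  # "NOT (a is 1 AND b is 3)"
print(keep_a.equals(keep_b))  # True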
Edit: To remove both the 1 : 3 and the 3 : 1 pair, combine both OR conditions with AND:
df[((df["measurement"] != 1) | (df["roi"] != 3)) & ((df["measurement"] != 3) | (df["roi"] != 1))]
Edit 2: To drop the flagged rows directly, you can apply the inverse of the filter condition instead of first selecting and then removing them.
In:
threshold = 0.4
full_activated = df5[(df5['timepoint'] != 0) | (df5['value'] <= threshold)]
full_activated
Out:
timepoint measurement roi value group
0 0 1 1 0.1 control
1 1 1 1 0.2 control
2 2 1 1 0.3 control
3 3 1 1 0.4 control
4 0 1 2 0.1 control
5 1 1 2 0.2 control
6 2 1 2 0.3 control
7 3 1 2 0.4 control
9 1 1 3 0.6 control
10 2 1 3 0.8 control
11 3 1 3 0.9 control
12 0 2 1 0.1 control
13 1 2 1 0.2 control
14 2 2 1 0.3 control
15 3 2 1 0.4 control
17 1 3 1 0.6 control
18 2 3 1 0.8 control
19 3 3 1 0.9 control
20 0 3 2 0.1 control
21 1 3 2 0.2 control
22 2 3 2 0.3 control
23 3 3 2 0.4 control
Edit 3:
Multiple conditions
threshold = 0.4
full_activated = df5[
    ((df5['timepoint'] != 0) | (df5['value'] <= threshold))
    & ((df5["measurement"] != 1) | (df5["roi"] != 3))
    & ((df5["measurement"] != 3) | (df5["roi"] != 1))
    & ((df5["measurement"] != 1) | (df5["roi"] != 1))
]
full_activated
Output:
timepoint measurement roi value group
4 0 1 2 0.1 control
5 1 1 2 0.2 control
6 2 1 2 0.3 control
7 3 1 2 0.4 control
12 0 2 1 0.1 control
13 1 2 1 0.2 control
14 2 2 1 0.3 control
15 3 2 1 0.4 control
20 0 3 2 0.1 control
21 1 3 2 0.2 control
22 2 3 2 0.3 control
23 3 3 2 0.4 control
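If the list of excluded pairs grows (the question mentions up to 200 ROIs per measurement), the same chained condition can presumably be built in a loop; a sketch with an illustrative pairs list:
# (measurement, roi) pairs to exclude; illustrative, reproducing the chained condition above
pairs = [(1, 3), (3, 1), (1, 1)]
keep = (df5['timepoint'] != 0) | (df5['value'] <= threshold)
for m, r in pairs:
    keep &= (df5["measurement"] != m) | (df5["roi"] != r)
full_activated = df5[keep]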
Upvotes: 0
Reputation:
A solution would be the following:
First, we calculate df_pre_activated_t0 by filtering df with the condition:
threshold = 0.4
df_pre_activated_t0 = df[(df['timepoint'] == 0) & (df['value'] > threshold)]
df_pre_activated_t0 looks like this:
timepoint measurement roi value group
8 0 1 3 0.5 control
16 0 3 1 0.5 control
We calculate df_pre_activated by merging df and df_pre_activated_t0 (inner merge):
df_pre_activated = df.merge(
    df_pre_activated_t0[["measurement", "roi"]], how="inner", on=["measurement", "roi"]
)
df_pre_activated looks like this:
timepoint measurement roi value group
0 0 1 3 0.5 control
1 1 1 3 0.6 control
2 2 1 3 0.8 control
3 3 1 3 0.9 control
4 0 3 1 0.5 control
5 1 3 1 0.6 control
6 2 3 1 0.8 control
7 3 3 1 0.9 control
To calculate df_filtered (df without the rows of df_pre_activated), we do a left merge between df and df_pre_activated and keep the rows whose values are not in df_pre_activated:
df_filtered = df.merge(
    df_pre_activated,
    how="left",
    on=["timepoint", "measurement", "roi", "value"]
)
df_filtered = df_filtered[pd.isna(df_filtered["group_y"])]
df_filtered looks like this:
timepoint measurement roi value group_x group_y
0 0 1 1 0.1 control NaN
1 1 1 1 0.2 control NaN
2 2 1 1 0.3 control NaN
3 3 1 1 0.4 control NaN
4 0 1 2 0.1 control NaN
5 1 1 2 0.2 control NaN
6 2 1 2 0.3 control NaN
7 3 1 2 0.4 control NaN
12 0 2 1 0.1 control NaN
13 1 2 1 0.2 control NaN
14 2 2 1 0.3 control NaN
15 3 2 1 0.4 control NaN
20 0 3 2 0.1 control NaN
21 1 3 2 0.2 control NaN
22 2 3 2 0.3 control NaN
23 3 3 2 0.4 control NaN
Finally, we drop the group_y column and restore the original column names:
df_filtered.drop("group_y", axis=1, inplace=True)
df_filtered.columns = list(df.columns)
df_filtered now looks like this:
timepoint measurement roi value group
0 0 1 1 0.1 control
1 1 1 1 0.2 control
2 2 1 1 0.3 control
3 3 1 1 0.4 control
4 0 1 2 0.1 control
5 1 1 2 0.2 control
6 2 1 2 0.3 control
7 3 1 2 0.4 control
12 0 2 1 0.1 control
13 1 2 1 0.2 control
14 2 2 1 0.3 control
15 3 2 1 0.4 control
20 0 3 2 0.1 control
21 1 3 2 0.2 control
22 2 3 2 0.3 control
23 3 3 2 0.4 control
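As a side note (not part of the original answer), the same anti-join can also be written with merge's indicator parameter, which avoids relying on the duplicated group column:
# left merge with an indicator column telling whether a row also exists in df_pre_activated
merged = df.merge(
    df_pre_activated,
    how="left",
    on=["timepoint", "measurement", "roi", "value", "group"],
    indicator=True,
)
df_filtered = merged[merged["_merge"] == "left_only"].drop(columns="_merge")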
Upvotes: 2