shireham
shireham

Reputation: 13

Spotfire box plot of top 10 values to date

I am trying to make a box plot of data in Spotfire that tracks the median/distribution for the top 10 most active compounds discovered in a drug discover project to date. For example, imagine we had the following (abbreviated) data table:

Cmpd_ID, Potency, Date_Collected
Cmpd_001, 4.6, 2023-09-28
Cmpd_002, 4.7, 2023-09-28
...
Cmpd_026, 6.5, 2024-01-26
Cmpd_027, 6.4, 2024-01-26
...
Cmpd_052, 5.6, 2024-02-25
Cmpd_053, 5.2, 2024-02-25
...
Cmpd_179, 8.7, 2024-04-19
Cmpd_180, 7.2, 2024-04-19
Cmpd_181, 7.5, 2024-04-19

As you can see, I have a list of compounds ordered by the date in which their data was collected. I would like to generate a box for each collection date which shows the mean/distribution for the top 10 compounds collected to date. So for the first box (2023-09-28), the 10 most potent compounds would be identified from just that first batch. The second box (2024-01-26) would identify compounds from both the first batch and the second batch. The last box (2024-04-19) would identify the top 10 compounds across all batches collected to date. The result would look something like the below plot.

box plot

Any ideas how to accomplish this?

I can find values like "Max" with a custom y-axis expression like: Max([Potency]) OVER (AllPrevious([Axis.X]))

...but I can't seem to just return the top 10 compounds discovered to date

Upvotes: 0

Views: 92

Answers (1)

Gaia Paolini
Gaia Paolini

Reputation: 1462

For every date you want to collect a different set of compounds. I can see a way of doing it date by date, if every time you position yourself on the chosen date, and collect the top 10 compounds. But you would only see one date at a time.

Here it goes:

  1. create a document property "testDate" that contain all the possible values of the "Date_Collected" column

  2. create a calculated column [Rank to Date] as:

    case when Integer(DateDiff('day',[Date_Collected],Date("${testDate}")))>=0 then DenseRank([Potency],'desc') end

  3. create a calculated column [Include] as:

    ([Rank to Date] is not Null) and ([Rank to Date]<=10)

Then build the box plot with [Include] on the X axis and [Potency] on the Y axis. Set Limit by expression to [Include].

Otherwise, it would work with a data function. I don't know how many compounds and dates you have in your dataset, so this solution might not be very fast.

Here it is (Python):

  1. input is your dataset ("data")

  2. output is a new table (here called "collected_candidates") which contains the top 10 compounds for each selected date ("Today").

  3. You then plot the new data table with [Today] on the X axis and [Potency] on the Y axis.

     import pandas as pd
     #Turn data type to date
     data['Date_Collected'] = pd.to_datetime(data['Date_Collected'], format='%d%b%Y')       
     #For each date, collect the top N compounds
     N=10
     collected_candidates_list =[]
     for dd in data['Date_Collected'].unique():
         data_dd = data.loc[data['Date_Collected']<=dd]
         data_dd['Potency_rank']=data_dd['Potency'].rank(method='dense', ascending=False)
         data_dd['Today']=dd
         collected_candidates_list.append(data_dd.loc[data_dd['Potency_rank']<=N,['Today','Cmpd_ID','Potency']])   
     collected_candidates=pd.concat(collected_candidates_list)
    

Upvotes: 0

Related Questions