Arnold Klein
Arnold Klein

Reputation: 3086

Pandas filtering on max range

I'm working on text mining problem and using Pandas for text processing. From the following example I need to pick only those row which have the max span (start - end) within the same category (cat)

Given this dataframe:

                             name  start  end        cat
0                        coumadin      0    8       DRUG
1                       albuterol     18   27       DRUG
2               albuterol sulfate     18   35       DRUG
3                         sulfate     28   35       DRUG
4                             2.5     36   39   STRENGTH
5                          2.5 mg     36   42   STRENGTH
6                    2.5 mg /3 ml     36   48   STRENGTH
7                           0.083     50   55   STRENGTH
8                         0.083 %     50   57   STRENGTH
9          2.5 mg /3 ml (0.083 %)     36   58   STRENGTH
10                       solution     59   67       FORM
11      solution for nebulization     59   84       FORM
12                   nebulization     72   84      ROUTE
13                        one (1)     90   97  FREQUENCY
14                            neb     98  101      ROUTE
15                 neb inhalation     98  112      ROUTE
16                     inhalation    102  112      ROUTE
17                            q4h    113  116  FREQUENCY
18                          every    118  123  FREQUENCY
19                  every 4 hours    118  131  FREQUENCY
20            q4h (every 4 hours)    113  132  FREQUENCY
21  q4h (every 4 hours) as needed    113  142  FREQUENCY
22                      as needed    133  142  FREQUENCY
23                        dyspnea    147  154     REASON

I need to get the following:

            name           start   end       cat
0                coumadin      0    8       DRUG
2   albuterol sulfate         18    35      DRUG
9   2.5 mg /3 ml (0.083 %)    36    58  STRENGTH
11  solution for nebulization 59    84      FORM
13                one (1)     90    97  FREQUENCY
15  neb inhalation            98    112    ROUTE
21  q4h (every 4 hours) as needed   113 142 FREQUENCY
23  dyspnea                  147    154   REASON

What I tried is to groupby by the category and then compute the max difference (end-start). However I got stuck how to find the max span between for the same entity within the category. I guess it should not be very tricky

COMMENT

Thank you all for suggestions, but I need ALL possible entities within each category. For example, in DRUG, there are two relevant drugs: coumadin and albuterol sulfate, and some fractions of them (albuterol and sulfate). I need to remove only (albuterol and sulfate) while keeping coumadin and albuterol sulfate. The same logic for other categories.

For example, rows 4-8 are all bits of a complete row 9, thus I need to keep only row 9. Rows 1 and 3 are parts of the row 2, thus I need to keep row 2 (in addition to row 0). Etc.

Obviously, all constituents ('bits') are within the max range, but the problem is to find the max (or unifying range) of the same entity and its constituents)

COMMENT 2

A possible solution could be: to find all overlapping intervals within the same category cat and pick the largest. I'm trying to implement, but not luck so far.

Possible Solution

I sorted columns by ascending and descending order:

df.sort_values(by=[1,2], ascending=[True, False])

                           0           1     2     3
0   coumadin                           0     8  DRUG
2   albuterol sulfate                 18    35  DRUG
1   albuterol                         18    27  DRUG
3   sulfate                           28    35  DRUG
9   2.5 mg /3 ml (0.083 %)            36    58  STRENGTH
6   2.5 mg /3 ml                      36    48  STRENGTH
5   2.5 mg                            36    42  STRENGTH
4   2.5                               36    39  STRENGTH
8   0.083 %                           50    57  STRENGTH
7   0.083                             50    55  STRENGTH
11  solution for nebulization         59    84  FORM
10  solution                          59    67  FORM
12  nebulization                      72    84  ROUTE
13  one (1)                           90    97  FREQUENCY
15  neb inhalation                    98    112 ROUTE
14  neb                               98    101 ROUTE
16  inhalation                       102    112 ROUTE
21  q4h (every 4 hours) as needed    113    142 FREQUENCY
20  q4h (every 4 hours)              113    132 FREQUENCY
17  q4h                              113    116 FREQUENCY
19  every 4 hours                    118    131 FREQUENCY
18  every                            118    123 FREQUENCY
22  as needed                        133    142 FREQUENCY
23  dyspnea                          147    154 REASON

Which puts the relevant row the first, however, I still need to filter out irrelevant rows....

Upvotes: 0

Views: 85

Answers (1)

Rahul Agarwal
Rahul Agarwal

Reputation: 4100

I have tried this on sample of your df:

Create a sample df:

import pandas as pd 
Name = ['coumadin','albuterol','albuterol sulfate','sulfate']
Cat = ['D', 'D', 'D', 'D']
Start = [0, 18, 18, 28]
End = [8, 27, 33,35]
ID = [1,2,3,4]
df = pd.DataFrame(data = list(zip(ID,Name,Start,End,Cat)), \
                  columns=['ID','Name','Start','End','Cat'])

Make a function which will help in identifying the names which are similar

def matcher(x):
    res = df.loc[df['Name'].str.contains(x, regex=False, case=False), 'ID']
    return ','.join(res.astype(str))

Applying this function to value of the column

df['Matches'] = df['Name'].apply(matcher) ##Matches will contain the ID of rows which are similar and have only 1 value which are absolute.

       ID               Name  Start  End Cat Matches
    0   1           coumadin      0    8   D       1
    1   2          albuterol     18   27   D     2,3
    2   3  albuterol sulfate     18   33   D       3
    3   4            sulfate     28   35   D     3,4 

Count the number of rows getting in matches

df['Count'] = df.Matches.apply(lambda x: len(x.split(',')))

Keep the df which has "Count" as 1 as these are the rows which contains the other rows:

df = df[df.Count == 1]

ID               Name  Start  End Cat Matches  Count
0   1           coumadin      0    8   D       1      1
2   3  albuterol sulfate     18   33   D       3      1

You can then remove unnecessary columns :)

Upvotes: 1

Related Questions