Reputation: 3086
I'm working on text mining problem and using Pandas for text processing. From the following example I need to pick only those row which have the max span (start
- end
) within the same category (cat
)
Given this dataframe:
name start end cat
0 coumadin 0 8 DRUG
1 albuterol 18 27 DRUG
2 albuterol sulfate 18 35 DRUG
3 sulfate 28 35 DRUG
4 2.5 36 39 STRENGTH
5 2.5 mg 36 42 STRENGTH
6 2.5 mg /3 ml 36 48 STRENGTH
7 0.083 50 55 STRENGTH
8 0.083 % 50 57 STRENGTH
9 2.5 mg /3 ml (0.083 %) 36 58 STRENGTH
10 solution 59 67 FORM
11 solution for nebulization 59 84 FORM
12 nebulization 72 84 ROUTE
13 one (1) 90 97 FREQUENCY
14 neb 98 101 ROUTE
15 neb inhalation 98 112 ROUTE
16 inhalation 102 112 ROUTE
17 q4h 113 116 FREQUENCY
18 every 118 123 FREQUENCY
19 every 4 hours 118 131 FREQUENCY
20 q4h (every 4 hours) 113 132 FREQUENCY
21 q4h (every 4 hours) as needed 113 142 FREQUENCY
22 as needed 133 142 FREQUENCY
23 dyspnea 147 154 REASON
I need to get the following:
name start end cat
0 coumadin 0 8 DRUG
2 albuterol sulfate 18 35 DRUG
9 2.5 mg /3 ml (0.083 %) 36 58 STRENGTH
11 solution for nebulization 59 84 FORM
13 one (1) 90 97 FREQUENCY
15 neb inhalation 98 112 ROUTE
21 q4h (every 4 hours) as needed 113 142 FREQUENCY
23 dyspnea 147 154 REASON
What I tried is to groupby
by the category and then compute the max difference (end
-start
). However I got stuck how to find the max span between for the same entity within the category. I guess it should not be very tricky
COMMENT
Thank you all for suggestions, but I need ALL possible entities within each category. For example, in DRUG, there are two relevant drugs: coumadin
and albuterol sulfate
, and some fractions of them (albuterol
and sulfate
). I need to remove only (albuterol
and sulfate
) while keeping coumadin
and albuterol sulfate
. The same logic for other categories.
For example, rows 4-8 are all bits of a complete row 9, thus I need to keep only row 9. Rows 1 and 3 are parts of the row 2, thus I need to keep row 2 (in addition to row 0). Etc.
Obviously, all constituents ('bits') are within the max range, but the problem is to find the max (or unifying range) of the same entity and its constituents)
COMMENT 2
A possible solution could be: to find all overlapping intervals within the same category cat
and pick the largest. I'm trying to implement, but not luck so far.
Possible Solution
I sorted columns by ascending and descending order:
df.sort_values(by=[1,2], ascending=[True, False])
0 1 2 3
0 coumadin 0 8 DRUG
2 albuterol sulfate 18 35 DRUG
1 albuterol 18 27 DRUG
3 sulfate 28 35 DRUG
9 2.5 mg /3 ml (0.083 %) 36 58 STRENGTH
6 2.5 mg /3 ml 36 48 STRENGTH
5 2.5 mg 36 42 STRENGTH
4 2.5 36 39 STRENGTH
8 0.083 % 50 57 STRENGTH
7 0.083 50 55 STRENGTH
11 solution for nebulization 59 84 FORM
10 solution 59 67 FORM
12 nebulization 72 84 ROUTE
13 one (1) 90 97 FREQUENCY
15 neb inhalation 98 112 ROUTE
14 neb 98 101 ROUTE
16 inhalation 102 112 ROUTE
21 q4h (every 4 hours) as needed 113 142 FREQUENCY
20 q4h (every 4 hours) 113 132 FREQUENCY
17 q4h 113 116 FREQUENCY
19 every 4 hours 118 131 FREQUENCY
18 every 118 123 FREQUENCY
22 as needed 133 142 FREQUENCY
23 dyspnea 147 154 REASON
Which puts the relevant row the first, however, I still need to filter out irrelevant rows....
Upvotes: 0
Views: 85
Reputation: 4100
I have tried this on sample of your df:
Create a sample df:
import pandas as pd
Name = ['coumadin','albuterol','albuterol sulfate','sulfate']
Cat = ['D', 'D', 'D', 'D']
Start = [0, 18, 18, 28]
End = [8, 27, 33,35]
ID = [1,2,3,4]
df = pd.DataFrame(data = list(zip(ID,Name,Start,End,Cat)), \
columns=['ID','Name','Start','End','Cat'])
Make a function which will help in identifying the names which are similar
def matcher(x):
res = df.loc[df['Name'].str.contains(x, regex=False, case=False), 'ID']
return ','.join(res.astype(str))
Applying this function to value of the column
df['Matches'] = df['Name'].apply(matcher) ##Matches will contain the ID of rows which are similar and have only 1 value which are absolute.
ID Name Start End Cat Matches
0 1 coumadin 0 8 D 1
1 2 albuterol 18 27 D 2,3
2 3 albuterol sulfate 18 33 D 3
3 4 sulfate 28 35 D 3,4
Count the number of rows getting in matches
df['Count'] = df.Matches.apply(lambda x: len(x.split(',')))
Keep the df which has "Count" as 1 as these are the rows which contains the other rows:
df = df[df.Count == 1]
ID Name Start End Cat Matches Count
0 1 coumadin 0 8 D 1 1
2 3 albuterol sulfate 18 33 D 3 1
You can then remove unnecessary columns :)
Upvotes: 1