Reputation: 35
I have a massive dataframe. The dataframe has column patient.drug. This column contains list of dictionaries as its elements. I want to filter out all the rows that contain 'NIFEDIPINE' word in patient.drug column.
The dataframe is very large. Here is a sample of it.
patient.drug
0 [{'drugcharacterization': '1', 'medicinalproduct': 'PANDOL'}]
1 [{'drugcharacterization': '2', 'medicinalproduct': 'NIFEDIPINE'}]
2 [{'drugcharacterization': '3', 'medicinalproduct': 'SIMVASTATIN'}]
3 [{'drugcharacterization': '4', 'medicinalproduct': 'NIFEDIPINE'}]
so far, I have tried
df[df['patient.drug'].str.contains('NIFEDIPINE')]
but it is giving me an error.
raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Float64Index([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,\n ...\n nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],\n dtype='float64', length=12000)] are in the [columns]"
I have also tried using in
operator and iterating over rows.
lst=[]
for i in range(len(df)):
if 'NIFEDIPINE' in df.loc[i, "patirnt.drug"]:
lst.append(i)
print(lst)
Which is also causing an error. What should I do to get it right?
Upvotes: 3
Views: 1140
Reputation: 23217
Suppose you have this layout of column:
Search string 'NIFEDIPINE' found on the 2nd and 4th entries:
data = {'patient.drug':
[[{'drugcharacterization': '1', 'medicinalproduct': 'PANDOL'}],
[{'drugcharacterization': '2', 'medicinalproduct': 'NIFEDIPINE'}],
[{'drugcharacterization': '3', 'medicinalproduct': 'SIMVASTATIN'}],
[{'drugcharacterization': '4', 'medicinalproduct': 'NIFEDIPINE'}],
]
}
df = pd.DataFrame(data)
patient.drug
0 [{'drugcharacterization': '1', 'medicinalproduct': 'PANDOL'}]
1 [{'drugcharacterization': '2', 'medicinalproduct': 'NIFEDIPINE'}] <=== keyword here
2 [{'drugcharacterization': '3', 'medicinalproduct': 'SIMVASTATIN'}]
3 [{'drugcharacterization': '4', 'medicinalproduct': 'NIFEDIPINE'}] <=== keyword here
(Layout digged out from your previous questions)
Solution:
[Updated for 1) support more than one dict in a list and 2) partial string match].
Use: .loc
+ .explode()
+ .apply()
:
keyword = 'NIFEDIPINE'
df.loc[df['patient.drug'].explode().apply(lambda d: keyword in ' '.join(d.values())).any(level=0)]
Result:
Rows with keyword string 'NIFEDIPINE' correctly extracted and displayed:
patient.drug
1 [{'drugcharacterization': '2', 'medicinalproduct': 'NIFEDIPINE'}]
3 [{'drugcharacterization': '4', 'medicinalproduct': 'NIFEDIPINE'}]
Upvotes: 1
Reputation: 8816
After reproducing your Data,
>>> df
patient.drug
0 [{'drugcharacterization': '1', 'medicinalproduct': 'PANDOL'}]
1 [{'drugcharacterization': '2', 'medicinalproduct': 'NIFEDIPINE'}]
2 [{'drugcharacterization': '3', 'medicinalproduct': 'SIMVASTATIN'}]
3 [{'drugcharacterization': '3', 'medicinalproduct': 'SIMVASTATIN'}]
4 [{'drugcharacterization': '4', 'medicinalproduct': 'NIFEDIPINE'}]
While using Your code:
>>> df[df['patient.drug'].str.contains('NIFEDIPINE')]
Error:
raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Float64Index([nan, nan, nan, nan, nan], dtype='float64')] are in the [columns]"
>>> df[df['patient.drug'].astype('str').str.contains('NIFEDIPINE')]
patient.drug
1 [{'drugcharacterization': '2', 'medicinalproduct': 'NIFEDIPINE'}]
4 [{'drugcharacterization': '4', 'medicinalproduct': 'NIFEDIPINE'}]
Note:
This is raising issue due to indexer
check in the pandas indexer.py
section, which is as follows:
--> pandas/core/indexing.py
# Count missing values:
missing_mask = indexer < 0
missing = (missing_mask).sum()
if missing:
if missing == len(indexer):
axis_name = self.obj._get_axis_name(axis)
raise KeyError(f"None of [{key}] are in the [{axis_name}]")
# We (temporarily) allow for some missing keys with .loc, except in
# some cases (e.g. setting) in which "raise_missing" will be False
Upvotes: 1
Reputation: 83
You can use isin
drug_name = ['NIFEDIPINE']
df_NIFEDIPINE = df[df['patient.drug'].isin(drug_name)].reset_index()
Upvotes: 0
Reputation: 373
It was unclear if your column's each element is list of dictionary or just dictionary. Anyways, I have included solution for both cases.
import pandas as pd
a = [1, 2, 3, 4, 6]
b = [{'a':'A'}, {'b':'B'}, {'c':'C'}, {'d':'D'}, {'e':'E'}]
df = pd.DataFrame({'A': a, 'B': b})
df[df['B'].apply(lambda x: 'a' in x)]
This gives output as:
A B
1 {'a': 'A'}
In your case
df[df['B'].apply(lambda x: 'NIFEDIPINE' in x)]
import pandas as pd
a = [1, 2, 3, 4, 6]
b = [[{'a':'A'}], [{'b':'B'}], [{'c':'C'}], [{'d':'D'}], [{'e':'E'}]]
df = pd.DataFrame({'A': a, 'B': b})
def check(key, dict_list):
for map in dict_list:
if key in map:
return True
return False
df[df['B'].apply(lambda x: check('a', x))]
Upvotes: 0