Reputation: 1330
I have an SQL query that I want to implement on a pandas DataFrame. The SQL filters out a t_id when act_dt is null for more than 90 percent of its rows of type HIST. I have a DataFrame with all the columns and need to implement this logic in pandas.
Below is the SQL query:
SELECT *
FROM
(
    SELECT
        t_id,
        s_id,
        t_name,
        type,
        act_dt
    FROM tblstg
) t
WHERE t.t_id NOT IN
(
    SELECT t_id FROM
    (
        SELECT
            t_id,
            CASE WHEN (CAST(SUM(CASE WHEN act_dt IS NULL THEN 1 ELSE 0 END) AS FLOAT) / COUNT(*) * 100) > 90
                 THEN 1 ELSE 0 END AS removal_flg
        FROM tblstg
        WHERE type = 'HIST'
        GROUP BY t_id
    ) st
    WHERE st.removal_flg = 1
)
Dataframe:
I have a pandas DataFrame like:
t_id s_id t_name type act_dt
T1 china android HIST jan
T1 mumbai android HIST feb
T1 dubai apple EXT
T2 japan nokia HIST
T3 japan apple HIST jan
T3 koria HIST
T3 japan1 apple HIST
T3 japan2 apple HIST
T3 japan3 apple HIST
T3 japan4 apple HIST
T3 japan5 apple HIST
T3 japan6 apple HIST
T3 japan7 apple HIST
T3 japan8 apple HIST
T3 dubai nokia EXT
final result:
t_id s_id t_name type act_dt
T1 china android HIST jan
T1 mumbai android HIST feb
T1 dubai apple EXT
i.e. drop T2 and T3, because for those t_ids at least 90% of the records of type HIST (across all s_id) have a null act_dt.
I have written the code below to identify the t_ids for which 90% of the act_dt values are null, so I can drop them from the main DataFrame raw_data, but it gives an error from the 3rd line onward. How can I get the list of t_ids that meet the criteria?
raw_data['filter_dt'] = raw_data['act_dt'].isnull().astype(int)
filterrecords = raw_data[raw_data.type == 'HIST'].groupby("t_id").filter_dt.sum()
countoftids = raw_data[raw_data.type == 'HIST'].groupby("t_id").count()  # returns a DataFrame, not a Series
finalflg = filterrecords / countoftids  # Series / DataFrame aligns on column labels, so this is all NaN
finaltids = raw_data['t_id'][finalflg > 0.90]  # fails: a Series cannot be indexed by a boolean DataFrame
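For comparison, here is a corrected sketch of the same aggregation approach. It keeps the denominator as a plain Series of row counts (`size()` instead of `count()`) so the division aligns on t_id, and it uses `>=` rather than `>`, since the expected output drops T3, whose HIST rows are exactly 90% null. The inline sample data is an abbreviated stand-in for the table above:

```python
import pandas as pd

# Abbreviated sample data mirroring the example table (only the columns used here).
raw_data = pd.DataFrame({
    't_id':   ['T1', 'T1', 'T1', 'T2'] + ['T3'] * 11,
    'type':   ['HIST', 'HIST', 'EXT', 'HIST'] + ['HIST'] * 10 + ['EXT'],
    'act_dt': ['jan', 'feb', None, None, 'jan'] + [None] * 10,
})

hist = raw_data[raw_data.type == 'HIST']
null_counts = hist['act_dt'].isnull().groupby(hist['t_id']).sum()
row_counts = hist.groupby('t_id').size()     # Series of HIST row counts per t_id
null_frac = null_counts / row_counts         # Series / Series aligns on the t_id index
finaltids = null_frac[null_frac >= 0.90].index.tolist()
print(finaltids)
```

With the sample data this flags T2 (100% null) and T3 (90% null) for removal.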
Upvotes: 0
Views: 231
Reputation: 153460
Let's try this: keep a t_id only when more than 90% of its HIST rows have a non-null act_dt:
df.groupby('t_id').filter(
    lambda x: (x[x['type'] == "HIST"].act_dt.count()
               / x[x['type'] == "HIST"].act_dt.shape[0]) > .9
)
Output:
t_id s_id t_name type act_dt
0 T1 china android HIST jan
1 T1 mumbai android HIST feb
2 T1 dubai apple EXT NaN
To see the list of unique t_id values retained:
print(df.groupby('t_id')
        .filter(lambda x: (x[x['type'] == "HIST"].act_dt.count()
                           / x[x['type'] == "HIST"].act_dt.shape[0]) > .9)['t_id']
        .unique())
Output:
['T1']
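An alternative that mirrors the SQL's NOT IN more literally is to compute the null fraction per t_id once and then drop the flagged ids with `isin`, which avoids calling the lambda once per group. As above, a `>=` threshold is assumed from the expected output, and the inline data is a shortened stand-in for the example table:

```python
import pandas as pd

df = pd.DataFrame({
    't_id':   ['T1', 'T1', 'T1', 'T2'] + ['T3'] * 11,
    'type':   ['HIST', 'HIST', 'EXT', 'HIST'] + ['HIST'] * 10 + ['EXT'],
    'act_dt': ['jan', 'feb', None, None, 'jan'] + [None] * 10,
})

# Null fraction of act_dt per t_id, computed over HIST rows only.
hist = df[df['type'] == 'HIST']
null_frac = hist['act_dt'].isnull().groupby(hist['t_id']).mean()

# Emulate the SQL's NOT IN: drop every row belonging to a flagged t_id.
bad_ids = null_frac[null_frac >= 0.90].index
result = df[~df['t_id'].isin(bad_ids)]
print(result)
```

Only T1's rows survive, matching the expected final result.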
Upvotes: 2