Reputation: 1330
I have an SQL query that I want to implement on a pandas DataFrame. The SQL filters out a t_id when act_dt is null for more than 90 percent of its rows of type HIST. I have a DataFrame with all the columns and need to implement this logic in pandas.
Below is the SQL query:
SELECT *
FROM
(
    SELECT
        t_id,
        s_id,
        t_name,
        type,
        act_dt
    FROM tblstg
) t
WHERE t.t_id NOT IN
(
    SELECT t_id FROM
    (
        SELECT
            t_id,
            CASE WHEN (CAST(SUM(CASE WHEN act_dt IS NULL THEN 1 ELSE 0 END) AS FLOAT) / COUNT(*) * 100) > 90
                 THEN 1 ELSE 0 END AS removal_flg
        FROM tblstg
        WHERE type = 'HIST'
        GROUP BY t_id
    ) st
    WHERE st.removal_flg = 1
)
Dataframe:
I have a pandas DataFrame like:
t_id s_id t_name type act_dt
T1 china android HIST jan
T1 mumbai android HIST feb
T1 dubai apple EXT
T2 japan nokia HIST
T3 japan apple HIST jan
T3 koria HIST
T3 japan1 apple HIST
T3 japan2 apple HIST
T3 japan3 apple HIST
T3 japan4 apple HIST
T3 japan5 apple HIST
T3 japan6 apple HIST
T3 japan7 apple HIST
T3 japan8 apple HIST
T3 dubai nokia EXT
final result:
t_id s_id t_name type act_dt
T1 china android HIST jan
T1 mumbai android HIST feb
T1 dubai apple EXT
i.e. drop T2 and T3, because for those t_ids at least 90% of the records of type HIST (across all s_id) have a null act_dt.
I have written the code below to identify the t_ids for which 90% of the act_dt values are null, so I can drop them from the main DataFrame raw_data, but it gives an error from the 3rd line onward. How can I get the list of t_ids that meet the criteria?
raw_data['filter_dt'] = raw_data['act_dt'].isnull().astype(int)
filterrecords = raw_data[raw_data.type == 'HIST'].groupby("t_id").filter_dt.sum()
countoftids = raw_data[raw_data.type == 'HIST'].groupby("t_id").count()  # returns a DataFrame, not a Series
finalflg = filterrecords / countoftids  # Series / DataFrame aligns on column labels, so this is all NaN
finaltids = raw_data['t_id'][finalflg > 0.90]  # fails: a Series cannot be indexed by a boolean DataFrame
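For comparison, here is a corrected sketch of the same aggregation approach. It keeps the denominator as a plain Series of row counts (`size()` instead of `count()`) so the division aligns on t_id, and it uses `>=` rather than `>`, since the expected output drops T3, whose HIST rows are exactly 90% null. The inline sample data is an abbreviated stand-in for the table above:

```python
import pandas as pd

# Abbreviated sample data mirroring the example table (only the columns used here).
raw_data = pd.DataFrame({
    't_id':   ['T1', 'T1', 'T1', 'T2'] + ['T3'] * 11,
    'type':   ['HIST', 'HIST', 'EXT', 'HIST'] + ['HIST'] * 10 + ['EXT'],
    'act_dt': ['jan', 'feb', None, None, 'jan'] + [None] * 10,
})

hist = raw_data[raw_data.type == 'HIST']
null_counts = hist['act_dt'].isnull().groupby(hist['t_id']).sum()
row_counts = hist.groupby('t_id').size()     # Series of HIST row counts per t_id
null_frac = null_counts / row_counts         # Series / Series aligns on the t_id index
finaltids = null_frac[null_frac >= 0.90].index.tolist()
print(finaltids)
```

With the sample data this flags T2 (100% null) and T3 (90% null) for removal.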
Upvotes: 0
Views: 231
Reputation: 153460
Let's try this: keep a t_id only when more than 90% of its HIST rows have a non-null act_dt:
df.groupby('t_id').filter(
    lambda x: (x[x['type'] == "HIST"].act_dt.count()
               / x[x['type'] == "HIST"].act_dt.shape[0]) > .9
)
Output:
t_id s_id t_name type act_dt
0 T1 china android HIST jan
1 T1 mumbai android HIST feb
2 T1 dubai apple EXT NaN
To see the list of unique t_id values retained:
print(df.groupby('t_id')
        .filter(lambda x: (x[x['type'] == "HIST"].act_dt.count()
                           / x[x['type'] == "HIST"].act_dt.shape[0]) > .9)['t_id']
        .unique())
Output:
['T1']
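An alternative that mirrors the SQL's NOT IN more literally is to compute the null fraction per t_id once and then drop the flagged ids with `isin`, which avoids calling the lambda once per group. As above, a `>=` threshold is assumed from the expected output, and the inline data is a shortened stand-in for the example table:

```python
import pandas as pd

df = pd.DataFrame({
    't_id':   ['T1', 'T1', 'T1', 'T2'] + ['T3'] * 11,
    'type':   ['HIST', 'HIST', 'EXT', 'HIST'] + ['HIST'] * 10 + ['EXT'],
    'act_dt': ['jan', 'feb', None, None, 'jan'] + [None] * 10,
})

# Null fraction of act_dt per t_id, computed over HIST rows only.
hist = df[df['type'] == 'HIST']
null_frac = hist['act_dt'].isnull().groupby(hist['t_id']).mean()

# Emulate the SQL's NOT IN: drop every row belonging to a flagged t_id.
bad_ids = null_frac[null_frac >= 0.90].index
result = df[~df['t_id'].isin(bad_ids)]
print(result)
```

Only T1's rows survive, matching the expected final result.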
Upvotes: 2