Muhammad Asif Khan

Reputation: 319

Group by time, then count unique entries only if they exist in a list [Pandas]

Consider the following pandas DataFrame "df" and Python list "my_list":

df =

timestamp  address    type
1           1          A
2           9          B
3           3          A
4           6          B
5           6          B
6           2          B
7           3          A
8           2          B
9           1          B
10          3          A
11          3          A
12          3          A

my_list =

[1, 2, 3]

Now what I want is to group the DataFrame by the timestamp column into 3-second bins and count the number of unique addresses for each "type", but only when the address is present in "my_list".

The expected output should look like:

timestamp   A    B
1           2    0    # one "B" ignored, because address=9 is not in my_list
4           0    1    # two "B" ignored, because their addresses are not in "my_list"
7           1    2    # two "B" with unique addresses, and one "A"
10          1    0    # three rows with type="A", but the address is the same

Please note that the timestamp values are originally in a datetime format, so we can apply df.groupby with pd.Grouper (formerly pd.TimeGrouper) to group rows into 3-second bins.
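To make the binning concrete, here is a minimal sketch that rebuilds the sample data and applies the 3-second grouping described above. The integer timestamps are converted to real datetimes with `pd.to_datetime(unit='s')` purely for illustration, and `origin='start'` (pandas >= 1.1) makes the bins start at the first timestamp rather than at midnight:

```python
import pandas as pd

# Rebuild the sample frame; the integer timestamps stand in for seconds.
df = pd.DataFrame({
    "timestamp": range(1, 13),
    "address": [1, 9, 3, 6, 6, 2, 3, 2, 1, 3, 3, 3],
    "type": list("ABABBBABBAAA"),
})
my_list = [1, 2, 3]

# Turn the integers into real datetimes so pd.Grouper (the successor of
# the deprecated pd.TimeGrouper) can bin them into 3-second windows.
df["ts"] = pd.to_datetime(df["timestamp"], unit="s")

# Filter to addresses in my_list, then count unique addresses per bin and type.
out = (df[df["address"].isin(my_list)]
         .groupby([pd.Grouper(key="ts", freq="3s", origin="start"), "type"])["address"]
         .nunique()
         .unstack(fill_value=0))
print(out)
```

This reproduces the expected counts (A: 2, 0, 1, 1 and B: 0, 1, 2, 0 across the four bins).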

I would appreciate Pandas (Python) based answers only.

Apologies for any confusion. I tried to keep it simple.

-- Khan

Upvotes: 2

Views: 571

Answers (3)

piRSquared

Reputation: 294358

Using pd.get_dummies

# label each row with the first timestamp of its 3-second bin: (t - 1) // 3 * 3 + 1
grps = df.timestamp.sub(1).floordiv(3).mul(3).add(1)
# drop repeated (address, type) pairs within the same bin so each counts once
dups = df[['address', 'type']].assign(grps=grps).duplicated().values
# keep only addresses that appear in my_list
inmy = df.address.isin(my_list).values

# sum the dummies per bin; older pandas wrote this as .sum(level=0)
pd.get_dummies(df.set_index(grps)[inmy & ~dups].type).groupby(level=0).sum().reset_index()

   timestamp  A  B
0          1  2  0
1          4  0  1
2          7  1  2
3         10  1  0
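As a quick sanity check on the arithmetic (a sketch, not part of the original answer), the `grps` expression maps each timestamp to the first timestamp of its 3-second bin:

```python
import pandas as pd

# (t - 1) // 3 * 3 + 1 sends 1..3 -> 1, 4..6 -> 4, 7..9 -> 7, 10..12 -> 10
ts = pd.Series(range(1, 13))
grps = ts.sub(1).floordiv(3).mul(3).add(1)
print(grps.tolist())
# → [1, 1, 1, 4, 4, 4, 7, 7, 7, 10, 10, 10]
```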

Upvotes: 3

jezrael

Reputation: 862921

Use:

# map every three rows to one group label (assumes the default RangeIndex)
df.index = df.index // 3
# keep only rows whose address is in my_list
df1 = df[df['address'].isin(my_list)]
# count unique addresses per group and type, then reshape
df1 = df1['address'].groupby([df1.index, df1['type']]).nunique().unstack(fill_value=0)
# label each group with its first timestamp
df1.index = df['timestamp'].groupby(df.index).first()
print (df1)
type       A  B
timestamp      
1          2  0
4          0  1
7          1  2
10         1  0
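The `df.index // 3` trick assumes the default `RangeIndex` (0 to n-1), so each block of three consecutive rows shares one label; a quick check:

```python
import pandas as pd

# Integer-divide the default index to label blocks of three rows
idx = pd.RangeIndex(12)
print((idx // 3).tolist())
# → [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
```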

Setup:

print (df)
    timestamp  address type
0           1        1    A
1           2        9    B
2           3        3    A
3           4        6    B
4           5        6    B
5           6        2    B
6           7        3    A
7           8        2    B
8           9        1    B
9          10        3    A
10         11        3    A
11         12        3    A

The solution with datetimes is simpler:

#sample datetimes 
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='D',
                   origin=pd.Timestamp('2017-01-01'))

print (df)
    timestamp  address type
0  2017-01-02        1    A
1  2017-01-03        9    B
2  2017-01-04        3    A
3  2017-01-05        6    B
4  2017-01-06        6    B
5  2017-01-07        2    B
6  2017-01-08        3    A
7  2017-01-09        2    B
8  2017-01-10        1    B
9  2017-01-11        3    A
10 2017-01-12        3    A
11 2017-01-13        3    A

df1 = df[df['address'].isin(my_list)]
df1 = (df1.groupby([pd.Grouper(freq='3D', key='timestamp'), 'type'])['address']
          .nunique()
          .unstack(fill_value=0) )
print (df1)
type        A  B
timestamp       
2017-01-02  2  0
2017-01-05  0  1
2017-01-08  1  2
2017-01-11  1  0

And one row solution:

df1 = (df.query("address in @my_list")
         .groupby([pd.Grouper(freq='3D', key='timestamp'), 'type'])['address']
         .nunique()
         .unstack(fill_value=0))
print (df1)
type        A  B
timestamp       
2017-01-02  2  0
2017-01-05  0  1
2017-01-08  1  2
2017-01-11  1  0

Upvotes: 5

Bharath M Shetty

Reputation: 30605

Here's one approach that creates reference columns and uses pivot_table, i.e.

import numpy as np

# Label each 3-second block: the cumcount within each timestamp % 3
# residue class equals (timestamp - 1) // 3
df['temp'] = df.groupby([df['timestamp'] % 3]).cumcount()

# Replace addresses absent from my_list with NaN so they are ignored
df['add'] = df['address'].where(df['address'].isin(my_list), np.nan)

# Index of timestamps whose mod value is 1 (the start of each bin)
idx = df['timestamp'][df['timestamp'] % 3 == 1]

# Pivot table counting unique addresses per block and type; fill NaN with 0
ndf = df.pivot_table('add', 'type', 'temp', aggfunc='nunique', fill_value=0).T.set_index(idx)

Output:

type       A  B
timestamp      
1          2  0
4          0  1
7          1  2
10         1  0

Upvotes: 3
