Reputation: 319
Consider the following pandas DataFrame "df" and Python list "my_list":
df =
timestamp address type
1 1 A
2 9 B
3 3 A
4 6 B
5 6 B
6 2 B
7 3 A
8 2 B
9 1 B
10 3 A
11 3 A
12 3 A
my_list =
[1, 2, 3]
Now what I want is to group the data frame by the timestamp column into 3-second bins and, within each bin, count the number of unique addresses per "type", considering only rows whose address is present in "my_list".
The expected output should look like:
timestamp A B
1 2 0 #One "B" ignored, because address=9 is not in my_list
4 0 1 #Two "B" ignored because their address is not in "my_list"
7 1 2 #Two "B" with unique addresses, and one "A"
10 1 0 #Three rows with type="A", but the address is the same
Please note that the timestamp values are originally in datetime format, so we can apply df.groupby with pd.TimeGrouper to group rows into 3-second bins.
Appreciate only Pandas (Python) based answers.
Apologies for any confusion. I tried to keep it simple.
-- Khan
Upvotes: 2
Views: 571
Reputation: 294358
Using pd.get_dummies
# label each row with the first timestamp of its 3-second bin: 1, 4, 7, 10
grps = df.timestamp.sub(1).floordiv(3).mul(3).add(1)
# flag repeated (address, type) pairs within a bin, so duplicates are not counted twice
dups = df[['address', 'type']].assign(grps=grps).duplicated().values
# flag rows whose address is in my_list
inmy = df.address.isin(my_list).values
pd.get_dummies(df.set_index(grps)[inmy & ~dups].type).sum(level=0).reset_index()
timestamp A B
0 1 2 0
1 4 0 1
2 7 1 2
3 10 1 0
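Put together as a self-contained sketch (the sample frame is rebuilt from the question; note that on pandas 2.0+ the Series.sum(level=0) shortcut has been removed, so groupby(level=0).sum() is used below instead):

```python
import pandas as pd

# rebuild the sample data from the question
df = pd.DataFrame({
    'timestamp': range(1, 13),
    'address': [1, 9, 3, 6, 6, 2, 3, 2, 1, 3, 3, 3],
    'type': list('ABABBBABBAAA'),
})
my_list = [1, 2, 3]

# first timestamp of each 3-second bin: 1, 4, 7, 10
grps = df.timestamp.sub(1).floordiv(3).mul(3).add(1)
# repeated (address, type) pairs inside a bin would inflate the count
dups = df[['address', 'type']].assign(grps=grps).duplicated().values
# rows whose address is whitelisted
inmy = df.address.isin(my_list).values

out = (pd.get_dummies(df.set_index(grps)[inmy & ~dups].type)
         .groupby(level=0).sum()   # equivalent of .sum(level=0) on modern pandas
         .reset_index())
print(out)
```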
Upvotes: 3
Reputation: 862921
Use:
#convert index to triples
df.index = df.index // 3
#filter rows by condition
df1 = df[df['address'].isin(my_list)]
#get unique numbers and reshape
df1 = df1['address'].groupby([df1.index, df1['type']]).nunique().unstack(fill_value=0)
#add timestamps
df1.index = df['timestamp'].groupby(df.index).first()
print (df1)
type A B
timestamp
1 2 0
4 0 1
7 1 2
10 1 0
Setup:
print (df)
timestamp address type
0 1 1 A
1 2 9 B
2 3 3 A
3 4 6 B
4 5 6 B
5 6 2 B
6 7 3 A
7 8 2 B
8 9 1 B
9 10 3 A
10 11 3 A
11 12 3 A
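The integer-index version above can be reproduced end-to-end as a self-contained sketch (the frame is rebuilt here to match the setup):

```python
import pandas as pd

# rebuild the frame from the setup
df = pd.DataFrame({
    'timestamp': range(1, 13),
    'address': [1, 9, 3, 6, 6, 2, 3, 2, 1, 3, 3, 3],
    'type': list('ABABBBABBAAA'),
})
my_list = [1, 2, 3]

df.index = df.index // 3                      # 0,0,0,1,1,1,... one label per 3-row bin
df1 = df[df['address'].isin(my_list)]         # keep only addresses in my_list
df1 = (df1['address']
          .groupby([df1.index, df1['type']])
          .nunique()                          # unique addresses per (bin, type)
          .unstack(fill_value=0))
df1.index = df['timestamp'].groupby(df.index).first()  # label bins by first timestamp
print(df1)
```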
Solution with datetimes
is simpler:
#sample datetimes
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='D',
origin=pd.Timestamp('2017-01-01'))
print (df)
timestamp address type
0 2017-01-02 1 A
1 2017-01-03 9 B
2 2017-01-04 3 A
3 2017-01-05 6 B
4 2017-01-06 6 B
5 2017-01-07 2 B
6 2017-01-08 3 A
7 2017-01-09 2 B
8 2017-01-10 1 B
9 2017-01-11 3 A
10 2017-01-12 3 A
11 2017-01-13 3 A
df1 = df[df['address'].isin(my_list)]
df1 = (df1.groupby([pd.Grouper(freq='3D', key='timestamp'), 'type'])['address']
          .nunique()
          .unstack(fill_value=0))
print (df1)
type A B
timestamp
2017-01-02 2 0
2017-01-05 0 1
2017-01-08 1 2
2017-01-11 1 0
And a one-line solution:
df1 = (df.query("address in @my_list")
.groupby([pd.Grouper(freq='3D', key='timestamp'), 'type'])['address']
.nunique()
.unstack(fill_value=0))
print (df1)
type A B
timestamp
2017-01-02 2 0
2017-01-05 0 1
2017-01-08 1 2
2017-01-11 1 0
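For reference, the datetime one-liner runs end-to-end like this (a sketch; the sample frame is rebuilt and mapped onto daily datetimes exactly as above, with pd.Grouper binning into 3-day windows):

```python
import pandas as pd

# rebuild the sample frame and map the integer timestamps to daily datetimes
df = pd.DataFrame({
    'timestamp': range(1, 13),
    'address': [1, 9, 3, 6, 6, 2, 3, 2, 1, 3, 3, 3],
    'type': list('ABABBBABBAAA'),
})
my_list = [1, 2, 3]
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='D',
                                 origin=pd.Timestamp('2017-01-01'))

# filter, bin into 3-day windows, count unique addresses per type
df1 = (df.query("address in @my_list")
         .groupby([pd.Grouper(freq='3D', key='timestamp'), 'type'])['address']
         .nunique()
         .unstack(fill_value=0))
print(df1)
```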
Upvotes: 5
Reputation: 30605
Here's one approach: create reference columns and use pivot_table,
i.e.
import numpy as np

# Group every three rows using timestamp % 3 and cumcount
df['temp'] = df.groupby([df['timestamp']%3]).cumcount()
# Replace the addresses absent from my_list with NaN
df['add'] = df['address'].where(df['address'].isin(my_list),np.nan)
# Take the timestamps whose mod-3 value is 1, i.e. the start of each bin
idx = df['timestamp'][df['timestamp']%3==1]
# Pivot table aggregating by number of unique values on the newly created columns; fill NaN with 0
ndf = df.pivot_table('add','type','temp',aggfunc='nunique',fill_value=0).T.set_index(idx)
Output :
type        A  B
timestamp
1           2  0
4           0  1
7           1  2
10          1  0
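As a self-contained, runnable sketch of this approach (the sample frame is rebuilt from the question; NumPy supplies the NaN used to mask filtered addresses):

```python
import numpy as np
import pandas as pd

# rebuild the sample frame from the question
df = pd.DataFrame({
    'timestamp': range(1, 13),
    'address': [1, 9, 3, 6, 6, 2, 3, 2, 1, 3, 3, 3],
    'type': list('ABABBBABBAAA'),
})
my_list = [1, 2, 3]

# rows sharing a cumcount of (timestamp % 3) fall in the same 3-row bin
df['temp'] = df.groupby(df['timestamp'] % 3).cumcount()
# mask addresses not in my_list so nunique ignores them
df['add'] = df['address'].where(df['address'].isin(my_list), np.nan)
# timestamps starting each bin: 1, 4, 7, 10
idx = df['timestamp'][df['timestamp'] % 3 == 1]
# count unique masked addresses per (type, bin), then relabel bins by start timestamp
ndf = (df.pivot_table('add', 'type', 'temp', aggfunc='nunique', fill_value=0)
         .T.set_index(idx))
print(ndf)
```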
Upvotes: 3