Reputation: 509
Sample data in .csv format (Unix_time is an integer Unix timestamp):

| No. | IP       | Unix_time  | Note                       |
|-----|----------|------------|----------------------------|
| 1   | 1.1.1.1  | 1563552000 | equivalent to 12:00:00 AM  |
| 2   | 1.1.1.1  | 1563552030 | equivalent to 12:00:30 AM  |
| 3   | 1.1.1.1  | 1563552100 | equivalent to 12:01:40 AM  |
| 4   | 1.1.1.1  | 1563552110 | equivalent to 12:01:50 AM  |
| 5   | 1.1.1.1  | 1563552180 | equivalent to 12:03:00 AM  |
| 6   | 1.2.3.10 | 1563552120 |                            |
Here's my current working code, using the pandas groupby() and get_group() functions:
import pandas as pd
data = pd.read_csv(some_path, header=0)
root = data.groupby('IP')
for a in root.groups.keys():
    t = root.get_group(a)['Unix_time']
    print(f'{a} has {t.count()} records')
You will see the results below:
1.1.1.1 has 5 records
1.2.3.10 has 1 records
Now I want to improve on the above code.
For the same IP value (e.g., 1.1.1.1), I want to make further sub-groups based on a maximum time interval (e.g., 60 seconds), and count the number of rows in each sub-group. For example, in the above sample data:
Starting from row 1: row 2's Unix_time is within 60 seconds of it, but row 3's is beyond 60 seconds.
Thus rows 1-2 form one sub-group, rows 3-4 form a separate sub-group, and row 5 is its own sub-group. In other words, group '1.1.1.1' now has 3 sub-groups. The result should be:
1.1.1.1 start time 1563552000 has 2 records within 60 secs
1.1.1.1 start time 1563552100 has 2 records within 60 secs
1.1.1.1 start time 1563552180 has 1 record within 60 secs
1.2.3.10 start time 1563552120 has 1 record within 60 secs
How can I do this?
Upvotes: 1
Views: 326
Reputation: 153460
You can use pd.Grouper (where df is the DataFrame you read in above):
# convert the integer Unix time to a datetime so pd.Grouper can bin it
df['datetime'] = pd.to_datetime(df['Unix_time'], unit='s')
# group by IP, then by fixed 60-second bins of the datetime column
for n, g in df.groupby(['IP', pd.Grouper(freq='60s', key='datetime')]):
    print(f'{n[0]} start time {g.iloc[0, g.columns.get_loc("Unix_time")]} has {len(g)} records within 60 secs')
Output:
1.1.1.1 start time 1563552000 has 2 records within 60 secs
1.1.1.1 start time 1563552100 has 2 records within 60 secs
1.1.1.1 start time 1563552180 has 1 records within 60 secs
1.2.3.10 start time 1563552120 has 1 records within 60 secs
Alternatively, you can get the same fixed 60-second bins without a datetime conversion by grouping on the integer division of Unix_time:
root = df.groupby(['IP', df['Unix_time'] // 60])
for n, g in root:
    print(f'{n[0]} start time {g.iloc[0, g.columns.get_loc("Unix_time")]} has {len(g)} records within 60 secs')
Output:
1.1.1.1 start time 1563552000 has 2 records within 60 secs
1.1.1.1 start time 1563552100 has 2 records within 60 secs
1.1.1.1 start time 1563552180 has 1 records within 60 secs
1.2.3.10 start time 1563552120 has 1 records within 60 secs
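Note that both snippets bin on fixed 60-second boundaries (multiples of 60 seconds from the epoch), which happens to reproduce your expected output on this sample. If you literally want a new sub-group to start whenever a row falls more than 60 seconds after the previous row for the same IP, a gap-based grouping is a closer fit. A minimal sketch, assuming df is the same DataFrame as above; the 'sub' helper column is one I'm introducing for illustration:
df = df.sort_values(['IP', 'Unix_time'])
# flag rows whose gap to the previous row of the same IP exceeds 60 seconds,
# then cumulative-sum the flags to get a sub-group id ('sub' is a helper column)
df['sub'] = (df.groupby('IP')['Unix_time'].diff() > 60).cumsum()
for (ip, _), g in df.groupby(['IP', 'sub']):
    print(f'{ip} start time {g["Unix_time"].iloc[0]} has {len(g)} records within 60 secs')
On this sample it prints the same four lines, but on data where a burst of hits straddles a clock-aligned 60-second boundary the two methods can split the rows differently.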
Upvotes: 3