Reputation: 887
Ok so I have a first dataframe df1:
|timestamp |ip |
|2022-01-06 11:58:53+00:00|1.1.1.5. |
|2022-01-08 03:56:35+00:00|10.10.10.24|
|2022-01-09 22:29:30+00:00|3.3.3.89. |
|2022-03-08 22:37:52+00:00|8.8.8.88. |
And a second dataframe, df2:
|timestamp |other|
|2022-01-07 22:08:59+00:00|other|
|2022-01-07 23:08:59+00:00|other|
|2022-01-09 17:04:09+00:00|other|
|2022-03-05 17:04:09+00:00|other|
And I would like to count how many rows there is in df2 depending on the 2 consecutive timestamps in df1, meaning:
|timestamp |ip |count|
|2022-01-06 11:58:53+00:00|1.1.1.5 |NaN |
|2022-01-08 03:56:35+00:00|10.10.10.24|2 |
|2022-01-09 22:29:30+00:00|3.3.3.89 |1 |
|2022-03-08 22:37:52+00:00|8.8.8.88 |1 |
What I tried is to first create another column in df1 with the previous timestamp with this:
df1 = df1.assign(timestamp_b4=df1.timestamp.shift(1)).fillna({'timestamp_b4': df1.timestamp})
which gives me:
|timestamp |ip |timestamp_b4 |
|2022-01-06 11:58:53+00:00|1.1.1.5 |2022-03-08 22:37:52+00:00|
|2022-01-08 03:56:35+00:00|10.10.10.24|2022-01-06 11:58:53+00:00|
|2022-01-09 22:29:30+00:00|3.3.3.89 |2022-01-08 03:56:35+00:00|
|2022-03-08 22:37:52+00:00|8.8.8.88 |2022-01-09 22:29:30+00:00|
and then do some sort of
s = (df2[df2['timestamp'].between(df1['timestamp'], df1['timestamp_b4'])].size())
But unfortunately it does not work since pandas requires comparing identically-labeled objects.
Is there a good pandas/pythonic way to do that?
Thanks
Upvotes: 3
Views: 129
Reputation: 887
Ok so in the end, here's what I did. I used @Drakax answer.
I created a column with the previous timestamps
df1 = df1.assign(previous_deconnection=df1.timestamp.shift(1)).fillna({'previous_deconnection': df1.timestamp})
then I set the first row value,
df1['previous_deconnection'].iloc[0]=pd.to_datetime('2022-01-01 00:00:00+00:00')
then I apply this function to each row of df1
def time_compare(a,b):
return len(b[((b['timestamp'] >= a['previous_deconnection']) & (b['timestamp'] <= a['timestamp']))])
df1['Count'] = df1.apply(lambda row: time_compare(row, df2), axis=1)
Upvotes: 0
Reputation: 3720
Here is one approach. Note that columns from df1 are retained in the final output df:
Starting with this df1 that has an additional column:
timestamp ip another_col
0 2022-01-06 11:58:53+00:00 1.1.1.5. val_1
1 2022-01-08 03:56:35+00:00 10.10.10.24 val_2
2 2022-01-09 22:29:30+00:00 3.3.3.89. val_3
3 2022-03-08 22:37:52+00:00 8.8.8.88. val_4
df1.merge(df2, on='timestamp', how='outer').sort_values('timestamp') \
.assign(c1=df1.loc[~df1['ip'].isna()]['ip'], c2=lambda x: x['c1'].bfill() ) \
.assign(count=lambda x: x.groupby('c2').apply('count').reset_index(drop=True)['timestamp']-1) \
.drop(['other','c1','c2'], axis=1).dropna().astype({'count': 'int32'})
timestamp ip another_col count
0 2022-01-06 11:58:53+00:00 1.1.1.5. val_1 0
1 2022-01-08 03:56:35+00:00 10.10.10.24 val_2 2
2 2022-01-09 22:29:30+00:00 3.3.3.89. val_3 1
3 2022-03-08 22:37:52+00:00 8.8.8.88. val_4 1
Note that another_col
is retained in the output.
This approach merges then sorts by timestamp then creates another column - c2 - that is used to copy the df1 timestamp and then backfill it against the df2 timestamps. From there the instances are grouped by df1 timestamp (reflected in the c2 column) and counted. In other words, the backfill of the df1 timestamp allows it to be used as a grouping key for counting the preceding df2 timestamps. After that the df is trimmed back down to match the output requirements.
Also note that with this approach the dataframes need to be indexed 0-n like they currently are in my example.
Upvotes: 1
Reputation: 1493
def time_compare(df1,df2):
return [np.sum((df1['timestamp'].values[i-1] < df2['timestamp'].values) & (df1['timestamp'].values[i] > df2['timestamp'].values)) for i in range(len(df1.timestamp))]
df2.join(pd.Series(time_compare(df1,df2), name='Count'))
Weird I can't post the dataframe output as usual:
index | timestamp | other | Count |
---|---|---|---|
0 | 2022-01-07 22:08:5900:00 | other | 0 |
1 | 2022-01-07 23:08:5900:00 | other | 2 |
2 | 2022-01-09 17:04:0900:00 | other | 1 |
3 | 2022-03-05 17:04:0900:00 | other | 1 |
Upvotes: 1
Reputation: 303
Try this, it's an example of what you can do to find the solution
import pandas as pd
table1 = {
'timestamp':['2022-01-06 11:58:53+00:00','2022-01-08 03:56:35+00:00',
'2022-01-09 22:29:30+00:00','2022-03-08 22:37:52+00:00'],
'other':['other','other','other','other']
}
df1 = pd.DataFrame(table1)
table2 = {
'timestamp':['2022-01-07 23:08:59+00:00','2022-01-07 22:08:59+00:00',
'2022-03-05 17:04:09+00:00','2022-01-09 17:04:09+00:00'],
'ip':['1.1.1.5.','10.10.10.24','3.3.3.89.','8.8.8.88.']
}
df2 = pd.DataFrame(table2)
print(f'\n\n-------------df1-----------\n\n')
print(df2)
print(f'\n\n-------------df2-----------\n\n')
print(df1)
listdf1 = df1['timestamp'].values.tolist()
def func(line):
cont = df1.loc[df1['timestamp'].str.contains(line[0][:7], case = False)]
temp = line.name - 1
if temp == -1:
temp = 0
try :
cont = [cont['timestamp'].iloc[temp],line[0]]
except:
cont = [line[0],line[0]]
cont2 = df2['timestamp'].loc[df2['timestamp'].str.contains(line[0][:7], case = False)]
repetitions = 0
for x in cont2:
if int(x[8:10]) >= int(cont[0][8:10]) and int(x[8:10]) <= int(cont[1][8:10]) and int(x[8:10]) <= int(line[0][8:10]):
repetitions += 1
return repetitions
print(f'\n\n-------------BREAK-----------\n\n')
df1['count'] = df1.apply(func, axis = 1)
print(df1)
Upvotes: 0