Reputation: 2117
I am working with 2 dataframes. The first has incomplete information. The second dataframe has information with a time range of first seen and last seen. I am trying to use the source address and the time range from df2 to fill in the sourcehostname and sourceusername where datetime from df1 falls into that time range.
df1
sourceaddress sourcehostname sourceusername endtime datetime
0 10.0.0.59 computer1 NaN 1564666638000 2019-08-01 09:37:18
1 10.0.0.59 NaN NaN 1564666640000 2019-08-01 09:37:20
2 10.0.0.59 NaN NaN 1564666642000 2019-08-01 09:37:22
3 10.0.0.59 NaN NaN 1564666643000 2019-08-01 09:37:23
4 10.0.0.59 NaN NaN 1564666643000 2019-08-01 09:37:23
5 10.0.0.59 NaN NaN 1564666645000 2019-08-01 09:37:25
6 10.0.0.59 computer1 NaN 1564666646000 2019-08-01 09:37:26
7 10.0.0.59 NaN NaN 1564666646000 2019-08-01 09:37:26
8 10.0.0.59 computer1 NaN 1564666649000 2019-08-01 09:37:29
9 10.0.0.59 computer1 NaN 1564666650000 2019-08-01 09:37:30
10 10.0.0.59 NaN NaN 1564666850000 2019-08-01 09:40:50
...
43196 10.0.0.187 computer2 NaN 1564718395000 2019-08-01 23:59:55
43197 10.0.0.187 computer2 user1 1564718397000 2019-08-01 23:59:57
43198 10.0.0.187 computer2 NaN 1564718397000 2019-08-01 23:59:57
43199 10.0.0.187 computer2 user1 1564718398000 2019-08-01 23:59:58
43200 10.0.0.187 NaN NaN 1564718398000 2019-08-01 23:59:58
43201 10.0.0.187 computer2 user1 1564718398000 2019-08-01 23:59:58
df2
sourceaddress sourcehostname sourceusername firstseen lastseen
0 10.0.0.59 computer1 user1 2019-08-01 09:37:59 2019-08-01 09:46:08
1 10.0.0.187 computer2 user1 2019-08-01 00:00:03 2019-08-01 23:59:58
Desired Result:
df3
sourceaddress sourcehostname sourceusername endtime datetime
0 10.0.0.59 computer1 NaN 1564666638000 2019-08-01 09:37:18
1 10.0.0.59 NaN NaN 1564666640000 2019-08-01 09:37:20
2 10.0.0.59 NaN NaN 1564666642000 2019-08-01 09:37:22
3 10.0.0.59 NaN NaN 1564666643000 2019-08-01 09:37:23
4 10.0.0.59 NaN NaN 1564666643000 2019-08-01 09:37:23
5 10.0.0.59 NaN NaN 1564666645000 2019-08-01 09:37:25
6 10.0.0.59 computer1 NaN 1564666646000 2019-08-01 09:37:26
7 10.0.0.59 NaN NaN 1564666646000 2019-08-01 09:37:26
8 10.0.0.59 computer1 NaN 1564666649000 2019-08-01 09:37:29
9 10.0.0.59 computer1 NaN 1564666650000 2019-08-01 09:37:30
10 10.0.0.59 computer1 user1 1564668650000 2019-08-01 10:10:50
...
43196 10.0.0.187 computer2 user1 1564718395000 2019-08-01 23:59:55
43197 10.0.0.187 computer2 user1 1564718397000 2019-08-01 23:59:57
43198 10.0.0.187 computer2 user1 1564718397000 2019-08-01 23:59:57
43199 10.0.0.187 computer2 user1 1564718398000 2019-08-01 23:59:58
43200 10.0.0.187 computer2 user1 1564718398000 2019-08-01 23:59:58
43201 10.0.0.187 computer2 user1 1564718398000 2019-08-01 23:59:58
**following the example from below:
df3[-5:]
sourceaddress sourcehostname sourceusername endtime datetime firstseen lastseen
43197 10.99.0.187 computer2 user1 1564718397000 2019-08-01 23:59:57 2019-08-01 00:00:03 2019-08-01 23:59:58
43198 10.99.0.187 computer2 NaN 1564718397000 2019-08-01 23:59:57 2019-08-01 00:00:03 2019-08-01 23:59:58
43199 10.99.0.187 computer2 NaN 1564718398000 2019-08-01 23:59:58 2019-08-01 00:00:03 2019-08-01 23:59:58
43200 10.99.0.187 computer2 user1 1564718398000 2019-08-01 23:59:58 2019-08-01 00:00:03 2019-08-01 23:59:58
43201 10.99.0.187 computer2 user1 1564718398000 2019-08-01 23:59:58 2019-08-01 00:00:03 2019-08-01 23:59:58
Upvotes: 1
Views: 43
Reputation: 150785
It looks like a merge
problem:
df3 = df1.merge(df2,
on='sourceaddress', how='left',
suffixes=['','_df2']
)
# mark the valid time:
mask = df3['datetime'].ge(df3['firstseen']) & df3['datetime'].lt(df3['lastseen'])
# update the info
df3.loc[mask, 'sourcehostname'] = df3.loc[mask, 'sourcehostname_df2']
df3.loc[mask, 'sourceusername'] = df3.loc[mask, 'sourceusername_df2']
And then you can drop sourcehostname_df2
and sourceusername_df2
.
Upvotes: 1