sectechguy
sectechguy

Reputation: 2117

Use IP column of 2 dataframes and date range to populate df1 dataframe with data from df2

I am working with 2 dataframes. The first has incomplete information. The second dataframe has information with a time range of first seen and last seen. I am trying to use the source address and the time range from df2 to fill in the sourcehostname and sourceusername where datetime from df1 falls into that time range.

df1
        sourceaddress   sourcehostname  sourceusername  endtime         datetime
0       10.0.0.59       computer1       NaN             1564666638000   2019-08-01 09:37:18
1       10.0.0.59       NaN             NaN             1564666640000   2019-08-01 09:37:20
2       10.0.0.59       NaN             NaN             1564666642000   2019-08-01 09:37:22
3       10.0.0.59       NaN             NaN             1564666643000   2019-08-01 09:37:23
4       10.0.0.59       NaN             NaN             1564666643000   2019-08-01 09:37:23
5       10.0.0.59       NaN             NaN             1564666645000   2019-08-01 09:37:25
6       10.0.0.59       computer1       NaN             1564666646000   2019-08-01 09:37:26
7       10.0.0.59       NaN             NaN             1564666646000   2019-08-01 09:37:26
8       10.0.0.59       computer1       NaN             1564666649000   2019-08-01 09:37:29
9       10.0.0.59       computer1       NaN             1564666650000   2019-08-01 09:37:30
10      10.0.0.59       NaN             NaN             1564666850000   2019-08-01 09:40:50
...
43196   10.0.0.187      computer2       NaN             1564718395000   2019-08-01 23:59:55
43197   10.0.0.187      computer2       user1           1564718397000   2019-08-01 23:59:57
43198   10.0.0.187      computer2       NaN             1564718397000   2019-08-01 23:59:57
43199   10.0.0.187      computer2       user1           1564718398000   2019-08-01 23:59:58
43200   10.0.0.187      NaN             NaN             1564718398000   2019-08-01 23:59:58
43201   10.0.0.187      computer2       user1           1564718398000   2019-08-01 23:59:58

df2
        sourceaddress   sourcehostname  sourceusername  firstseen             lastseen
0       10.0.0.59       computer1       user1           2019-08-01 09:37:59   2019-08-01 09:46:08
1       10.0.0.187      computer2       user1           2019-08-01 00:00:03   2019-08-01 23:59:58

Desired Result:

df3
        sourceaddress   sourcehostname  sourceusername  endtime         datetime
0       10.0.0.59       computer1       NaN             1564666638000   2019-08-01 09:37:18
1       10.0.0.59       NaN             NaN             1564666640000   2019-08-01 09:37:20
2       10.0.0.59       NaN             NaN             1564666642000   2019-08-01 09:37:22
3       10.0.0.59       NaN             NaN             1564666643000   2019-08-01 09:37:23
4       10.0.0.59       NaN             NaN             1564666643000   2019-08-01 09:37:23
5       10.0.0.59       NaN             NaN             1564666645000   2019-08-01 09:37:25
6       10.0.0.59       computer1       NaN             1564666646000   2019-08-01 09:37:26
7       10.0.0.59       NaN             NaN             1564666646000   2019-08-01 09:37:26
8       10.0.0.59       computer1       NaN             1564666649000   2019-08-01 09:37:29
9       10.0.0.59       computer1       NaN             1564666650000   2019-08-01 09:37:30
10      10.0.0.59       computer1       user1           1564668650000   2019-08-01 10:10:50
...
43196   10.0.0.187      computer2       user1           1564718395000   2019-08-01 23:59:55
43197   10.0.0.187      computer2       user1           1564718397000   2019-08-01 23:59:57
43198   10.0.0.187      computer2       user1           1564718397000   2019-08-01 23:59:57
43199   10.0.0.187      computer2       user1           1564718398000   2019-08-01 23:59:58
43200   10.0.0.187      computer2       user1           1564718398000   2019-08-01 23:59:58
43201   10.0.0.187      computer2       user1           1564718398000   2019-08-01 23:59:58

**following the example from below:

df3[-5:]
        sourceaddress   sourcehostname  sourceusername  endtime          datetime               firstseen              lastseen
43197   10.99.0.187     computer2       user1           1564718397000    2019-08-01 23:59:57    2019-08-01 00:00:03    2019-08-01 23:59:58
43198   10.99.0.187     computer2       NaN             1564718397000    2019-08-01 23:59:57    2019-08-01 00:00:03    2019-08-01 23:59:58
43199   10.99.0.187     computer2       NaN             1564718398000    2019-08-01 23:59:58    2019-08-01 00:00:03    2019-08-01 23:59:58
43200   10.99.0.187     computer2       user1           1564718398000    2019-08-01 23:59:58    2019-08-01 00:00:03    2019-08-01 23:59:58
43201   10.99.0.187     computer2       user1           1564718398000    2019-08-01 23:59:58    2019-08-01 00:00:03    2019-08-01 23:59:58

Upvotes: 1

Views: 43

Answers (1)

Quang Hoang
Quang Hoang

Reputation: 150785

It looks like a merge problem:

df3 = df1.merge(df2,
                on='sourceaddress', how='left',
                suffixes=['','_df2']
               )
# mark the valid time:
mask = df3['datetime'].ge(df3['firstseen']) & df3['datetime'].lt(df3['lastseen'])

# update the info
df3.loc[mask, 'sourcehostname'] = df3.loc[mask, 'sourcehostname_df2']
df3.loc[mask, 'sourceusername'] = df3.loc[mask, 'sourceusername_df2']

And then you can drop sourcehostname_df2 and sourceusername_df2.

Upvotes: 1

Related Questions