Reputation: 586
I have two data.table
s with multiple columns as keys (they consists of the columns record
, dstPort
, srcPort
, proto
, dstIP
, and srcIP
).
Both have the same format.
dataset_1:
record dstPort srcPort proto dstIP srcIP state timestamp 1: state 80 32768 tcp 192.168.101.5 192.168.101.89 syn 1466580661185059 2: state 80 32768 tcp 192.168.101.5 192.168.101.89 syn_ack 1466520661604781 3: state 80 32768 tcp 192.168.101.5 192.168.101.89 close 1466532661885439 4: state 80 55555 tcp 192.168.101.5 192.168.101.89 syn 1466532661885440
and dataset_2:
record dstPort srcPort proto dstIP srcIP state timestamp 1: state 80 32768 tcp 192.168.101.5 192.168.101.89 established 1466537661727619 2: state 80 32768 tcp 192.168.101.5 192.168.101.89 close 1466532661986891 3: state 80 44444 tcp 192.168.101.5 192.168.101.89 established 1466537661727619
The following is what I would like to do for every key in the dataset:
I want to find the records (rows) with the same key and where a given state is available (i.e. state syn
in dataset_1 and established
in dataset_2 ).
For these records I want to subtract the timestamps from each other.
I.e.:
For every Key in dataset_1, i.e.:
state 80 32768 tcp 192.168.101.5 192.168.101.89
for state syn
gives timestamp 1466580661185059
and Key in dataset_2:
state 80 32768 tcp 192.168.101.5 192.168.101.89
for state established
gives timestamp 1466537661727619
After subtracting timestamps: 1466580661185059-1466537661727619 = 42999457440
It could be that there is no record for a key in dataset_2. This is why sorting does not work (which is what all my tries were based to). An exemplary try is (after having them sort which is not possible anymore):
dt_state1 <- subset(dt, state == 'established')
dt_state2 <- subset(dt, state == 'syn')
dt_delta_test <- data.table(x=(dt_state1$timestamp/1000)- (dt_state2$timestamp/1000),'timestamp'= dt_state1$timestamp-min(dt_state1$timestamp))
Update 1: @lmo:
F1_in = as.data.table(read.csv(file=Filename, header=TRUE, sep=","))
keys=c("record","dstPort","srcPort","dstIP","srcIP")
state1 = 'syn'
state2 = 'established'
dt_state1 <- subset(F1_in, state == state2)
setkey(dt_state1, keys)
Error in setkeyv(x, cols, verbose = verbose, physical = physical) : some columns are not in the data.table: keys
dt_state2 <- subset(F1_in, state == state1)
setkey(dt_state2, keys)
Error in setkeyv(x, cols, verbose = verbose, physical = physical) : some columns are not in the data.table: keys
dt_state1[dt_state2, timestamp - i.timestamp]
Error in `[.data.table`(dt_state1, dt_state2, timestamp - i.timestamp) :
When i is a data.table (or character vector), x must be keyed (i.e. sorted, and, marked as sorted) so data.table knows which columns to join to and take advantage of x being sorted. Call setkey(x,...) first, see ?setkey.
I don't know why this error occurs..
@toni057 Your solution does not change anything for me (I had to do some changes because it threw some errors). I tried the following code:
F1_in = as.data.table(read.csv(file=Filename, header=TRUE, sep=","))
keys=c("record","dstPort","srcPort","dstIP","srcIP")
state1 = 'syn'
state2 = 'established'
dt_state1 <- subset(F1_in, state == state2)
setkey(dt_state1, keys)
dt_state2 <- subset(F1_in, state == state1)
setkey(dt_state2, keys)
dt_state1 %>%
filter("state" == 'syn') %>%
left_join(filter(dt_state2, "state" == 'established'), by = keys) %>%
mutate(timestamp_diff = timestamp.x - timestamp.y)
I also changed the dt of the second filter. But there is no change in dt_state1 at all..
Upvotes: 0
Views: 238
Reputation: 604
library(dplyr)
dt_state1 %>%
filter(state == 'syn') %>%
left_join(filter(dt_state2, state == 'established), by = insert all you keys here) %>%
mutate(timestamp_diff = timestamp.x - timestamp.y)
Upvotes: 0
Reputation: 38520
If your goal is to take differences of the timestamps between the two data.tables where they both share the same key, you could use left join, and then calculate the difference:
# get stuff set up
library(data.table)
# convert data.frames to data.tables by reference
setDT(dt_state1)
setDT(dt_state2)
# set keys
setkey(dt_state1, record, dstPort, srcPort, proto, dstIP, srcIP)
setkey(dt_state2, record, dstPort, srcPort, proto, dstIP, srcIP)
# perform left join and get timestamp difference
dt_state1[dt_state2, timestamp - i.timestamp]
[1] 42999457440 -17000122838 -4999842180 47999198168 -12000382110 -101452 NA
This performs a left join (which subsets the observations in dt_state1 to include only those present in dt_state2) and subtracts dt_state2's timestamp from dt_state1.
The first entry of the returned vector is the value you listed in your example.
data
dt_state1 <- read.table(header=T, text="
record dstPort srcPort proto dstIP srcIP state timestamp
1: state 80 32768 tcp 192.168.101.5 192.168.101.89 syn 1466580661185059
2: state 80 32768 tcp 192.168.101.5 192.168.101.89 syn_ack 1466520661604781
3: state 80 32768 tcp 192.168.101.5 192.168.101.89 close 1466532661885439
4: state 80 55555 tcp 192.168.101.5 192.168.101.89 syn 1466532661885440")
dt_state2 <- read.table(header=T, text="
record dstPort srcPort proto dstIP srcIP state timestamp
1: state 80 32768 tcp 192.168.101.5 192.168.101.89 established 1466537661727619
2: state 80 32768 tcp 192.168.101.5 192.168.101.89 close 1466532661986891
3: state 80 44444 tcp 192.168.101.5 192.168.101.89 established 1466537661727619")
Upvotes: 2