Subtract value from data with keys through multiple columns in R

Question

I have two data.tables with multiple columns as keys (they consists of the columns record, dstPort, srcPort, proto, dstIP, and srcIP). Both have the same format.

dataset_1:

    record dstPort srcPort proto         dstIP          srcIP  state        timestamp
1:  state      80   32768   tcp 192.168.101.5 192.168.101.89     syn 1466580661185059
2:  state      80   32768   tcp 192.168.101.5 192.168.101.89 syn_ack 1466520661604781
3:  state      80   32768   tcp 192.168.101.5 192.168.101.89   close 1466532661885439
4:  state      80   55555   tcp 192.168.101.5 192.168.101.89     syn 1466532661885440

and dataset_2:

   record dstPort srcPort proto         dstIP          srcIP            state        timestamp
1:  state      80   32768   tcp 192.168.101.5 192.168.101.89      established 1466537661727619
2:  state      80   32768   tcp 192.168.101.5 192.168.101.89            close 1466532661986891
3:  state      80   44444   tcp 192.168.101.5 192.168.101.89      established 1466537661727619

The following is what I would like to do for every key in the dataset: I want to find the records (rows) with the same key and where a given state is available (i.e. state syn in dataset_1 and established in dataset_2 ). For these records I want to subtract the timestamps from each other. I.e.:

For every Key in dataset_1, i.e.:

state 80 32768 tcp 192.168.101.5 192.168.101.89 for state syn gives timestamp 1466580661185059

and Key in dataset_2:

state 80 32768 tcp 192.168.101.5 192.168.101.89 for state established gives timestamp 1466537661727619

After subtracting timestamps: 1466580661185059-1466537661727619 = 42999457440

It could be that there is no record for a key in dataset_2. This is why sorting does not work (which is what all my tries were based to). An exemplary try is (after having them sort which is not possible anymore):

dt_state1 <- subset(dt, state == 'established')
dt_state2 <- subset(dt, state == 'syn')
dt_delta_test <- data.table(x=(dt_state1$timestamp/1000)- (dt_state2$timestamp/1000),'timestamp'= dt_state1$timestamp-min(dt_state1$timestamp))

Update 1: @lmo:

F1_in = as.data.table(read.csv(file=Filename, header=TRUE, sep=","))
keys=c("record","dstPort","srcPort","dstIP","srcIP")
state1 = 'syn'
state2 = 'established'
dt_state1 <- subset(F1_in, state == state2)
setkey(dt_state1, keys)
Error in setkeyv(x, cols, verbose = verbose, physical = physical) : some columns are not in the data.table: keys
dt_state2 <- subset(F1_in, state == state1)
setkey(dt_state2, keys)
Error in setkeyv(x, cols, verbose = verbose, physical = physical) : some columns are not in the data.table: keys
dt_state1[dt_state2, timestamp - i.timestamp]
Error in `[.data.table`(dt_state1, dt_state2, timestamp - i.timestamp) : 
  When i is a data.table (or character vector), x must be keyed (i.e. sorted, and, marked as sorted) so data.table knows which columns to join to and take advantage of x being sorted. Call setkey(x,...) first, see ?setkey.

I don't know why this error occurs..

@toni057 Your solution does not change anything for me (I had to do some changes because it threw some errors). I tried the following code:

F1_in = as.data.table(read.csv(file=Filename, header=TRUE, sep=","))
keys=c("record","dstPort","srcPort","dstIP","srcIP")
state1 = 'syn'
state2 = 'established'
dt_state1 <- subset(F1_in, state == state2)
setkey(dt_state1, keys)
dt_state2 <- subset(F1_in, state == state1)
setkey(dt_state2, keys)

dt_state1 %>%
  filter("state" == 'syn') %>%
  left_join(filter(dt_state2, "state" == 'established'), by = keys) %>%
  mutate(timestamp_diff = timestamp.x - timestamp.y)

I also changed the dt of the second filter. But there is no change in dt_state1 at all..

lmo · Accepted Answer

If your goal is to take differences of the timestamps between the two data.tables where they both share the same key, you could use left join, and then calculate the difference:

# get stuff set up
library(data.table)
# convert data.frames to data.tables by reference
setDT(dt_state1)
setDT(dt_state2)
# set keys
setkey(dt_state1, record, dstPort, srcPort, proto, dstIP, srcIP)
setkey(dt_state2, record, dstPort, srcPort, proto, dstIP, srcIP)


# perform left join and get timestamp difference
dt_state1[dt_state2, timestamp - i.timestamp]

[1]  42999457440 -17000122838  -4999842180  47999198168 -12000382110      -101452     NA

This performs a left join (which subsets the observations in dt_state1 to include only those present in dt_state2) and subtracts dt_state2's timestamp from dt_state1.

The first entry of the returned vector is the value you listed in your example.

data

dt_state1 <- read.table(header=T, text="
record dstPort srcPort proto    dstIP   srcIP  state        timestamp
1:  state      80   32768   tcp 192.168.101.5 192.168.101.89  syn 1466580661185059
2:  state      80   32768   tcp 192.168.101.5 192.168.101.89 syn_ack 1466520661604781
3:  state      80   32768   tcp 192.168.101.5 192.168.101.89 close 1466532661885439
4:  state      80   55555   tcp 192.168.101.5 192.168.101.89   syn 1466532661885440")

dt_state2 <- read.table(header=T, text="
record dstPort srcPort proto     dstIP    srcIP    state   timestamp
1:  state      80   32768   tcp 192.168.101.5 192.168.101.89  established 1466537661727619
2:  state      80   32768   tcp 192.168.101.5 192.168.101.89  close 1466532661986891
3:  state      80   44444   tcp 192.168.101.5 192.168.101.89 established 1466537661727619")

Subtract value from data with keys through multiple columns in R

Answers (2)

Related Questions