Reputation: 11
My problem is as follows: I need to analyse data from several different files with a lot of entries (up to 500.000 per column, 10 columns in total). The files are connected through the use of IDs, e.g. ORDER_IDs. However, the IDs can appear multiple times, e.g. when an order contains multiple order lines. It is also possible, that one ID doesn't appear in one of the files, e.g. because a file with sales data does only have information on the orders shipped, but not those that have not been shippet yet.
So I have different files with different lengths and unique IDs identifying positions that can vary in their appearance (It is there or not) over all the data files. What I want now is to filter one file by ID so that it only shows the IDs listed in another file. Also, the additional columns from the first file should be moved over.
Example of what I have: dt1:
ORDER_ID SKU_ID Quantity_Shipped
12345 678910 100
12346 648392 30
64739 648392 20
dt2:
ORDER_ID Country
12345 DE
12346 DE
55430 SE
90632 JPN
76543 ARG
64739 CH
What I want:
ORDER_ID SKU_ID Quantity_Shipped Country
12345 678910 100 DE
12346 648392 30 DE
64739 648392 20 CH
Originally, the data was imported from a csv file. The approach I used so far has worked when merging two files. When trying to add the information from a third file hoewever, I get only NA as answers. What can I do to fix this?
This is the approach I used so far.
df2 <- data.frame(ORDER_ID = sales[["ORDER_ID"]])
df1 <- data.frame(ORDER_ID = OL[["ORDER_ID"]], SKU_ID = OL[["SKU_ID"]],
QTY_SHIPPED = OL[["QTY_SHIPPED"]], EXPECTED_VOLUME =
OL[["EXPECTED_VOLUME"]])
library(data.table)
dt2 <- data.table(df1)
dt1 <- data.table(df2)
dt3 <- dt2[match(dt1$ORDER_ID, dt2$ORDER_ID), ]
Upvotes: 1
Views: 1196
Reputation: 160687
You can use either the inherent "merge" within data.table
, or the explicit merge
command (which is also calling a data.table
S3 method, but that doesn't entirely matter here).
dt2[dt1, on = "ORDER_ID"]
# ORDER_ID Country SKU_ID Quantity_Shipped
# 1: 12345 DE 678910 100
# 2: 12346 DE 648392 30
# 3: 64739 CH 648392 20
merge(dt1, dt2, by = "ORDER_ID")
Sometimes I prefer to clarity of the merge
call in that I control left/right and other aspects (if the default first-use above doesn't do what I want). I found https://rstudio-pubs-static.s3.amazonaws.com/52230_5ae0d25125b544caab32f75f0360e775.html has a good reference for left/right and other join types.
One time when merge
will not work sufficiently is if you are doing range-joins, either using data.table::foverlaps
or an extension of the inherent method:
# with mythical data, joining on `dat1$val` within `dat2$val` and `dat2$val2`
dat1[dat2, on = .( val >= val1 & val <= val2 )]
Upvotes: 2