Reputation: 55
Consider this data set:
> DATA <- data.frame(Agreement_number = c(1,1,1,1,2,2,2,2),
+ country = c("Canada","Canada", "USA", "USA", "Canada","Canada", "USA", "USA"),
+ action = c("signature", "ratification","signature", "ratification", "signature", "ratification","signature", "ratification"),
+ signature_date = c(2000,NA,2000,NA, 2001, NA, 2002, NA),
+ ratification_date = c(NA, 2001, NA, 2002, NA, 2001, NA, 2002))
> DATA
Agreement_number country action signature_date ratification_date
1 Canada signature 2000 NA
1 Canada ratification NA 2001
1 USA signature 2000 NA
1 USA ratification NA 2002
2 Canada signature 2001 NA
2 Canada ratification NA 2001
2 USA signature 2002 NA
2 USA ratification NA 2002
As you can see, half of the rows have duplicate information. For a small data set like this it is really easy to remove duplicates. I could use the coalesce
function (dplyr package), get rid of the "action" column and then erase all the irrelevant rows. Though, there many other ways. The final result should look like this:
> DATA <- data.frame( Agreement_number = c(1,1,2,2),
+ country = c("Canada", "USA", "Canada","USA"),
+ signature_date = c(2000,2000,2001,2002),
+ ratification_date = c(2001, 2002, 2001, 2002))
> DATA
Agreement_number country signature_date ratification_date
1 Canada 2000 2001
1 USA 2000 2002
2 Canada 2001 2001
2 USA 2002 2002
The problem, is that my real data set is MUCH bigger (102000 x 270) and there are many more variables. The real data is also more irregular and there are more absent values. The coalesce
function seems very slow. The best loop I could make so far still takes up to 5-10 minutes to run.
Is there a simple way of doing this which would be faster? I have the feeling that there must be some function in R for that kind of operation, but I couldn't find any.
Upvotes: 3
Views: 266
Reputation: 789
I think you need dcast
. The version in the data.table
library calls itself "fast", and in my experience, it is speedy on large datasets.
First, let's create one column which is either the signature_date
or ratification_date
, depending on the action
library(data.table)
setDT(DATA)[, date := ifelse(action == "ratification", ratification_date, signature_date)]
Now, let's cast it so that the action are the columns and the value is the date
wide <- dcast(DATA, Agreement_number + country ~ action, value.var = 'date')
So wide looks like this
Agreement_number country ratification signature
1 1 Canada 2001 2000
2 1 USA 2002 2000
3 2 Canada 2001 2001
4 2 USA 2002 2002
Upvotes: 4
Reputation: 42544
The OP has told that his production data has 100 k rows x 270 columns, and speed is a concern for him. Therefore, I suggest to use data.table
.
I'm aware that Harland also has proposed to use data.table
and dcast()
but the solution below is a different approach. It brings the rows in the correct order and copies the ratification_date
to the signature row. After some clean-up we get the desired result.
library(data.table)
# coerce to data.table,
# make sure that the actions are ordered properly, not alphabetically
setDT(DATA)[, action := ordered(action, levels = c("signature", "ratification"))]
# order the rows to make sure that signature row and ratification row are
# subsequent for each agreement and country
setorder(DATA, Agreement_number, country, action)
# copy the ratification date from the row below but only within each group
result <- DATA[, ratification_date := shift(ratification_date, type = "lead"),
by = c("Agreement_number", "country")][
# keep only signature rows, remove action column
action == "signature"][, action := NULL]
result
Agreement_number country signature_date ratification_date dummy1 dummy2 1: 1 Canada 2000 2001 2 D 2: 1 USA 2000 2002 3 A 3: 2 Canada 2001 2001 1 B 4: 2 USA 2002 2002 4 C
The OP has mentioned that his production data has 270 columns. To simulate this I've added two dummy columns:
set.seed(123L)
DATA <- data.frame(Agreement_number = c(1,1,1,1,2,2,2,2),
country = c("Canada","Canada", "USA", "USA", "Canada","Canada", "USA", "USA"),
action = c("signature", "ratification","signature", "ratification", "signature", "ratification","signature", "ratification"),
signature_date = c(2000,NA,2000,NA, 2001, NA, 2002, NA),
ratification_date = c(NA, 2001, NA, 2002, NA, 2001, NA, 2002),
dummy1 = rep(sample(4), each = 2L),
dummy2 = rep(sample(LETTERS[1:4]), each = 2L))
Note that set.seed()
is used for repeatable results when sampling.
Agreement_number country action signature_date ratification_date dummy1 dummy2 1 1 Canada signature 2000 NA 2 D 2 1 Canada ratification NA 2001 2 D 3 1 USA signature 2000 NA 3 A 4 1 USA ratification NA 2002 3 A 5 2 Canada signature 2001 NA 1 B 6 2 Canada ratification NA 2001 1 B 7 2 USA signature 2002 NA 4 C 8 2 USA ratification NA 2002 4 C
dcast()
with additional columnsHarland has suggested to use data.table
and dcast()
. Besides several other flaws in his answer, it doesn't handle the additional columns the OP has mentioned.
The dcast()
approach below will return also the additional columns:
library(data.table)
# coerce to data table
setDT(DATA)[, action := ordered(action, levels = c("signature", "ratification"))]
# use already existing column to "coalesce" dates
DATA[action == "ratification", signature_date := ratification_date]
DATA[, ratification_date := NULL]
# dcast from long to wide form, note that ... refers to all other columns
result <- dcast(DATA, Agreement_number + country + ... ~ action,
value.var = "signature_date")
result
Agreement_number country dummy1 dummy2 signature ratification 1: 1 Canada 2 D 2000 2001 2: 1 USA 3 A 2000 2002 3: 2 Canada 1 B 2001 2001 4: 2 USA 4 C 2002 2002
Note that this approach will change the order of columns.
Upvotes: 3
Reputation: 38500
Here is another data.table
solution using uwe-block's data.frame. It is similar to uwe-block's method, but uses max
to collapse the data.
# covert data.frame to data.table and factor variables to character variables
library(data.table)
setDT(DATA)[, names(DATA) := lapply(.SD,
function(x) if(is.factor(x)) as.character(x) else x)]
# collapse data set, by agreement and country. Take max of remaining variables.
DATA[, lapply(.SD, max, na.rm=TRUE), by=.(Agreement_number, country)][,action := NULL][]
The lapply
runs through variables not included in the by statement and calculates the maximum after removing NA values. The next link in the chain drops the unneeded action variable and the final (unnecessary) link prints the output.
This returns
Agreement_number country signature_date ratification_date dummy1 dummy2
1: 1 Canada 2000 2001 2 D
2: 1 USA 2000 2002 3 A
3: 2 Canada 2001 2001 1 B
4: 2 USA 2002 2002 4 C
Upvotes: 2