Reputation: 507
Suppose we have the following data. The rows represent a country and the columns (in05:in09
) indicate whether that country was present in a database of interest in the given year (2005:2009
).
id <- c("a", "b", "c", "d")
in05 <- c(1, 0, 0, 1)
in06 <- c(0, 0, 0, 1)
in07 <- c(1, 1, 0, 1)
in08 <- c(0, 1, 1, 1)
in09 <- c(0, 0, 0, 1)
df <- data.frame(id, in05, in06, in07, in08, in09)
I want to create a variable firstyear
which indicates the first year in which the country was present in the database. Right now I do the following:
df$firstyear <- ifelse(df$in05==1,2005,
ifelse(df$in06==1,2006,
ifelse(df$in07==1, 2007,
ifelse(df$in08==1, 2008,
ifelse(df$in09==1, 2009,
0)))))
The above code is already not very nice, and my dataset contains many more years. Is there an alternative, using *apply
functions, loops or something else, to create this firstyear
variable?
Upvotes: 29
Views: 12859
Reputation: 3321
I always prefer to work with tidied data. First method filters on cumsums
# Tidy
df <- df %>%
gather(year, present.or.not, -id)
# Create df of first instances
first.df <- df %>%
group_by(id, present.or.not) %>%
mutate(ranky = rank(cumsum(present.or.not)),
first.year = year) %>%
filter(ranky == 1)
# Prepare for join
first.df <- first.df[,c('id', 'first.year')]
# Join with original
df <- left_join(df,first.df)
# Spread
spread(df, year, present.or.not)
Or this alternative that, after tidying, slices the first row from arranged groups.
df %>%
gather(year, present_or_not, -id) %>%
filter(present_or_not==1) %>%
group_by(id) %>%
arrange(id, year) %>%
slice(1) %>%
mutate(year = str_replace(year, "in", "20")) %>%
select(1:2) %>%
right_join(df)`
Upvotes: 2
Reputation: 919
You can use dplyr::case_when
inside dplyr::mutate()
along the lines of the method presented in this tweet.
# Using version 0.5.0.
# Dev version may work without `with()`.
df %>%
mutate(., firstyear = with(., case_when(
in05 == 1 ~ 2005,
in06 == 1 ~ 2006,
in07 == 1 ~ 2007,
in08 == 1 ~ 2008,
in09 == 1 ~ 2009,
TRUE ~ 0
)))
Upvotes: 8
Reputation: 13570
Other messy alternatives:
library(tidyr)
library(sqldf)
newdf <- gather(df, year, code, -id)
df$firstyear <- sqldf('SELECT min(rowid) rowid, id, year as firstyear
FROM newdf
WHERE code = 1
GROUP BY id')[3]
library(tidyr)
df2 <- gather(df, year, code, -id)
df2 <- df2[df2$code == 1, 1:2]
df2 <- df2[!duplicated(df2$id), ]
merge(df, df2)
library(tidyr)
library(dplyr)
newdf <- gather(df, year, code, -id)
df$firstyear <- (newdf %>%
filter(code==1) %>%
select(id, year) %>%
group_by(id) %>%
summarise(first = first(year)))[2]
Output:
id in05 in06 in07 in08 in09 year
1 a 1 0 1 0 0 in05
2 b 0 0 1 1 0 in07
3 c 0 0 0 1 0 in08
4 d 1 1 1 1 1 in05
A cleaner solution combining plaforts solution with alexises_laz is:
names(df) <- c("id", 2005, 2006, 2007, 2008, 2009)
df$firstyear <- names(df[-1])[apply(df[-1], 1, which.max)]
id 2005 2006 2007 2008 2009 firstyear
1 a 1 0 1 0 0 2005
2 b 0 0 1 1 0 2007
3 c 0 0 0 1 0 2008
4 d 1 1 1 1 1 2005
If we'd like to keep the original column names we could use the renaming provided by @David Arenburg.
df$firstYear <- gsub('in', '20', names(df[-1]))[apply(df[-1], 1, which.max)]
id in05 in06 in07 in08 in09 firstYear
1 a 1 0 1 0 0 2005
2 b 0 0 1 1 0 2007
3 c 0 0 0 1 0 2008
4 d 1 1 1 1 1 2005
Upvotes: 0
Reputation: 52637
Here is another option:
years <- as.integer(substr(names(df[-1]), 3, 4)) + 2000L
cbind(df, yr=do.call(pmin.int, Map(`/`, years, df[-1])))
Produces:
id in05 in06 in07 in08 in09 yr
1 a 1 0 1 0 0 2005
2 b 0 0 1 1 0 2007
3 c 0 0 0 1 0 2008
4 d 1 1 1 1 1 2005
And is fast. Here timing only the finding the min year step using Alexis' data:
Unit: milliseconds
expr min lq median uq max neval
do.call(pmin.int, Map(`/`, 11:20, DF[-1])) 178.46993 194.3760 219.8898 229.1597 307.1120 10
ff(DF[-1]) 416.07297 434.0792 439.1970 452.8345 496.2048 10
max.col(DF[-1], "first") 99.71936 138.2285 175.2334 207.6365 239.6519 10
Oddly this doesn't reproduce Alexis' timings, showing David's as the fastest. This is on R 3.1.2.
EDIT: based on convo with Frank, I updated Alexis function to be more compatible with R 3.1.2:
ff2 = function(x) {
ans = as.integer(x[[1]])
for(i in 2:length(x)) {
inds = which(ans == 0L)
if(!length(inds)) return(ans)
ans[inds] = i * (x[[i]][inds] == 1)
}
return(ans)
}
And this comes closer to the original results:
Unit: milliseconds
expr min lq median uq max neval
ff(DF[-1]) 407.92699 415.11716 421.18274 428.02092 462.2474 10
ff2(DF[-1]) 64.20484 72.74729 79.85748 81.29153 148.6439 10
Upvotes: 4
Reputation: 13122
Another answer with some notes of efficiency (although this QA is not about speed).
Firstly, it could be better to avoid the conversion of a "list"-y structure to a "matrix"; sometimes it's worth to convert to a "matrix" and use a function that handles efficiently a 'vector with a "dim" attribute' (i.e. a "matrix"/"array") - other times it's not. Both max.col
and apply
convert to a "matrix".
Secondly, in situations like these, where we do not need to check all the data while getting to a solution, we could benefit from a solution with a loop that controls what goes through to the next iteration. Here we know that we can stop when we've found the first "1". Both max.col
(and which.max
) have to loop once to, actually, find the maximum value; the fact that we know that "max == 1" is not taken advantage of.
Thirdly, match
is potentially slower when we seek only one value in another vector of values because match
's setup is rather complicated and costly:
x = 5; set.seed(199); tab = sample(1e6)
identical(match(x, tab), which.max(x == tab))
#[1] TRUE
microbenchmark::microbenchmark(match(x, tab), which.max(x == tab), times = 25)
#Unit: milliseconds
# expr min lq median uq max neval
# match(x, tab) 142.22327 142.50103 142.79737 143.19547 145.37669 25
# which.max(x == tab) 18.91427 18.93728 18.96225 19.58932 38.34253 25
To sum up, a way to work on the "list" structure of a "data.frame" and to stop computations when we find a "1", could be a loop like the following:
ff = function(x)
{
x = as.list(x)
ans = as.integer(x[[1]])
for(i in 2:length(x)) {
inds = ans == 0L
if(!any(inds)) return(ans)
ans[inds] = i * (x[[i]][inds] == 1)
}
return(ans)
}
And the solutions in the other answers (ignoring the extra steps for the output):
david = function(x) max.col(x, "first")
plafort = function(x) apply(x, 1, match, x = 1)
ff(df[-1])
#[1] 1 3 4 1
david(df[-1])
#[1] 1 3 4 1
plafort(df[-1])
#[1] 1 3 4 1
And some benchmarks:
set.seed(007)
DF = data.frame(id = seq_len(1e6),
"colnames<-"(matrix(sample(0:1, 1e7, T, c(0.25, 0.75)), 1e6),
paste("in", 11:20, sep = "")))
identical(ff(DF[-1]), david(DF[-1]))
#[1] TRUE
identical(ff(DF[-1]), plafort(DF[-1]))
#[1] TRUE
microbenchmark::microbenchmark(ff(DF[-1]), david(DF[-1]), as.matrix(DF[-1]), times = 30)
#Unit: milliseconds
# expr min lq median uq max neval
# ff(DF[-1]) 64.83577 65.45432 67.87486 70.32073 86.72838 30
# david(DF[-1]) 112.74108 115.12361 120.16118 132.04803 145.45819 30
# as.matrix(DF[-1]) 20.87947 22.01819 27.52460 32.60509 45.84561 30
system.time(plafort(DF[-1]))
# user system elapsed
# 4.117 0.000 4.125
Not really an apocalypse, but worth to see that simple, straightforward algorithmic approaches can -indeed- prove to be equally good or even better depending on the problem. Obviously, (most) other times looping in R can be laborious.
Upvotes: 8
Reputation: 28441
df$FirstYear <- gsub('in', '20', names(df))[apply(df, 1, match, x=1)]
df
id in05 in06 in07 in08 in09 FirstYear
1 a 1 0 1 0 0 2005
2 b 0 0 1 1 0 2007
3 c 0 0 0 1 0 2008
4 d 1 1 1 1 1 2005
There are many ways to do it. I used match
because it will find the first instance of a specified value. The other parts of the code are for presentation. First going line by line with apply
and naming the years by the column names with names
. The assignment <-
and df$FirstYear
is a way to add the result to the data frame.
added credit @David Arenburg has a cool idea about subbing the in
for 20
in the FirstYear
column.
Upvotes: 21
Reputation: 92282
You can vectorize using max.col
indx <- names(df)[max.col(df[-1], ties.method = "first") + 1L]
df$firstyear <- as.numeric(sub("in", "20", indx))
df
# id in05 in06 in07 in08 in09 firstyear
# 1 a 1 0 1 0 0 2005
# 2 b 0 0 1 1 0 2007
# 3 c 0 0 0 1 0 2008
# 4 d 1 1 1 1 1 2005
Upvotes: 25