Melih Aras

Reputation: 369

How to select the first value of grouped data and compare it in R?

I want to compare the first values of x1 and x2 within each group of a dataset grouped by ID. If the first value of x1 in a group is greater than the first value of x2, I want to assign 1 to that ID, otherwise 0. Here is an example. My input data is:

dt <- data.frame(ID = c(100, 100, 101, 101, 101),
                 x1 = c(1200, 1600, 1350, 1400, 1500),
                 x2 = c(1100, 1410, 1900, 1300, 1100))

Since 1200 > 1100, ID 100 gets 1, and since 1350 < 1900, ID 101 gets 0. The expected output is:

res<-data.frame(ID=c(100, 101), res=c(1,0))

How can I do that?

Thanks

Upvotes: 0

Views: 80

Answers (3)

akrun

Reputation: 887048

We can group by ID and compare the first values of x1 and x2 within each group:

library(dplyr)
dt %>%
  group_by(ID) %>%
  summarise(res = +(first(x1) > first(x2)))
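
The unary `+` in the snippet above coerces the logical result of the comparison to an integer (TRUE becomes 1, FALSE becomes 0), so `res` comes back as an integer column rather than a double. A minimal sketch of that coercion on the question's data follows; `as.integer()` is swapped in as a more explicit spelling of the same thing, and the name `res_plus` is hypothetical, used only for illustration.

```r
library(dplyr)

# Example data from the question
dt <- data.frame(ID = c(100, 100, 101, 101, 101),
                 x1 = c(1200, 1600, 1350, 1400, 1500),
                 x2 = c(1100, 1410, 1900, 1300, 1100))

# Unary plus turns a logical into an integer
+(1200 > 1100)   # 1
+(1350 > 1900)   # 0

# Same comparison inside summarise(), with as.integer() doing
# what the unary plus does (res_plus is a hypothetical name)
res_plus <- dt %>%
  group_by(ID) %>%
  summarise(res = as.integer(first(x1) > first(x2)))
```

If a double column is needed to match the `res` data frame in the question exactly, `as.numeric()` can be used instead.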

Upvotes: 1

Anoushiravan R

Reputation: 21908

You can also use the following solution. I hope I got what you have in mind right:

library(dplyr)

dt %>%
  group_by(ID) %>%
  summarise(res = ifelse(first(x1) > first(x2), 1, 0))

# A tibble: 2 x 2
     ID   res
  <dbl> <dbl>
1   100     1
2   101     0

Upvotes: 1

user438383

Reputation: 6206

You can group by ID with dplyr, access the first element of each column within a group using [1], and compare the two with if_else() inside summarise():

library(dplyr)  # attaches %>%

dt %>%
  dplyr::group_by(ID) %>%
  dplyr::summarise(res = dplyr::if_else(x1[1] > x2[1], 1, 0))

Output:

# A tibble: 2 x 2
     ID   res
  <dbl> <dbl>
1   100     1
2   101     0
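
The same "first element per group" idea can also be sketched in base R, without dplyr. This is not part of the answer's method, just an illustration under the assumption that "first" means the first row in which each ID appears; the names `first_rows` and `res_base` are hypothetical.

```r
# Example data from the question
dt <- data.frame(ID = c(100, 100, 101, 101, 101),
                 x1 = c(1200, 1600, 1350, 1400, 1500),
                 x2 = c(1100, 1410, 1900, 1300, 1100))

# duplicated() is FALSE the first time each ID appears,
# so this keeps exactly the first row of every group
first_rows <- dt[!duplicated(dt$ID), ]

# Compare the first x1 and x2 of each group; as.numeric() maps TRUE/FALSE to 1/0
res_base <- data.frame(ID  = first_rows$ID,
                       res = as.numeric(first_rows$x1 > first_rows$x2))
```

Because `duplicated()` works on order of appearance, the rows do not need to be sorted by ID for this to pick the first value of each group.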

For completeness, here is a data.table version and a benchmark. Note that dt must be a data.table for this syntax to work.

library(data.table)
library(dplyr)

setDT(dt)  # convert the example data.frame to a data.table in place
dt[, .(z = ifelse(x1[1] > x2[1], 1, 0)), by = ID]

# Larger data set for the benchmark: 901 IDs with 1000 rows each
dt <- data.table(ID = rep(100:1000, each = 1000), x1 = sample(901000), x2 = sample(901000))

microbenchmark::microbenchmark(
  dplyr = dt %>%
    dplyr::group_by(ID) %>%
    dplyr::summarise(res = dplyr::if_else(x1[1] > x2[1], 1, 0)),
  data.table = dt[, .(z = ifelse(x1[1] > x2[1], 1, 0)), by = ID]
)
Unit: milliseconds
       expr       min        lq     mean    median       uq       max neval
      dplyr 39.167330 42.806415 46.91723 44.422384 46.28869 125.31500   100
 data.table  9.497764  9.844758 10.94920  9.930658 10.53419  22.87746   100

So if time is of the essence, the data.table version is roughly 4x faster in this benchmark (median of about 9.9 ms versus 44.4 ms).

Upvotes: 1
