K K

Reputation: 77

How to get the index of the first occurrence of each group in a column?

 C1 C2
------
a   11
a   2
a   2
b   2
b   34
c   2
c   4
c   1
d   4

How can I get the index of the first occurrence of each group name?

For example, in column C1 the first occurrence of 'b' is at index 4. I need the index of the first occurrence of every group in the same way.

Upvotes: 4

Views: 448

Answers (5)

akrun

Reputation: 887048

Using ave:

with(df, which(as.logical(ave(seq_along(C1), C1,
     FUN = function(x) x == x[1]))))
#[1] 1 4 6 9
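
The one-liner above packs several steps together. As a rough sketch of what it does, the steps can be unpacked as follows. The data frame is copied from the question, and the intermediate names pos, is_first and first_rows are only illustrative, not part of the original answer.

```r
# Sample data copied from the question; the name df matches the answer above
df <- data.frame(C1 = c("a", "a", "a", "b", "b", "c", "c", "c", "d"),
                 C2 = c(11, 2, 2, 2, 34, 2, 4, 1, 4))

# Row positions 1..n
pos <- seq_along(df$C1)

# Within each C1 group, compare every position with the group's first position.
# ave() writes the per-group results back in the original row order.
is_first <- ave(pos, df$C1, FUN = function(x) x == x[1])

# Keep the positions flagged as first in their group
first_rows <- which(as.logical(is_first))
print(first_rows)
```

Because ave() returns a vector of the same length as its input, this approach is handy when the flag is also wanted as a column next to the data.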

Upvotes: 0

ThomasIsCoding

Reputation: 101247

Try tapply + head, as below:

with(
  df,
  tapply(seq_along(C1), C1, head, 1)
)

which gives

a b c d
1 4 6 9

Or we can use aggregate

> aggregate(cbind(idx = seq_along(C1)) ~ C1, df, head, 1)
  C1 idx
1  a   1
2  b   4
3  c   6
4  d   9

Upvotes: 2

denis

Reputation: 5673

To add to the existing answers, in base R you can use tapply:

dt$I <- 1:nrow(dt)
tapply(dt$I, dt$C1, function(x) x[1])

a b c d 
1 4 6 9 

If you want two columns, the group and its index, dplyr offers cur_group_rows(), the equivalent of .I in data.table; see https://dplyr.tidyverse.org/reference/context.html?q=grp#data-table

library(dplyr)

dt %>%
  group_by(C1) %>%
  summarise(index = cur_group_rows()[1])  # name the column to match the output below

# A tibble: 4 x 2
  C1    index
  <fct> <int>
1 a         1
2 b         4
3 c         6
4 d         9

A bit of comparison:

Only the index:

denis = function(){
  tapply(dt$I, dt$C1, function(x) x[1])
}

mt1022 = function(){
  which(!duplicated(dt$C1))
}

library(microbenchmark)

microbenchmark(mt1022(), denis())

Unit: microseconds
     expr  min   lq    mean median    uq   max neval cld
 mt1022() 19.5 23.7  46.705   29.9  48.9 525.2   100  a 
  denis() 61.7 66.0 124.323   89.5 133.1 735.3   100   b

@mt1022's method is about three times faster here (median 29.9 vs. 89.5 microseconds).

If you want the two-column table:

library(dplyr)
library(data.table)

mt1022_datatable = function(){
  as.data.table(dt)[, .(index = .I[1]), by = .(C1)]
}


jmpivette = function(){
  dt %>%
    mutate(r_number = row_number()) %>%
    group_by(C1) %>%
    summarise(r_number[1])
}

denis_dplyr = function(){
  dt %>%
    group_by(C1) %>%
    summarise(index = cur_group_rows()[1])
}

microbenchmark(mt1022_datatable(),jmpivette(),denis_dplyr())

Unit: milliseconds
               expr    min      lq      mean  median      uq     max neval cld
 mt1022_datatable() 1.4469 1.72520  2.234030 2.01225 2.30720  8.9519   100 a  
        jmpivette() 6.6528 7.31915 10.029003 7.94435 8.89835 56.7763   100   c
      denis_dplyr() 4.4943 4.92120  7.057608 5.38290 6.13925 41.9592   100  b 

Here data.table is the fastest of the three, with a median of about 2 ms against 5.4 ms and 7.9 ms for the two dplyr versions.


Data:

dt <- read.table(text = "C1 C2
  a   11
a   2
a   2
b   2
b   34
c   2
c   4
c   1
d   4
", header = TRUE)

Upvotes: 1

jmpivette

Reputation: 275

library(dplyr)

df <- data.frame(C1 = c("a","a","a","b","b","c","c","c","d"),
                 C2 = c(11,2,2,2,34,2,4,1,4))

df %>%
  mutate(r_number = row_number()) %>%
  group_by(C1) %>%
  summarise(index = min(r_number))

#> # A tibble: 4 x 2
#>   C1    index
#>   <chr> <int>
#> 1 a         1
#> 2 b         4
#> 3 c         6
#> 4 d         9

Upvotes: 3

mt1022

Reputation: 17289

With the data.table package, you can get it with .I:

as.data.table(dtt)[, .(index = .I[1]), by = .(C1)]
#    C1 index
# 1:  a     1
# 2:  b     4
# 3:  c     6
# 4:  d     9

If only the indices are needed:

which(!duplicated(dtt$C1))
[1] 1 4 6 9
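
The duplicated() approach returns bare positions without group labels. As a small base R sketch, assuming the column is available as a plain vector (the names C1, first_idx, first_by_group and same_idx are illustrative), the labels can be attached with setNames(), and match() gives the same positions by a different route:

```r
# Sample column taken from the question
C1 <- c("a", "a", "a", "b", "b", "c", "c", "c", "d")

# duplicated() flags every repeat of a value, so its negation marks first occurrences
first_idx <- which(!duplicated(C1))

# Label each index with the group name found at that position
first_by_group <- setNames(first_idx, C1[first_idx])
print(first_by_group)
# a b c d
# 1 4 6 9

# match() returns the position of the first match for each unique value
same_idx <- match(unique(C1), C1)
```

Both variants find the first occurrence even when the rows of a group are not contiguous.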

Upvotes: 5
