wkmor1
wkmor1

Reputation: 7396

Repeat each row of data.frame the number of times specified in a column

df <- data.frame(var1 = c('a', 'b', 'c'), var2 = c('d', 'e', 'f'),
                 freq = 1:3)

What is the simplest way to expand each row the first two columns of the data.frame above, so that each row is repeated the number of times specified in the column 'freq'?

In other words, go from this:

df
  var1 var2 freq
1    a    d    1
2    b    e    2
3    c    f    3

To this:

df.expanded
  var1 var2
1    a    d
2    b    e
3    b    e
4    c    f
5    c    f
6    c    f

Upvotes: 204

Views: 122854

Answers (10)

a1a5a6
a1a5a6

Reputation: 65

I am providing one more addition to this wonderful thread of nice answers! Use the tidyr package (included in tidyverse) for a one-liner solution:

df %>% tidyr::uncount(weights = freq)

Upvotes: 4

yuanzz
yuanzz

Reputation: 1970

in fact. use the methods of vector and index. we can also achieve the same result, and more easier to understand:

rawdata <- data.frame('time' = 1:3, 
           'x1' = 4:6,
           'x2' = 7:9,
           'x3' = 10:12)

rawdata[rep(1, time=2), ] %>% remove_rownames()
#  time x1 x2 x3
# 1    1  4  7 10
# 2    1  4  7 10


Upvotes: 0

rdornas
rdornas

Reputation: 652

I know this is not the case but if you need to keep the original freq column, you can use another tidyverse approach together with rep:

library(purrr)

df <- data.frame(var1 = c('a', 'b', 'c'), var2 = c('d', 'e', 'f'), freq = 1:3)

df %>% 
  map_df(., rep, .$freq)
#> # A tibble: 6 x 3
#>   var1  var2   freq
#>   <fct> <fct> <int>
#> 1 a     d         1
#> 2 b     e         2
#> 3 b     e         2
#> 4 c     f         3
#> 5 c     f         3
#> 6 c     f         3

Created on 2019-12-21 by the reprex package (v0.3.0)

Upvotes: 8

Ronak Shah
Ronak Shah

Reputation: 388797

Another dplyr alternative with slice where we repeat each row number freq times

library(dplyr)

df %>%  
  slice(rep(seq_len(n()), freq)) %>% 
  select(-freq)

#  var1 var2
#1    a    d
#2    b    e
#3    b    e
#4    c    f
#5    c    f
#6    c    f

seq_len(n()) part can be replaced with any of the following.

df %>% slice(rep(1:nrow(df), freq)) %>% select(-freq)
#Or
df %>% slice(rep(row_number(), freq)) %>% select(-freq)
#Or
df %>% slice(rep(seq_len(nrow(.)), freq)) %>% select(-freq)

Upvotes: 10

M--
M--

Reputation: 28826

Another possibility is using tidyr::expand:

library(dplyr)
library(tidyr)

df %>% group_by_at(vars(-freq)) %>% expand(temp = 1:freq) %>% select(-temp)
#> # A tibble: 6 x 2
#> # Groups:   var1, var2 [3]
#>   var1  var2 
#>   <fct> <fct>
#> 1 a     d    
#> 2 b     e    
#> 3 b     e    
#> 4 c     f    
#> 5 c     f    
#> 6 c     f

One-liner version of vonjd's answer:

library(data.table)

setDT(df)[ ,list(freq=rep(1,freq)),by=c("var1","var2")][ ,freq := NULL][]
#>    var1 var2
#> 1:    a    d
#> 2:    b    e
#> 3:    b    e
#> 4:    c    f
#> 5:    c    f
#> 6:    c    f

Created on 2019-05-21 by the reprex package (v0.2.1)

Upvotes: 5

Max Ghenis
Max Ghenis

Reputation: 15783

@neilfws's solution works great for data.frames, but not for data.tables since they lack the row.names property. This approach works for both:

df.expanded <- df[rep(seq(nrow(df)), df$freq), 1:2]

The code for data.table is a tad cleaner:

# convert to data.table by reference
setDT(df)
df.expanded <- df[rep(seq(.N), freq), !"freq"]

Upvotes: 28

einar
einar

Reputation: 2665

old question, new verb in tidyverse:

library(tidyr) # version >= 0.8.0
df <- data.frame(var1=c('a', 'b', 'c'), var2=c('d', 'e', 'f'), freq=1:3)
df %>% 
  uncount(freq)

    var1 var2
1      a    d
2      b    e
2.1    b    e
3      c    f
3.1    c    f
3.2    c    f

Upvotes: 121

Sam Firke
Sam Firke

Reputation: 23004

Use expandRows() from the splitstackshape package:

library(splitstackshape)
expandRows(df, "freq")

Simple syntax, very fast, works on data.frame or data.table.

Result:

    var1 var2
1      a    d
2      b    e
2.1    b    e
3      c    f
3.1    c    f
3.2    c    f

Upvotes: 49

vonjd
vonjd

Reputation: 4380

In case you have to do this operation on very large data.frames I would recommend converting it into a data.table and use the following, which should run much faster:

library(data.table)
dt <- data.table(df)
dt.expanded <- dt[ ,list(freq=rep(1,freq)),by=c("var1","var2")]
dt.expanded[ ,freq := NULL]
dt.expanded

See how much faster this solution is:

df <- data.frame(var1=1:2e3, var2=1:2e3, freq=1:2e3)
system.time(df.exp <- df[rep(row.names(df), df$freq), 1:2])
##    user  system elapsed 
##    4.57    0.00    4.56
dt <- data.table(df)
system.time(dt.expanded <- dt[ ,list(freq=rep(1,freq)),by=c("var1","var2")])
##    user  system elapsed 
##    0.05    0.01    0.06

Upvotes: 8

neilfws
neilfws

Reputation: 33772

Here's one solution:

df.expanded <- df[rep(row.names(df), df$freq), 1:2]

Result:

    var1 var2
1      a    d
2      b    e
2.1    b    e
3      c    f
3.1    c    f
3.2    c    f

Upvotes: 202

Related Questions