Jason Matney

Reputation: 552

How do I infill non-adjacent rows with sample data from previous rows in R?

I have data containing a unique identifier, a category, and a description. Below is a toy dataset.

prjnumber <- c(1,2,3,4,5,6,7,8,9,10)
category <- c("based","trill","lit","cold",NA,"epic", NA,NA,NA,NA)
description <- c("skip class",
                 "dunk on brayden",
                 "record deal",
                 "fame and fortune",
                 NA,
                 "female attention",
                 NA,NA,NA,NA)
toy.df <- data.frame(prjnumber, category, description)

> toy.df
       prjnumber category      description
    1          1    based       skip class
    2          2    trill  dunk on brayden
    3          3      lit      record deal
    4          4     cold fame and fortune
    5          5     <NA>             <NA>
    6          6     epic female attention
    7          7     <NA>             <NA>
    8          8     <NA>             <NA>
    9          9     <NA>             <NA>
    10        10     <NA>             <NA>

I want to fill each row that has missing data by randomly sampling 'category' and 'description' from the rows that are already filled in. The final data frame should be complete and should draw only on the 5 rows that initially contain data. The solution should preserve the correlation between columns, so each sampled category keeps its own description. One possible output would be:

> toy.df
       prjnumber category      description
    1          1    based       skip class
    2          2    trill  dunk on brayden
    3          3      lit      record deal
    4          4     cold fame and fortune
    5          5      lit      record deal
    6          6     epic female attention
    7          7    based       skip class
    8          8    based       skip class
    9          9      lit      record deal
    10        10    trill  dunk on brayden

Upvotes: 2

Views: 105

Answers (3)

akrun

Reputation: 887481

You could try

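library(dplyr)
# Fills each column independently, drawing one value per NA with replacement
toy.df %>%
      mutate_each(funs(replace(., is.na(.),
                               sample(.[!is.na(.)], sum(is.na(.)), replace = TRUE))), 2:3)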

Based on new information, we may need a numeric index to use in the funs, so that both columns are filled from the same sampled row.

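toy.df %>%
   # indx: own row number for complete rows, a sampled complete row for NA rows
   mutate(indx = replace(row_number(), is.na(category),
           sample(row_number()[!is.na(category)],
                  size = sum(is.na(category)), replace = TRUE))) %>%
   # reorder both columns by the same index so the pairs stay together
   mutate_each(funs(.[indx]), 2:3) %>%
   select(-indx)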
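
In newer dplyr versions, where mutate_each() and funs() are deprecated, the same index idea can be sketched with across() (this assumes dplyr 1.0.0 or later). The name `filled` and the set.seed() call are only there for illustration, and the toy data is rebuilt so the block runs on its own:

```r
library(dplyr)

toy.df <- data.frame(
  prjnumber   = 1:10,
  category    = c("based", "trill", "lit", "cold", NA, "epic", NA, NA, NA, NA),
  description = c("skip class", "dunk on brayden", "record deal",
                  "fame and fortune", NA, "female attention", NA, NA, NA, NA)
)

set.seed(42)  # only so the random draw is reproducible

filled <- toy.df %>%
  # indx: own row number for complete rows, a sampled complete row for NA rows
  # (assumes more than one complete row; sample() treats a single number n as 1:n)
  mutate(indx = replace(
    row_number(),
    is.na(category),
    sample(which(!is.na(category)), size = sum(is.na(category)), replace = TRUE)
  )) %>%
  # pull both columns through the same index so each pair stays together
  mutate(across(c(category, description), ~ .x[indx])) %>%
  select(-indx)

filled
```

Because both columns are indexed by the same `indx`, every filled-in row is a copy of one existing category/description pair.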

Upvotes: 5

Jthorpe

Reputation: 10196

In base R, to fill in a single field at a time (not preserving the correlation between the fields), use something like:

fields  <-  c('category','description')
for(field in fields){
    missings  <-  is.na(toy.df[[field]])
    toy.df[[field]][missings]  <-  sample(toy.df[[field]][!missings], sum(missings), replace = TRUE)
}

and to fill them in simultaneously (preserving the correlation between the fields) use something like:

missings  <-  apply(toy.df[,fields],
                    1,
                    function(x)any(is.na(x)))

toy.df[missings,fields]  <-  toy.df[!missings,fields][sample(sum(!missings),
                                                           sum(missings),
                                                           replace = TRUE),]

and of course, to avoid the implicit for loop in apply(x, 1, fun), you could use rowSums() on the logical matrix returned by is.na():

rowAny <- function(x) rowSums(x) > 0
missings  <-  rowAny(is.na(toy.df[,fields]))
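
Another loop-free option is base R's complete.cases(). The following is a minimal sketch of the simultaneous fill built on it; the names `donors`, `picks` and `filled` are just for illustration, and the toy data is rebuilt so the block runs on its own:

```r
toy.df <- data.frame(
  prjnumber   = 1:10,
  category    = c("based", "trill", "lit", "cold", NA, "epic", NA, NA, NA, NA),
  description = c("skip class", "dunk on brayden", "record deal",
                  "fame and fortune", NA, "female attention", NA, NA, NA, NA)
)
fields <- c("category", "description")

set.seed(1)  # only so the random draw is reproducible

# TRUE for rows where any of the fields is NA
missings <- !complete.cases(toy.df[, fields])

# row numbers of the complete rows that can act as donors
donors <- which(!missings)

# one donor row per incomplete row, sampled with replacement
picks <- donors[sample.int(length(donors), sum(missings), replace = TRUE)]

# copy both fields from the same donor row, so the pairs stay together
filled <- toy.df
filled[missings, fields] <- toy.df[picks, fields]
filled
```

Indexing `donors` with sample.int() avoids the sample() quirk where a single number n is treated as 1:n, which matters if only one complete row exists.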

Upvotes: 2

Gregor Thomas

Reputation: 145965

complete = na.omit(toy.df)
toy.df[is.na(toy.df$category), c("category", "description")] =
    complete[sample(1:nrow(complete), size = sum(is.na(toy.df$category)), replace = TRUE),
             c("category", "description")]
toy.df
#    prjnumber category      description
# 1          1    based       skip class
# 2          2    trill  dunk on brayden
# 3          3      lit      record deal
# 4          4     cold fame and fortune
# 5          5      lit      record deal
# 6          6     epic female attention
# 7          7     cold fame and fortune
# 8          8    based       skip class
# 9          9     epic female attention
# 10        10     epic female attention

Though it would seem a little more straightforward if you didn't start with the unique identifiers filled out for the NA rows...
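
As a rough sketch of what that could look like: start from only the complete rows, sample the extra rows, and assign the identifiers at the end. Here `n_extra` is a hypothetical count of rows to add, and note that the sampled rows end up after the original ones rather than in the positions shown in the question:

```r
# only the rows that actually contain data, without identifiers
complete <- data.frame(
  category    = c("based", "trill", "lit", "cold", "epic"),
  description = c("skip class", "dunk on brayden", "record deal",
                  "fame and fortune", "female attention")
)

set.seed(7)   # only so the random draw is reproducible
n_extra <- 5  # hypothetical number of rows to add

# sample whole rows with replacement, so each pair stays together
extra <- complete[sample(nrow(complete), n_extra, replace = TRUE), ]

# stack the original and sampled rows, then number them
result <- rbind(complete, extra)
result <- cbind(prjnumber = seq_len(nrow(result)), result)
rownames(result) <- NULL
result
```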

Upvotes: 5
