Esperanta
Esperanta

Reputation: 83

How to fill one column with values taken randomly from the same row and different columns?

I am looking for a way to fill in a column with values taken from the same row but from different columns of the same data and I would like that the column from which the value is taken is chosen randomly.

My data looks as follows:

sic.code.1 sic.code2 sic.code3 sic.code4
   7361       6211          NA       NA                                    
   6719        NA           NA       NA                                                     
   2329        NA           5065     5411                                                     
   2869       3674          6282     NA                             
   6282       6282          NA       NA
   6282        NA           NA       NA

I would like to create a new column "sic.code.final" in which every row is filled either with the unique non na value (for example 6719 in second row or 6282 in row 6), or, if there are other non na values in the other "sic.code" columns, it should be filled with a value taken from one of them, chosen randomly.

One of the possible realizations of my expected data could be:

 sic.code.1 sic.code2 sic.code3 sic.code4   sic.code.final
    7361       6211          NA       NA          6211                               
    6719        NA           NA       NA          6719                                    
    2329       5065          5411     NA          2329                                           
    2869       3674          6282     NA          3674                    
    6282       6282          NA       NA          6282
    6282        NA           NA       NA          6282

Any help would be appreciated!

EDIT

In my data there are some rows in which all the columns have NAs:

sic.code.1 sic.code2 sic.code3 sic.code4
   7361       6211          NA       NA                                    
   6719        NA           NA       NA                                                     
   2329        NA           5065     5411                                                     
   2869       3674          6282     NA                             
   6282       6282          NA       NA
   NA          NA           NA       NA

Upvotes: 0

Views: 56

Answers (3)

Martin Morgan
Martin Morgan

Reputation: 46886

And another base approach... I think it differs from James Bonkowski's in a more robust handling of rows where only 1 value is not NA. Also somehow it seems useful to separate the definition of fun from its application, for understanding the code and for code re-use.

Consider the problem of sampling from a vector (a row, in our use case). There are two challenges. The first is that we only want to sample the non-NA values. The second is that sample() treats a vector of length >1 different from a vector of length 1, as documented on ?sample. The following function addresses these two problems...

fun = function(x) {
    x = x[!is.na(x)]
    x[sample(length(x), 1)]
}

Unfortunately, this returns a length 0 vector rather than a length 1 vector when the vector contains all NA, e.g.,

> fun(NA)
logical(0)

A not-particularly-elegant change might be

fun = function(x) {
    x = x[!is.na(x)]
    if (length(x)) {
        x[sample(length(x), 1)]
    } else NA 
}

The application to the problem at hand is then an easy application of apply() and cbind()

cbind(df, final = apply(df, 1, fun))

It's frustrating to try and apply this using the 'tidyverse' approach, where I thought

df %>% rowwise() %>% mutate(final = fun(.))

would work, but it doesn't...

Upvotes: 0

markus
markus

Reputation: 26373

Another base approach

set.seed(42)
df[, "final"] <- df[cbind(1:nrow(df),
                          max.col(!is.na(df), ties.method = "random"))]

Result

df
#  sic.code.1 sic.code2 sic.code3 sic.code4 final
#1          1         7        NA        NA     1
#2          2        NA        NA        NA     2
#3          3        12        10        NA    10
#4          4         8         1        NA     4
#5          5         9        NA        NA     5
#6          6        NA        NA        NA     6

This option makes use of the ties.method argument of max.col. (the default is "random" so you don't actually need to type that out)

max.col(!is.na(df), ties.method = "random") returns for each row the column index of a value that is not NA. Then cbind(1:nrow(df), ... extracts these values from your data.

data

Thanks to @JamesBonkowski

df <- data.frame(sic.code.1 = 1:6,
                 sic.code2 = c(7, NA, 12, 8, 9, NA),
                 sic.code3 = c(NA, NA, 10, 1, NA, NA), 
                 sic.code4 = c(NA, NA, NA, NA, NA, NA))

Upvotes: 1

James B
James B

Reputation: 474

Here's what I'd do in base

df <- data.frame(sic.code.1 = 1:6,
                 sic.code2 = c(7, NA, NA, 8, 9, NA),
                 sic.code3 = c(NA, NA, 10, 1, NA, NA), 
                 sic.code4 = c(NA, NA, 12, NA, NA, NA))

cbind(df, sic.code.final = apply(df, 1, function(x) sample(rep(x[!is.na(x)], 2), 1)))

Upvotes: 1

Related Questions