Reputation: 83
I am looking for a way to fill in a column with values taken from the same row but from different columns of the same data and I would like that the column from which the value is taken is chosen randomly.
My data looks as follows:
sic.code.1 sic.code2 sic.code3 sic.code4
7361 6211 NA NA
6719 NA NA NA
2329 NA 5065 5411
2869 3674 6282 NA
6282 6282 NA NA
6282 NA NA NA
I would like to create a new column "sic.code.final" in which every row is filled either with the unique non na value (for example 6719 in second row or 6282 in row 6), or, if there are other non na values in the other "sic.code" columns, it should be filled with a value taken from one of them, chosen randomly.
One of the possible realizations of my expected data could be:
sic.code.1 sic.code2 sic.code3 sic.code4 sic.code.final
7361 6211 NA NA 6211
6719 NA NA NA 6719
2329 5065 5411 NA 2329
2869 3674 6282 NA 3674
6282 6282 NA NA 6282
6282 NA NA NA 6282
Any help would be appreciated!
EDIT
In my data there are some rows in which all the columns have NAs:
sic.code.1 sic.code2 sic.code3 sic.code4
7361 6211 NA NA
6719 NA NA NA
2329 NA 5065 5411
2869 3674 6282 NA
6282 6282 NA NA
NA NA NA NA
Upvotes: 0
Views: 56
Reputation: 46886
And another base approach... I think it differs from James Bonkowski's in a more robust handling of rows where only 1 value is not NA. Also somehow it seems useful to separate the definition of fun
from its application, for understanding the code and for code re-use.
Consider the problem of sampling from a vector (a row, in our use case). There are two challenges. The first is that we only want to sample the non-NA values. The second is that sample()
treats a vector of length >1 different from a vector of length 1, as documented on ?sample
. The following function addresses these two problems...
fun = function(x) {
x = x[!is.na(x)]
x[sample(length(x), 1)]
}
Unfortunately, this returns a length 0 vector rather than a length 1 vector when the vector contains all NA, e.g.,
> fun(NA)
logical(0)
A not-particularly-elegant change might be
fun = function(x) {
x = x[!is.na(x)]
if (length(x)) {
x[sample(length(x), 1)]
} else NA
}
The application to the problem at hand is then an easy application of apply()
and cbind()
cbind(df, final = apply(df, 1, fun))
It's frustrating to try and apply this using the 'tidyverse' approach, where I thought
df %>% rowwise() %>% mutate(final = fun(.))
would work, but it doesn't...
Upvotes: 0
Reputation: 26373
Another base approach
set.seed(42)
df[, "final"] <- df[cbind(1:nrow(df),
max.col(!is.na(df), ties.method = "random"))]
Result
df
# sic.code.1 sic.code2 sic.code3 sic.code4 final
#1 1 7 NA NA 1
#2 2 NA NA NA 2
#3 3 12 10 NA 10
#4 4 8 1 NA 4
#5 5 9 NA NA 5
#6 6 NA NA NA 6
This option makes use of the ties.method
argument of max.col
. (the default is "random"
so you don't actually need to type that out)
max.col(!is.na(df), ties.method = "random")
returns for each row the column index of a value that is not NA
. Then cbind(1:nrow(df), ...
extracts these values from your data.
data
Thanks to @JamesBonkowski
df <- data.frame(sic.code.1 = 1:6,
sic.code2 = c(7, NA, 12, 8, 9, NA),
sic.code3 = c(NA, NA, 10, 1, NA, NA),
sic.code4 = c(NA, NA, NA, NA, NA, NA))
Upvotes: 1
Reputation: 474
Here's what I'd do in base
df <- data.frame(sic.code.1 = 1:6,
sic.code2 = c(7, NA, NA, 8, 9, NA),
sic.code3 = c(NA, NA, 10, 1, NA, NA),
sic.code4 = c(NA, NA, 12, NA, NA, NA))
cbind(df, sic.code.final = apply(df, 1, function(x) sample(rep(x[!is.na(x)], 2), 1)))
Upvotes: 1