Frequency list from multiple columns in dataframe based on restriction

Question

I have a df containing words (columns w1, w2, etc.) and their durations, some of which are NA (columns d1, d2, etc.), like this one:

set.seed(47)
df <- data.frame(
  w1 = c(sample(LETTERS[1:4], 10, replace = TRUE)),
  w2 = c(sample(LETTERS[1:4], 10, replace = TRUE)),
  w3 = c(sample(LETTERS[1:4], 10, replace = TRUE)),
  w4 = c(sample(LETTERS[1:4], 10, replace = TRUE)),
  d1 = c(rep(NA, 3), round(rnorm(7),3)),
  d2 = c(round(rnorm(6),3), NA, round(rnorm(3),3)),
  d3 = c(round(rnorm(2),3), rep(NA,2), round(rnorm(6),3)),
  d4 = c(round(rnorm(1),3), NA, round(rnorm(8),3))
)

   w1 w2 w3 w4     d1     d2     d3     d4
1   D  A  A  C     NA -2.322 -0.693 -0.488
2   B  C  C  B     NA -1.967  0.261     NA
3   D  A  C  B     NA  0.028     NA  -0.92
4   D  C  A  A -1.566  0.484     NA  0.898
5   C  C  C  D  0.249  0.144  0.507 -0.356
6   C  D  B  B  -0.34   -1.2  0.564  1.032
7   B  B  A  A  0.417     NA  0.061  0.664
8   B  A  A  D -0.326  0.885 -0.109   0.97
9   C  A  C  B  -0.89  0.887 -0.155  1.676
10  D  B  D  C -1.608  0.001   0.95  1.988

What I'd like to get is a single frequency list of all word tokens whose corresponding duration is not NA. For example, the "D" in row 1 of column w1 has NA in d1, so this token should not be included in the frequency count. How can this be done in base R, ideally in a single line of code?

Gregor Thomas · Accepted Answer

Ignoring values that are NA in their corresponding columns:

table(unlist(replace(df[paste0("w", 1:4)], is.na(df[paste0("d", 1:4)]), NA)))
#  B  C  D  A 
#  7 11  6  9

# Alternate approach
table(unlist(df[1:4])[!is.na(unlist(df[5:8]))])
#  B  C  D  A 
#  7 11  6  9
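
To see why the mask works, here is a minimal sketch on a small hand-made data frame. The name `toy` and its values are invented for illustration and are not taken from the question. `is.na()` on the duration columns returns a logical matrix with the same shape as the word columns, so `replace()` can use it directly as the index:

```r
# Hypothetical toy data: two word columns and their matching duration columns
toy <- data.frame(
  w1 = c("A", "B", "A"),
  w2 = c("B", "B", "C"),
  d1 = c(NA, 0.5, 1.2),
  d2 = c(0.3, NA, 0.7),
  stringsAsFactors = FALSE
)

# Logical matrix: TRUE wherever a duration is missing
mask <- is.na(toy[c("d1", "d2")])

# Set the words with a missing duration to NA, then count what is left;
# table() drops NA values by default
masked <- replace(toy[c("w1", "w2")], mask, NA)
freq <- table(unlist(masked))
print(freq)
#
# A B C 
# 1 2 1
```

The first "A" (row 1 of w1) and the second "B" in w2 (row 2) are excluded because their durations are NA; every other token is counted.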

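Completely omitting words that have an NA duration anywhere (every occurrence of a word is dropped if any one of its occurrences has an NA duration):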

It's 3 lines, but I'd do it like this:

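# Note: droplevels() requires factors, which was the data.frame() default before R 4.0.0
all_words = unlist(df[1:4])                  # every word token, column by column
na_words = all_words[is.na(unlist(df[5:8]))] # tokens whose duration is NA
table(droplevels(all_words[! all_words %in% na_words]))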
# < table of extent 0 >

You could do it in a single line, but it's much uglier, and it's very hard to tell what's going on:

table(droplevels(unlist(df[1:4])[! unlist(df[1:4]) %in% unlist(df[1:4])[is.na(unlist(df[5:8]))]]))

For the given sample data, this gives an empty table (extent 0) because every unique word has an NA duration somewhere. If you change the input data to use more letters, we get non-empty results:

set.seed(47)
df2 <- data.frame(
  w1 = c(sample(LETTERS[1:8], 10, replace = TRUE)),
  w2 = c(sample(LETTERS[1:8], 10, replace = TRUE)),
  w3 = c(sample(LETTERS[1:8], 10, replace = TRUE)),
  w4 = c(sample(LETTERS[1:8], 10, replace = TRUE)),
  d1 = c(rep(NA, 3), round(rnorm(7),3)),
  d2 = c(round(rnorm(6),3), NA, round(rnorm(3),3)),
  d3 = c(round(rnorm(2),3), rep(NA,2), round(rnorm(6),3)),
  d4 = c(round(rnorm(1),3), NA, round(rnorm(8),3))
)
table(droplevels(unlist(df2[1:4])[! unlist(df2[1:4]) %in% unlist(df2[1:4])[is.na(unlist(df2[5:8]))]]))
# F A 
# 5 4
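
One caveat: `droplevels()` only has methods for factors and data frames, so the lines above assume the word columns are factors. Since R 4.0.0, `data.frame()` creates character columns by default, and `droplevels()` then fails. A minimal sketch of the same idea that works for both factor and character columns follows; the helper name `clean_word_freq()` and the toy data are hypothetical, introduced only for illustration:

```r
# Hypothetical helper: count only words that never occur with a missing duration.
# `words` and `durs` are data frames of equal shape; word column i pairs with
# duration column i.
clean_word_freq <- function(words, durs) {
  # as.character() makes the result independent of factor vs. character input
  all_words <- as.character(unlist(words, use.names = FALSE))
  missing   <- is.na(unlist(durs, use.names = FALSE))
  na_words  <- unique(all_words[missing])  # words with at least one NA duration
  table(all_words[!all_words %in% na_words])
}

# Invented toy data: only the "A" in row 1 has a missing duration
toy <- data.frame(
  w1 = c("A", "B", "C"),
  w2 = c("C", "D", "D"),
  d1 = c(NA, 0.5, 1.2),
  d2 = c(0.3, 0.1, 0.7),
  stringsAsFactors = FALSE
)

freq <- clean_word_freq(toy[c("w1", "w2")], toy[c("d1", "d2")])
print(freq)
#
# B C D 
# 1 2 2
```

Because the filtered vector is a plain character vector, `table()` only reports words that are actually present, so no `droplevels()` step is needed.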
