Matching across datasets and columns

Question

I have a vector with words, e.g., like this:

w <- LETTERS[1:5]

and a dataframe with tokens of these words but also tokens of other words in different columns, e.g., like this:

set.seed(21)
df <- data.frame(
  w1 = c(sample(LETTERS, 10)),
  w2 = c(sample(LETTERS, 10)),
  w3 = c(sample(LETTERS, 10)),
  w4 = c(sample(LETTERS, 10))
)
df
   w1 w2 w3 w4
1   U  R  A  Y
2   G  X  P  M
3   Q  B  S  R
4   E  O  V  T
5   V  D  G  W
6   T  A  Q  E
7   C  K  L  U
8   D  F  O  Z
9   R  I  M  G
10  O  T  T  I
# convert factor to character:
df[] <- lapply(df[], as.character)

I'd like to extract from df all the tokens of those words that are contained in the vector w. I can do it like this, but it doesn't look nice and is highly repetitive and error-prone if the data frame is larger:

extract <- c(df$w1[df$w1 %in% w],
             df$w2[df$w2 %in% w], 
             df$w3[df$w3 %in% w], 
             df$w4[df$w4 %in% w])

I tried this, using paste0 to avoid addressing each column separately but that doesn't work:

extract <- df[paste0("w", 1:4)][df[paste0("w", 1:4)] %in% w]
extract
data frame with 0 columns and 10 rows

What's wrong with this code? Or which other code would work?

user10191355 · Accepted Answer

To answer your question, "What's wrong with this code?": the code df[paste0("w", 1:4)][df[paste0("w", 1:4)] %in% w] is equivalent to df[df %in% w], because df[paste0("w", 1:4)], which you use twice, simply returns the whole of df. A data frame is a list of columns, so df %in% w tests each column as a whole rather than each cell. It returns FALSE FALSE FALSE FALSE because none of the columns of df is an element of w (w contains single strings, not vectors of strings). df[c(F, F, F, F)] then selects no columns, which is the empty data frame you got.
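As a rough illustration of that column-wise behaviour, here is a minimal sketch on a small hand-written data frame. The name toy and its values are made up for this example (they are not your df), so the result does not depend on sample():

```r
w <- LETTERS[1:5]

# Hypothetical toy data, written out by hand so the result is deterministic
toy <- data.frame(
  w1 = c("A", "X", "C"),
  w2 = c("Y", "B", "Z"),
  stringsAsFactors = FALSE
)

# %in% sees the data frame as a list of columns: one result per column
col_hits <- toy %in% w
col_hits
# [1] FALSE FALSE

# Subsetting with an all-FALSE vector selects no columns at all
dim(toy[col_hits])
# [1] 3 0

# Testing cell by cell requires going through each column separately
cell_hits <- sapply(toy, function(x) x %in% w)
cell_hits
```

The last call returns a logical matrix with one column per variable, which is the element-wise test the original attempt was presumably aiming for.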

If you're dealing with a single data type (strings), and the output can be a character vector, then use a matrix instead of a data frame, which is faster and is, in this case, a little easier to subset:

mat <- as.matrix(df)
mat[mat %in% w]

#[1] "B" "D" "E" "E" "A" "B" "E" "B"

This produces the same output as your attempt above with extract <- ….
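If you want to convince yourself of that equivalence, a small check along these lines should do. It uses a hand-written data frame (toy is a made-up name with made-up values, not your df) and relies on matrices being stored column by column, so the matches come out in column order:

```r
w <- LETTERS[1:5]

# Hypothetical toy data, written out by hand so the result is deterministic
toy <- data.frame(
  w1 = c("A", "X", "C"),
  w2 = c("Y", "B", "Z"),
  stringsAsFactors = FALSE
)

# Matrix approach: a single element-wise test over all cells
mat <- as.matrix(toy)
from_matrix <- mat[mat %in% w]

# Repetitive approach from the question, one column at a time
by_hand <- c(toy$w1[toy$w1 %in% w],
             toy$w2[toy$w2 %in% w])

from_matrix
# [1] "A" "C" "B"
identical(from_matrix, by_hand)
# [1] TRUE
```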

If you want to keep some semblance of the original data frame structure then you can try the following, which outputs a list (necessary as the returned vectors for each variable might have different lengths):

lapply(df, function(x) x[x %in% w])

#### OUTPUT ####
$w1
[1] "B" "D" "E"

$w2
[1] "E" "A"

$w3
[1] "B"

$w4
[1] "E" "B"

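Just call unlist on the returned list if you want a vector. By default the elements are named after their columns; pass use.names = FALSE to get a plain character vector.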
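A short sketch of that flattening step, again on a made-up data frame (toy and per_col are hypothetical names used only here):

```r
w <- LETTERS[1:5]

# Hypothetical toy data, written out by hand so the result is deterministic
toy <- data.frame(
  w1 = c("A", "X", "C"),
  w2 = c("Y", "B", "Z"),
  stringsAsFactors = FALSE
)

# One character vector of matches per column
per_col <- lapply(toy, function(x) x[x %in% w])

# Flattened, with names derived from the column names
named <- unlist(per_col)
names(named)
# [1] "w11" "w12" "w2"

# Flattened, without names
flat <- unlist(per_col, use.names = FALSE)
flat
# [1] "A" "C" "B"
```

Keeping the names can be handy if you still need to know which column each token came from.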
