Sean Zheng
Sean Zheng

Reputation: 83

How to search a data frame for multiple strings and pull different column contents depending on string column and row

I have a large dataset (approx. 500,000 x 5,000) where I need to do the following:

  1. Search the entire dataset for a number of different strings (up to 30 at a time).

  2. Depending on the column that the string occurs in, to retrieve information from another column within the same row. The columns are linked, so column V1 is linked to D1 and V2 to D2. Therefore, if a string is identified in, for example, V2 row 3, I would like to extract the value in D2 row 3.

  3. The value from a fixed column of the same row (in my case ID) should also be extracted.

I have already been able to use which to identify the row and column numbers for each of the strings but have been unable to work out the next step to extract additional column values depending on the results of the search.

An example data.frame:

df <- data.frame("ID" = 1:5,
                 "V1" = c("A1", "A2", "A1", "A3", "A4"),
                 "V2" = c("A6", "NA", "A3", "A2", "A1"),
                 "D1" = c("B3", "B3", "B2", "B1", "B5"),
                 "D2" = c("B8", "NA", "B2", "B6", "B2"), 
                 stringsAsFactors=FALSE)

ID V1 V2 D1 D2
1  A1 A6 B3 B8   
2  A2 NA B3 NA   
3  A1 A3 B2 B2 
4  A3 A2 B1 B6 
5  A4 A1 B5 B2

When the code is run to search for "A1", it retrieves the corresponding ID and relevant D variable in a data frame or list format:

 ID: 1, D1: B3
 ID: 3, D1: B2
 ID: 5, D2: B2

Upvotes: 2

Views: 537

Answers (2)

Ronak Shah
Ronak Shah

Reputation: 389055

In base R, we can separate "V" and "D" cols, find occurrence of "A1" from V_cols get corresponding ID and since V_cols and D_cols are linked we can get corresponding values from D_cols using which with row and column index.

V_cols <- grep("^V", names(df), value = TRUE)
D_cols <- grep("^D", names(df), value = TRUE)

mat <- which(df[V_cols] == "A1", arr.ind = TRUE)
data.frame(ID = df$ID[mat[, 1]], D_var = df[D_cols][mat], D_col = D_cols[mat[, 2]])

#  ID D_var D_col
#1  1    B3    D1
#2  3    B2    D1
#3  5    B2    D2

Upvotes: 0

s_baldur
s_baldur

Reputation: 33488

Here is one option using data.table:

library(data.table)
setDT(df)
df_melted <- melt(df, id.vars = "ID",  measure.vars = patterns(V = '^V', D = '^D'))
df_melted 
#     ID variable  V  D
#  1:  1        1 A1 B3
#  2:  2        1 A2 B3
#  3:  3        1 A1 B2
#  4:  4        1 A3 B1
#  5:  5        1 A4 B5
#  6:  1        2 A6 B8
#  7:  2        2 NA NA
#  8:  3        2 A3 B2
#  9:  4        2 A2 B6
# 10:  5        2 A1 B2

strings <- c("A1")
df_melted[V %chin% strings, paste0('ID: ', ID, ', D', variable, ": ", D)]
# "ID: 1, D1: B3" "ID: 3, D1: B2" "ID: 5, D2: B2"

Upvotes: 1

Related Questions