Reputation: 83
I have a large dataset (approx. 500,000 x 5,000) where I need to do the following:
Search the entire dataset for a number of different strings (up to 30 at a time).
Depending on the column that the string occurs in, to retrieve information from another column within the same row. The columns are linked, so column V1
is linked to D1
and V2
to D2
. Therefore, if a string is identified in, for example, V2
row 3, I would like to extract the value in D2
row 3.
The value from a fixed column of the same row (in my case ID
) should also be extracted.
I have already been able to use which
to identify the row and column numbers for each of the strings but have been unable to work out the next step to extract additional column values depending on the results of the search.
An example data.frame
:
df <- data.frame("ID" = 1:5,
"V1" = c("A1", "A2", "A1", "A3", "A4"),
"V2" = c("A6", "NA", "A3", "A2", "A1"),
"D1" = c("B3", "B3", "B2", "B1", "B5"),
"D2" = c("B8", "NA", "B2", "B6", "B2"),
stringsAsFactors=FALSE)
ID V1 V2 D1 D2
1 A1 A6 B3 B8
2 A2 NA B3 NA
3 A1 A3 B2 B2
4 A3 A2 B1 B6
5 A4 A1 B5 B2
When the code is run to search for "A1", it retrieves the corresponding ID
and relevant D
variable in a data frame or list format:
ID: 1, D1: B3
ID: 3, D1: B2
ID: 5, D2: B2
Upvotes: 2
Views: 537
Reputation: 389055
In base R, we can separate "V"
and "D"
cols, find occurrence of "A1"
from V_cols
get corresponding ID
and since V_cols
and D_cols
are linked we can get corresponding values from D_cols
using which
with row and column index.
V_cols <- grep("^V", names(df), value = TRUE)
D_cols <- grep("^D", names(df), value = TRUE)
mat <- which(df[V_cols] == "A1", arr.ind = TRUE)
data.frame(ID = df$ID[mat[, 1]], D_var = df[D_cols][mat], D_col = D_cols[mat[, 2]])
# ID D_var D_col
#1 1 B3 D1
#2 3 B2 D1
#3 5 B2 D2
Upvotes: 0
Reputation: 33488
Here is one option using data.table
:
library(data.table)
setDT(df)
df_melted <- melt(df, id.vars = "ID", measure.vars = patterns(V = '^V', D = '^D'))
df_melted
# ID variable V D
# 1: 1 1 A1 B3
# 2: 2 1 A2 B3
# 3: 3 1 A1 B2
# 4: 4 1 A3 B1
# 5: 5 1 A4 B5
# 6: 1 2 A6 B8
# 7: 2 2 NA NA
# 8: 3 2 A3 B2
# 9: 4 2 A2 B6
# 10: 5 2 A1 B2
strings <- c("A1")
df_melted[V %chin% strings, paste0('ID: ', ID, ', D', variable, ": ", D)]
# "ID: 1, D1: B3" "ID: 3, D1: B2" "ID: 5, D2: B2"
Upvotes: 1