Reputation: 51

Check if names are found in different columns and which one

I have a data frame with 4 columns, each column represent a different treatment. Each column is fill with protein numbers on it and the columns have different number of rows between each other. Theres a way to compare all 4 columns and have as a result a fifth column saying if a value is found in which of the columns? I know I have some values that will happen in two or even maybe 3 of the colums and I was wondering if theres a way to get this as end result in a new column.

I tried Data$A %in% Data$B but this just gives me TRUE or FALSE between two columns. I was looking for some option like match or even contain, but all options seens that can only give me a true or false answer.

What I need is something like this.

    A   B   C
1   DSFG    DSFG    DSGG
2   DDEG    DDED    DDEE    
3   HUGO    HUGI    HUGO

So if this is my table, I want the result like this

    D(?)    E
1   DSFG    A,B
2   DSGG    C
4   DDEG    A
5   DDED    B
6   DDEE    C
7   HUGO    A,C
8   HUGI    B

Upvotes: 1

Answers (2)

Sotos

Reputation: 51582

Solution

An idea via base R is to use stack to convert to long, and aggregate to get the required output.

aggregate(ind ~ values, stack(df), toString)

#  values  ind
#1   DDED    B
#2   DDEE    C
#3   DDEG    A
#4   DSFG A, B
#5   DSGG    C
#6   HUGI    B
#7   HUGO A, C

NOTE: Your columns need to be as.character for this to work. (df[] <- lapply(df, as.character))

Explanations

Stacking turns data into "long format":

stack(df)
values ind
1   DSFG   A
2   DDEG   A
3   HUGO   A
4   DSFG   B
5   DDED   B
6   HUGI   B
7   DSGG   C
8   DDEE   C
9   HUGO   C

toString() simply joins elements in a vector by comma

toString(c("A", "B", "C"))
[1] "A, B, C"

Aggregating returns a vector of "ind"s for each value, and these are then turned into a string using the function above:

aggregate(ind ~ values, stack(df), FUN=toString)

Upvotes: 3

Esben Eickhardt

Reputation: 3842

Doing it the tidy way:

Input

df <- data.frame(A = c("DSFG", "DDEG", "HUGO"), B = c("DSFG", "DDED", "HUGI"), C = c("DSGG", "DDEE", "HUGO"))

Summarizing data

library(tidyverse)
df %>%
  gather("Column", "Value", 1:3) %>%
  group_by(Value) %>%
  summarise(Cols = paste(Column, collapse = ","))

Output

Value Cols
DDED  B    
DDEE  C    
DDEG  A    
DSFG  A,B  
DSGG  C    
HUGI  B    
HUGO  A,C

Upvotes: 0