MSM
MSM

Reputation: 85

Comparing two dataframes (with different length) based on one column, and retrieve the other columns in r

I have a two dataframe. first data frame has only one column and the second dataframe has 7 columns and 30000 rows. Dataframe 1:

coulmn1
otu_1
otu_2
otu_3
otu_4
otu_5
otu_6
otu_7
otu_8
otu_9
otu_10

Second dataframe:

    otu1    otu2    Name.x  Name.y
    otu_1   otu_2   Gemmiger    Bacteroides
    OTU_3   otu_1  Bifido   Gemmiger
    otu_4   otu_5   Fusobacterium   fags
    otu_6   otu_7   Dialister   gems
    otu_8   otu_9   Streptococcus   hen
    OTU_10  OTU_6   Clostridium IV  Dialister
    OTU_11  OTU_16  Clostridium IV  Dialister

Now, I have to compare Dataframe 1 with the two columns (otu1 and otu2) in dataframe 2 and get their corresponding values in Name.x and Name.y (names are not unique for each id).i.e several otu ids may have the same name. But the otu ids are unique

The desire output will be:

    coulmn1   Name
otu_1     Gemmiger
otu_2     Bacteroides
otu_3     Bifido
otu_4     Fusobacterium
otu_5     fags
otu_6     Dialister 
otu_7     gems
otu_8     Streptococcus
otu_9     hen
otu_10    Clostridium IV
otu_11    Clostridium IV    
otu_16    Dialister

Upvotes: 0

Views: 177

Answers (1)

jazzurro
jazzurro

Reputation: 23574

Here is my attempt. I converted the original data to a long-format data. Then, I defined a group and replaced the 2nd element in otu1 in each group with the 2nd element in otu2 in the same group. Removing two columns (i.e., otu2 and name), I obtained unique values in the end.

pivot_longer(data = df2, cols = starts_with("Name"),
             values_to = "Name") %>% 
group_by(otu2) %>% 
mutate(otu1 = tolower(if_else(row_number() == n(), otu2, otu1))) %>% 
ungroup %>% 
select(-c(otu2, name)) %>% 
distinct(.keep_all = TRUE) %>%
rename(col1 = "otu1")

    col1   Name          
   <chr>  <chr>         
 1 otu_1  Gemmiger      
 2 otu_2  Bacteroides   
 3 otu_3  Bifido        
 4 otu_4  Fusobacterium 
 5 otu_5  fags          
 6 otu_6  Dialister     
 7 otu_7  gems          
 8 otu_8  Streptococcus 
 9 otu_9  hen           
10 otu_10 Clostridium IV
11 otu_11 Clostridium IV
12 otu_16 Dialister  

Data

df2 <- structure(list(otu1 = c("otu_1", "OTU_3", "otu_4", "otu_6", "otu_8", 
"OTU_10", "OTU_11"), otu2 = c("otu_2", "otu_1", "otu_5", "otu_7", 
"otu_9", "OTU_6", "OTU_16"), Name.x = c("Gemmiger", "Bifido", 
"Fusobacterium", "Dialister", "Streptococcus", "Clostridium IV", 
"Clostridium IV"), Name.y = c("Bacteroides", "Gemmiger", "fags", 
"gems", "hen", "Dialister", "Dialister")), class = "data.frame", row.names = c(NA, 
-7L))

Upvotes: 1

Related Questions