ALG
ALG

Reputation: 212

R merge and left_join outputs duplicated rows

I have two data frames with this structure:

> df_gen[1:5,]              
          Genus      mean_RA
1  Unclassified 0.1357401738
2 Lactobacillus 0.0003825068
3  Prevotella 9 0.0009573787
4  Anaerovibrio 0.0049035545
5     Roseburia 0.0026672558

> df_tax[1:8,]              
   Kingdom        Phylum         Class           Order           Family         Genus
1 Bacteria Bacteroidetes   Bacteroidia   Bacteroidales   Prevotellaceae  Prevotella 9
2 Bacteria Bacteroidetes   Bacteroidia   Bacteroidales   Prevotellaceae  Prevotella 9
3 Bacteria Bacteroidetes   Bacteroidia   Bacteroidales   Prevotellaceae  Prevotella 9 
4 Bacteria    Firmicutes       Bacilli Lactobacillales Lactobacillaceae Lactobacillus
5 Bacteria    Firmicutes Negativicutes Selenomonadales  Veillonellaceae  Anaerovibrio
6 Bacteria    Firmicutes Negativicutes Selenomonadales  Veillonellaceae  Anaerovibrio
7 Bacteria    Firmicutes       Bacilli Lactobacillales Lactobacillaceae Lactobacillus
8 Bacteria    Firmicutes    Clostridia   Clostridiales  Lachnospiraceae     Roseburia

I want to merge df_gen with df_tax, but when I do every row completely duplicates, in this way:

> merge(df_gen, df_tax, by = "Genus", all.x = TRUE)
          Genus      mean_RA  Kingdom        Phylum         Class           Order           Family
1  Unclassified 0.1357401738       NA            NA            NA              NA               NA
2 Lactobacillus 0.0003825068 Bacteria    Firmicutes       Bacilli Lactobacillales Lactobacillaceae
3 Lactobacillus 0.0003825068 Bacteria    Firmicutes       Bacilli Lactobacillales Lactobacillaceae
4  Prevotella 9 0.0009573787 Bacteria Bacteroidetes   Bacteroidia   Bacteroidales   Prevotellaceae
5  Prevotella 9 0.0009573787 Bacteria Bacteroidetes   Bacteroidia   Bacteroidales   Prevotellaceae
6  Prevotella 9 0.0009573787 Bacteria Bacteroidetes   Bacteroidia   Bacteroidales   Prevotellaceae
7  Anaerovibrio 0.0049035545 Bacteria    Firmicutes Negativicutes Selenomonadales  Veillonellaceae
8  Anaerovibrio 0.0049035545 Bacteria    Firmicutes Negativicutes Selenomonadales  Veillonellaceae
9     Roseburia 0.0026672558 Bacteria    Firmicutes    Clostridia   Clostridiales  Lachnospiraceae

I don't know why everything in x is getting duplicated according tho the number of repetitions in y. My desired output should have the same row dimension as df_gen, adding columns from df_tax.

I tried also with dplyr::left_join and I end up with the same problem.

I checked other posts on the internet but I found nothing to solve this issue... Any clues?

Upvotes: 0

Views: 584

Answers (3)

jay.sf
jay.sf

Reputation: 72613

As a solution to your problem you could write a function that matches on rows, similar to this one:

matchRows <- function(df1, df2, by) {
  do.call(rbind, apply(df1, 1, function(x) {
    m <- match(x[[by]], df2[[by]])
    `rownames<-`(cbind(t(x), df2[m, -which(names(df2) == by)]), NULL)
  }))}

matchRows(df1=df_gen, df2=df_tax, by="Genus")
#           Genus      mean_RA  Kingdom        Phylum         Class           Order           Family
# 1  Unclassified 0.1357401738     <NA>          <NA>          <NA>            <NA>             <NA>
# 2 Lactobacillus 0.0003825068 Bacteria    Firmicutes       Bacilli Lactobacillales Lactobacillaceae
# 3   Prevotella9 0.0009573787 Bacteria Bacteroidetes   Bacteroidia   Bacteroidales   Prevotellaceae
# 4  Anaerovibrio 0.0049035545 Bacteria    Firmicutes Negativicutes Selenomonadales  Veillonellaceae
# 5     Roseburia 0.0026672558 Bacteria    Firmicutes    Clostridia   Clostridiales  Lachnospiraceae

Data:

df_gen <- structure(list(Genus = c("Unclassified", "Lactobacillus", "Prevotella9", 
"Anaerovibrio", "Roseburia"), mean_RA = c(0.1357401738, 0.0003825068, 
0.0009573787, 0.0049035545, 0.0026672558)), row.names = c(NA, 
-5L), class = "data.frame")

df_tax <- structure(list(Kingdom = c("Bacteria", "Bacteria", "Bacteria", 
"Bacteria", "Bacteria", "Bacteria", "Bacteria", "Bacteria"), 
    Phylum = c("Bacteroidetes", "Bacteroidetes", "Bacteroidetes", 
    "Firmicutes", "Firmicutes", "Firmicutes", "Firmicutes", "Firmicutes"
    ), Class = c("Bacteroidia", "Bacteroidia", "Bacteroidia", 
    "Bacilli", "Negativicutes", "Negativicutes", "Bacilli", "Clostridia"
    ), Order = c("Bacteroidales", "Bacteroidales", "Bacteroidales", 
    "Lactobacillales", "Selenomonadales", "Selenomonadales", 
    "Lactobacillales", "Clostridiales"), Family = c("Prevotellaceae", 
    "Prevotellaceae", "Prevotellaceae", "Lactobacillaceae", "Veillonellaceae", 
    "Veillonellaceae", "Lactobacillaceae", "Lachnospiraceae"), 
    Genus = c("Prevotella9", "Prevotella9", "Prevotella9", "Lactobacillus", 
    "Anaerovibrio", "Anaerovibrio", "Lactobacillus", "Roseburia"
    )), row.names = c(NA, -8L), class = "data.frame")

Upvotes: 0

Claudio
Claudio

Reputation: 1528

Both merge(x, y, all.x=TRUE) and left_join(x, y) will keep all rows from x whether or not they have a match in y, so basically these commands avoid non-matching rows in x to be discarded, but do not avoid multiple matching. If y has duplicates on the key variable (in your case, "Genus"), and they have a match in x, you will get duplicates. From a plain logic point of view, that makes sense: which of the two duplicated rows in y should be matched? The function has no way to know, so it matches both. If you want to get a file with the same row number of df_genus, you need df_tax to have no duplicates. If the rows with duplicated Genus are identical also with respect to the other variables, you can go along the line of the comment by r.user.05apr: df_tax_unique <- df_tax[!duplicated(df_tax$Genus), ]: this will only keep the first of duplicated rows. If rows have the same Genus, but differ with respect to the other variable, you need to make decision according to your needs: you can augment df_genus or you can delete from df_tax the rows you don't want to add to df_genus.

Upvotes: 1

user2974951
user2974951

Reputation: 10375

The function is working as expected, it's merging each row from df_tax on df_gen, and since there are multiple values present in df_tax that match a value in df_gen, you get multiple rows. df_tax has duplicated rows, that is the issue.

Upvotes: 2

Related Questions