Reputation: 212
I have two data frames with this structure:
> df_gen[1:5,]
Genus mean_RA
1 Unclassified 0.1357401738
2 Lactobacillus 0.0003825068
3 Prevotella 9 0.0009573787
4 Anaerovibrio 0.0049035545
5 Roseburia 0.0026672558
> df_tax[1:8,]
Kingdom Phylum Class Order Family Genus
1 Bacteria Bacteroidetes Bacteroidia Bacteroidales Prevotellaceae Prevotella 9
2 Bacteria Bacteroidetes Bacteroidia Bacteroidales Prevotellaceae Prevotella 9
3 Bacteria Bacteroidetes Bacteroidia Bacteroidales Prevotellaceae Prevotella 9
4 Bacteria Firmicutes Bacilli Lactobacillales Lactobacillaceae Lactobacillus
5 Bacteria Firmicutes Negativicutes Selenomonadales Veillonellaceae Anaerovibrio
6 Bacteria Firmicutes Negativicutes Selenomonadales Veillonellaceae Anaerovibrio
7 Bacteria Firmicutes Bacilli Lactobacillales Lactobacillaceae Lactobacillus
8 Bacteria Firmicutes Clostridia Clostridiales Lachnospiraceae Roseburia
I want to merge df_gen
with df_tax
, but when I do every row completely duplicates, in this way:
> merge(df_gen, df_tax, by = "Genus", all.x = TRUE)
Genus mean_RA Kingdom Phylum Class Order Family
1 Unclassified 0.1357401738 NA NA NA NA NA
2 Lactobacillus 0.0003825068 Bacteria Firmicutes Bacilli Lactobacillales Lactobacillaceae
3 Lactobacillus 0.0003825068 Bacteria Firmicutes Bacilli Lactobacillales Lactobacillaceae
4 Prevotella 9 0.0009573787 Bacteria Bacteroidetes Bacteroidia Bacteroidales Prevotellaceae
5 Prevotella 9 0.0009573787 Bacteria Bacteroidetes Bacteroidia Bacteroidales Prevotellaceae
6 Prevotella 9 0.0009573787 Bacteria Bacteroidetes Bacteroidia Bacteroidales Prevotellaceae
7 Anaerovibrio 0.0049035545 Bacteria Firmicutes Negativicutes Selenomonadales Veillonellaceae
8 Anaerovibrio 0.0049035545 Bacteria Firmicutes Negativicutes Selenomonadales Veillonellaceae
9 Roseburia 0.0026672558 Bacteria Firmicutes Clostridia Clostridiales Lachnospiraceae
I don't know why everything in x
is getting duplicated according tho the number of repetitions in y
. My desired output should have the same row dimension as df_gen
, adding columns from df_tax
.
I tried also with dplyr::left_join
and I end up with the same problem.
I checked other posts on the internet but I found nothing to solve this issue... Any clues?
Upvotes: 0
Views: 584
Reputation: 72613
As a solution to your problem you could write a function that match
es on rows, similar to this one:
matchRows <- function(df1, df2, by) {
do.call(rbind, apply(df1, 1, function(x) {
m <- match(x[[by]], df2[[by]])
`rownames<-`(cbind(t(x), df2[m, -which(names(df2) == by)]), NULL)
}))}
matchRows(df1=df_gen, df2=df_tax, by="Genus")
# Genus mean_RA Kingdom Phylum Class Order Family
# 1 Unclassified 0.1357401738 <NA> <NA> <NA> <NA> <NA>
# 2 Lactobacillus 0.0003825068 Bacteria Firmicutes Bacilli Lactobacillales Lactobacillaceae
# 3 Prevotella9 0.0009573787 Bacteria Bacteroidetes Bacteroidia Bacteroidales Prevotellaceae
# 4 Anaerovibrio 0.0049035545 Bacteria Firmicutes Negativicutes Selenomonadales Veillonellaceae
# 5 Roseburia 0.0026672558 Bacteria Firmicutes Clostridia Clostridiales Lachnospiraceae
Data:
df_gen <- structure(list(Genus = c("Unclassified", "Lactobacillus", "Prevotella9",
"Anaerovibrio", "Roseburia"), mean_RA = c(0.1357401738, 0.0003825068,
0.0009573787, 0.0049035545, 0.0026672558)), row.names = c(NA,
-5L), class = "data.frame")
df_tax <- structure(list(Kingdom = c("Bacteria", "Bacteria", "Bacteria",
"Bacteria", "Bacteria", "Bacteria", "Bacteria", "Bacteria"),
Phylum = c("Bacteroidetes", "Bacteroidetes", "Bacteroidetes",
"Firmicutes", "Firmicutes", "Firmicutes", "Firmicutes", "Firmicutes"
), Class = c("Bacteroidia", "Bacteroidia", "Bacteroidia",
"Bacilli", "Negativicutes", "Negativicutes", "Bacilli", "Clostridia"
), Order = c("Bacteroidales", "Bacteroidales", "Bacteroidales",
"Lactobacillales", "Selenomonadales", "Selenomonadales",
"Lactobacillales", "Clostridiales"), Family = c("Prevotellaceae",
"Prevotellaceae", "Prevotellaceae", "Lactobacillaceae", "Veillonellaceae",
"Veillonellaceae", "Lactobacillaceae", "Lachnospiraceae"),
Genus = c("Prevotella9", "Prevotella9", "Prevotella9", "Lactobacillus",
"Anaerovibrio", "Anaerovibrio", "Lactobacillus", "Roseburia"
)), row.names = c(NA, -8L), class = "data.frame")
Upvotes: 0
Reputation: 1528
Both merge(x, y, all.x=TRUE)
and left_join(x, y)
will keep all rows from x whether or not they have a match in y, so basically these commands avoid non-matching rows in x to be discarded, but do not avoid multiple matching. If y has duplicates on the key variable (in your case, "Genus"), and they have a match in x, you will get duplicates. From a plain logic point of view, that makes sense: which of the two duplicated rows in y should be matched? The function has no way to know, so it matches both. If you want to get a file with the same row number of df_genus, you need df_tax to have no duplicates. If the rows with duplicated Genus are identical also with respect to the other variables, you can go along the line of the comment by r.user.05apr: df_tax_unique <- df_tax[!duplicated(df_tax$Genus), ]
: this will only keep the first of duplicated rows. If rows have the same Genus, but differ with respect to the other variable, you need to make decision according to your needs: you can augment df_genus or you can delete from df_tax the rows you don't want to add to df_genus.
Upvotes: 1
Reputation: 10375
The function is working as expected, it's merging each row from df_tax
on df_gen
, and since there are multiple values present in df_tax
that match a value in df_gen
, you get multiple rows. df_tax
has duplicated rows, that is the issue.
Upvotes: 2