Reputation: 463
Here are my two dataframes:
structure(list(Author = c("Bubb, D. H., et al.", "Bubb, D. H., et al.",
"Bubb, D. H., et al.", "Bubb, D. H., et al.", "Bubb, D. H., et al.",
"Bubb, D. H., et al.", "Bubb, D. H., et al.", "Bubb, D. H., et al.",
"Bubb, D. H., et al.", "Bubb, D. H., et al.", "Robinson et al.",
"Robinson et al.", "Robinson et al.", "Robinson et al.", "Louca et al.",
"Aquiloni, L., et al.", "Aquiloni, L., et al.", "Barbaresi, S., et al.",
"Barbaresi, S., et al.", "Barbaresi, S., et al.", "Gherardi, F., et al.",
"Gherardi, F., et al.", "Gherardi, F., et al.", "Loughman et al.",
"Loughman et al.", "Hall et al.", "Holsman et al. ", "Holsman et al. ",
"Smith B.D et al.", "Smith B.D et al."), Year = c(2006L, 2006L,
2006L, 2002L, 2002L, 2002L, 2002L, 2004L, 2004L, 2004L, 2000L,
2000L, 2000L, 2000L, 2014L, 2005L, 2005L, 2004L, 2004L, 2004L,
2002L, 2002L, 2002L, 2013L, 2013L, 1991L, 2006L, 2006L, 1991L,
1991L), Purpose = c("Invasive/Endangered Species", "Movement Metrics",
"Movement Metrics", "Invasive/Endangered Species", "Movement Metrics",
"Movement Metrics", "Movement Metrics", "Invasive/Endangered Species",
"Movement Metrics", "Movement Metrics", "Movement Metrics", "Movement Metrics",
"Movement Metrics", "Invasive/Endangered Species", "Human Interaction",
"Invasive/Endangered Species", "Habitat Use", "Invasive/Endangered Species",
"Feeding/Behavior", "Movement Metrics", "Movement Metrics", "Invasive/Endangered Species",
"Feeding/Behavior", "Movement Metrics", "Habitat Use", "Movement Metrics",
"Habitat Use", "Movement Metrics", "Movement Metrics", "Habitat Use"
)), row.names = c(NA, 30L), class = "data.frame")
structure(list(Author = c("Aquiloni, L., et al.", "Aquiloni, L., et al.",
"Barbaresi, S., et al.", "Barbaresi, S., et al.", "Barbaresi, S., et al.",
"Bubb, D. H., et al.", "Bubb, D. H., et al.", "Bubb, D. H., et al.",
"Bubb, D. H., et al.", "Bubb, D. H., et al.", "Bubb, D. H., et al.",
"Gherardi, F., et al.", "Gherardi, F., et al.", "Gherardi, F., et al.",
"Hall et al.", "Holsman et al. ", "Holsman et al. ", "Louca et al.",
"Loughman et al.", "Loughman et al.", "Robinson et al.", "Robinson et al.",
"Smith B.D et al.", "Smith B.D et al."), Year = c(2005L, 2005L,
2004L, 2004L, 2004L, 2002L, 2002L, 2004L, 2004L, 2006L, 2006L,
2002L, 2002L, 2002L, 1991L, 2006L, 2006L, 2014L, 2013L, 2013L,
2000L, 2000L, 1991L, 1991L), Purpose = c("Habitat Use", "Invasive/Endangered Species",
"Feeding/Behavior", "Invasive/Endangered Species", "Movement Metrics",
"Invasive/Endangered Species", "Movement Metrics", "Invasive/Endangered Species",
"Movement Metrics", "Invasive/Endangered Species", "Movement Metrics",
"Feeding/Behavior", "Invasive/Endangered Species", "Movement Metrics",
"Movement Metrics", "Habitat Use", "Movement Metrics", "Human Interaction",
"Habitat Use", "Movement Metrics", "Invasive/Endangered Species",
"Movement Metrics", "Habitat Use", "Movement Metrics"), count = c(1L,
1L, 1L, 1L, 1L, 1L, 3L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 3L, 1L, 1L)), class = "data.frame", row.names = c(NA,
-24L))
The first dataframe shows Author, Year, and the purposes of the study (of which there can be multiple). However, because of the way I created the data, there are some duplicates (i.e. Robinson et al. 2000 has "Movement Metrics" listed three times when I only want it listed once).
I would use the duplicated
or unique
functions, but my original DF has many more nonunique columns.
So instead, I created the second dataframe which is grouped by Author/year/purpose such that there is a count for each combination of the three variables. Is there any way for me to say:
if DF2$count > 1, find the matching rows in DF1 and delete the n(counts)-1 rows.
An example:
"SomeFunction" identifies rows in DF2 with count > 1.
"SomeFunction" takes author and year columns in DF2 and matches with DF1
"SomeFunction" removes duplicated rows, leaving a single row for each Author/Year/Purpose combination
Upvotes: 1
Views: 27
Reputation: 16178
If your goal is to remove duplicated rows, you can use distinct
function from dplyr
package directly on your first dataframe:
library(dplyr)
df1 %>% distinct(Author, Year, Purpose, .keep_all = TRUE)
Author Year Purpose
1 Bubb, D. H., et al. 2006 Invasive/Endangered Species
2 Bubb, D. H., et al. 2006 Movement Metrics
3 Bubb, D. H., et al. 2002 Invasive/Endangered Species
4 Bubb, D. H., et al. 2002 Movement Metrics
5 Bubb, D. H., et al. 2004 Invasive/Endangered Species
6 Bubb, D. H., et al. 2004 Movement Metrics
7 Robinson et al. 2000 Movement Metrics
8 Robinson et al. 2000 Invasive/Endangered Species
9 Louca et al. 2014 Human Interaction
10 Aquiloni, L., et al. 2005 Invasive/Endangered Species
11 Aquiloni, L., et al. 2005 Habitat Use
12 Barbaresi, S., et al. 2004 Invasive/Endangered Species
13 Barbaresi, S., et al. 2004 Feeding/Behavior
14 Barbaresi, S., et al. 2004 Movement Metrics
15 Gherardi, F., et al. 2002 Movement Metrics
16 Gherardi, F., et al. 2002 Invasive/Endangered Species
17 Gherardi, F., et al. 2002 Feeding/Behavior
18 Loughman et al. 2013 Movement Metrics
19 Loughman et al. 2013 Habitat Use
20 Hall et al. 1991 Movement Metrics
21 Holsman et al. 2006 Habitat Use
22 Holsman et al. 2006 Movement Metrics
23 Smith B.D et al. 1991 Movement Metrics
24 Smith B.D et al. 1991 Habitat Use
Adding the .keep_all = TRUE
argument keeps all additional columns.
Upvotes: 1