ljh2001
ljh2001

Reputation: 463

How to remove rows in one DF by counts in another DF?

Here are my two dataframes:

structure(list(Author = c("Bubb, D. H., et al.", "Bubb, D. H., et al.", 
"Bubb, D. H., et al.", "Bubb, D. H., et al.", "Bubb, D. H., et al.", 
"Bubb, D. H., et al.", "Bubb, D. H., et al.", "Bubb, D. H., et al.", 
"Bubb, D. H., et al.", "Bubb, D. H., et al.", "Robinson et al.", 
"Robinson et al.", "Robinson et al.", "Robinson et al.", "Louca et al.", 
"Aquiloni, L., et al.", "Aquiloni, L., et al.", "Barbaresi, S., et al.", 
"Barbaresi, S., et al.", "Barbaresi, S., et al.", "Gherardi, F., et al.", 
"Gherardi, F., et al.", "Gherardi, F., et al.", "Loughman et al.", 
"Loughman et al.", "Hall et al.", "Holsman et al. ", "Holsman et al. ", 
"Smith B.D et al.", "Smith B.D et al."), Year = c(2006L, 2006L, 
2006L, 2002L, 2002L, 2002L, 2002L, 2004L, 2004L, 2004L, 2000L, 
2000L, 2000L, 2000L, 2014L, 2005L, 2005L, 2004L, 2004L, 2004L, 
2002L, 2002L, 2002L, 2013L, 2013L, 1991L, 2006L, 2006L, 1991L, 
1991L), Purpose = c("Invasive/Endangered Species", "Movement Metrics", 
"Movement Metrics", "Invasive/Endangered Species", "Movement Metrics", 
"Movement Metrics", "Movement Metrics", "Invasive/Endangered Species", 
"Movement Metrics", "Movement Metrics", "Movement Metrics", "Movement Metrics", 
"Movement Metrics", "Invasive/Endangered Species", "Human Interaction", 
"Invasive/Endangered Species", "Habitat Use", "Invasive/Endangered Species", 
"Feeding/Behavior", "Movement Metrics", "Movement Metrics", "Invasive/Endangered Species", 
"Feeding/Behavior", "Movement Metrics", "Habitat Use", "Movement Metrics", 
"Habitat Use", "Movement Metrics", "Movement Metrics", "Habitat Use"
)), row.names = c(NA, 30L), class = "data.frame")
structure(list(Author = c("Aquiloni, L., et al.", "Aquiloni, L., et al.", 
"Barbaresi, S., et al.", "Barbaresi, S., et al.", "Barbaresi, S., et al.", 
"Bubb, D. H., et al.", "Bubb, D. H., et al.", "Bubb, D. H., et al.", 
"Bubb, D. H., et al.", "Bubb, D. H., et al.", "Bubb, D. H., et al.", 
"Gherardi, F., et al.", "Gherardi, F., et al.", "Gherardi, F., et al.", 
"Hall et al.", "Holsman et al. ", "Holsman et al. ", "Louca et al.", 
"Loughman et al.", "Loughman et al.", "Robinson et al.", "Robinson et al.", 
"Smith B.D et al.", "Smith B.D et al."), Year = c(2005L, 2005L, 
2004L, 2004L, 2004L, 2002L, 2002L, 2004L, 2004L, 2006L, 2006L, 
2002L, 2002L, 2002L, 1991L, 2006L, 2006L, 2014L, 2013L, 2013L, 
2000L, 2000L, 1991L, 1991L), Purpose = c("Habitat Use", "Invasive/Endangered Species", 
"Feeding/Behavior", "Invasive/Endangered Species", "Movement Metrics", 
"Invasive/Endangered Species", "Movement Metrics", "Invasive/Endangered Species", 
"Movement Metrics", "Invasive/Endangered Species", "Movement Metrics", 
"Feeding/Behavior", "Invasive/Endangered Species", "Movement Metrics", 
"Movement Metrics", "Habitat Use", "Movement Metrics", "Human Interaction", 
"Habitat Use", "Movement Metrics", "Invasive/Endangered Species", 
"Movement Metrics", "Habitat Use", "Movement Metrics"), count = c(1L, 
1L, 1L, 1L, 1L, 1L, 3L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 3L, 1L, 1L)), class = "data.frame", row.names = c(NA, 
-24L))

The first dataframe shows Author, Year, and the purposes of the study (of which there can be multiple). However, because of the way I created the data, there are some duplicates (i.e. Robinson et al. 2000 has "Movement Metrics" listed three times when I only want it listed once).

I would use the duplicated or unique functions, but my original DF has many more nonunique columns.

So instead, I created the second dataframe which is grouped by Author/year/purpose such that there is a count for each combination of the three variables. Is there any way for me to say:

if DF2$count > 1, find the matching rows in DF1 and delete the n(counts)-1 rows.

An example:

"SomeFunction" identifies rows in DF2 with count > 1.

"SomeFunction" takes author and year columns in DF2 and matches with DF1

"SomeFunction" removes duplicated rows, leaving a single row for each Author/Year/Purpose combination

Upvotes: 1

Views: 27

Answers (1)

dc37
dc37

Reputation: 16178

If your goal is to remove duplicated rows, you can use distinct function from dplyr package directly on your first dataframe:

library(dplyr)
df1 %>% distinct(Author, Year, Purpose, .keep_all = TRUE)
                  Author Year                     Purpose
1    Bubb, D. H., et al. 2006 Invasive/Endangered Species
2    Bubb, D. H., et al. 2006            Movement Metrics
3    Bubb, D. H., et al. 2002 Invasive/Endangered Species
4    Bubb, D. H., et al. 2002            Movement Metrics
5    Bubb, D. H., et al. 2004 Invasive/Endangered Species
6    Bubb, D. H., et al. 2004            Movement Metrics
7        Robinson et al. 2000            Movement Metrics
8        Robinson et al. 2000 Invasive/Endangered Species
9           Louca et al. 2014           Human Interaction
10  Aquiloni, L., et al. 2005 Invasive/Endangered Species
11  Aquiloni, L., et al. 2005                 Habitat Use
12 Barbaresi, S., et al. 2004 Invasive/Endangered Species
13 Barbaresi, S., et al. 2004            Feeding/Behavior
14 Barbaresi, S., et al. 2004            Movement Metrics
15  Gherardi, F., et al. 2002            Movement Metrics
16  Gherardi, F., et al. 2002 Invasive/Endangered Species
17  Gherardi, F., et al. 2002            Feeding/Behavior
18       Loughman et al. 2013            Movement Metrics
19       Loughman et al. 2013                 Habitat Use
20           Hall et al. 1991            Movement Metrics
21       Holsman et al.  2006                 Habitat Use
22       Holsman et al.  2006            Movement Metrics
23      Smith B.D et al. 1991            Movement Metrics
24      Smith B.D et al. 1991                 Habitat Use

Adding the .keep_all = TRUE argument keeps all additional columns.

Upvotes: 1

Related Questions