Reputation:
Using distinct to remove duplicates within a combined dataset however Im losing data because distinct only keeps the first entry.
Example data frame "a"
SiteID PYear Habitat num.1
000901W 2011 W NA
001101W 2007 W NA
001801W 2005 W NA
002001W 2017 W NA
002401F 2006 F NA
002401F 2016 F NA
004001F 2006 F NA
004001W 2006 W NA
004101W 2007 W NA
004101W 2007 W 16
004701F 2017 F NA
006201F 2008 F NA
006501F 2009 F NA
006601W 2007 W 2
006601W 2007 W NA
006803F 2009 F NA
007310F 2018 F NA
007602W 2017 W NA
008103W 2011 W NA
008203F 2007 F 1
Coding:
a<-distinct(a,SiteID, .keep_all = TRUE)
I would like to know how to remove duplicates based on SiteID and num.1 removing duplicates however I dont want to get rid of duplicates that have number values in the num.1 column. For example, in the dataframe a 004101W and 006601W have multiple entries but I want to keep the integer rather than the NA.
Upvotes: 2
Views: 730
Reputation: 66415
(Thank you for updating with more representative sample data!)
a
now has 20 rows, with 17 different SiteID
values.
Three of those SiteID
s have multiple rows:
library(tidyverse)
a %>%
add_count(SiteID) %>%
filter(n > 1)
## A tibble: 6 x 5
# SiteID PYear Habitat num.1 n
# <chr> <int> <chr> <int> <int>
#1 002401F 2006 F NA 2 # Both have NA for num.1
#2 002401F 2016 F NA 2 # ""
#3 004101W 2007 W NA 2 # Drop
#4 004101W 2007 W 16 2 # Keep this one
#5 006601W 2007 W 2 2 # Keep this one
#6 006601W 2007 W NA 2 # Drop
If we want to prioritize the rows without NA in num.1
, we can arrange
by num.1 within each SiteID, such that NAs come last for each SiteID, and the distinct
function will prioritize num.1's with a non-NA value.
(An alternative is also provided in case you want to keep the original sorting in a
, but still moving NA values in num.1 to the end. In the is.na(num.1)
term, NA's will evaluate as TRUE and will come after provided values, which will evaluate as FALSE.)
a %>%
arrange(SiteID, num.1) %>%
#arrange(SiteID, is.na(num.1)) %>% # Alternative to preserve orig order
distinct(SiteID, .keep_all = TRUE)
SiteID PYear Habitat num.1
1 000901W 2011 W NA
2 001101W 2007 W NA
3 001801W 2005 W NA
4 002001W 2017 W NA
5 002401F 2006 F NA # Kept first appearing row, since both NA num.1
6 004001F 2006 F NA
7 004001W 2006 W NA
8 004101W 2007 W 16 # Kept non-NA row
9 004701F 2017 F NA
10 006201F 2008 F NA
11 006501F 2009 F NA
12 006601W 2007 W 2 # Kept non-NA row
13 006803F 2009 F NA
14 007310F 2018 F NA
15 007602W 2017 W NA
16 008103W 2011 W NA
17 008203F 2007 F 1
a <- read.table(header = T, stringsAsFactors = F,
text = " SiteID PYear Habitat num.1
000901W 2011 W NA
001101W 2007 W NA
001801W 2005 W NA
002001W 2017 W NA
002401F 2006 F NA
002401F 2016 F NA
004001F 2006 F NA
004001W 2006 W NA
004101W 2007 W NA
004101W 2007 W 16
004701F 2017 F NA
006201F 2008 F NA
006501F 2009 F NA
006601W 2007 W 2
006601W 2007 W NA
006803F 2009 F NA
007310F 2018 F NA
007602W 2017 W NA
008103W 2011 W NA
008203F 2007 F 1")
Upvotes: 2