user5935577
user5935577

Reputation:

R losing data with distinct()

Using distinct to remove duplicates within a combined dataset however Im losing data because distinct only keeps the first entry.

Example data frame "a"

 SiteID PYear   Habitat num.1
000901W 2011    W   NA
001101W 2007    W   NA
001801W 2005    W   NA
002001W 2017    W   NA
002401F 2006    F   NA
002401F 2016    F   NA
004001F 2006    F   NA
004001W 2006    W   NA
004101W 2007    W   NA
004101W 2007    W   16
004701F 2017    F   NA
006201F 2008    F   NA
006501F 2009    F   NA
006601W 2007    W   2
006601W 2007    W   NA
006803F 2009    F   NA
007310F 2018    F   NA
007602W 2017    W   NA
008103W 2011    W   NA
008203F 2007    F   1

Coding:

a<-distinct(a,SiteID, .keep_all = TRUE)

I would like to know how to remove duplicates based on SiteID and num.1 removing duplicates however I dont want to get rid of duplicates that have number values in the num.1 column. For example, in the dataframe a 004101W and 006601W have multiple entries but I want to keep the integer rather than the NA.

Upvotes: 2

Views: 730

Answers (1)

Jon Spring
Jon Spring

Reputation: 66415

(Thank you for updating with more representative sample data!)

a now has 20 rows, with 17 different SiteID values.

Three of those SiteIDs have multiple rows:

library(tidyverse)
a %>% 
  add_count(SiteID) %>%
  filter(n > 1)

## A tibble: 6 x 5
#  SiteID  PYear Habitat num.1     n
#  <chr>   <int> <chr>   <int> <int>
#1 002401F  2006 F          NA     2    # Both have NA for num.1
#2 002401F  2016 F          NA     2    #  ""

#3 004101W  2007 W          NA     2    # Drop 
#4 004101W  2007 W          16     2    # Keep this one

#5 006601W  2007 W           2     2    # Keep this one
#6 006601W  2007 W          NA     2    # Drop

If we want to prioritize the rows without NA in num.1, we can arrange by num.1 within each SiteID, such that NAs come last for each SiteID, and the distinct function will prioritize num.1's with a non-NA value.

(An alternative is also provided in case you want to keep the original sorting in a, but still moving NA values in num.1 to the end. In the is.na(num.1) term, NA's will evaluate as TRUE and will come after provided values, which will evaluate as FALSE.)

a %>% 
  arrange(SiteID, num.1) %>%
  #arrange(SiteID, is.na(num.1)) %>%    # Alternative to preserve orig order
  distinct(SiteID, .keep_all = TRUE)

    SiteID PYear Habitat num.1
1  000901W  2011       W    NA
2  001101W  2007       W    NA
3  001801W  2005       W    NA
4  002001W  2017       W    NA
5  002401F  2006       F    NA     # Kept first appearing row, since both NA num.1
6  004001F  2006       F    NA
7  004001W  2006       W    NA
8  004101W  2007       W    16     # Kept non-NA row
9  004701F  2017       F    NA
10 006201F  2008       F    NA
11 006501F  2009       F    NA
12 006601W  2007       W     2     # Kept non-NA row
13 006803F  2009       F    NA
14 007310F  2018       F    NA
15 007602W  2017       W    NA
16 008103W  2011       W    NA
17 008203F  2007       F     1

Import of sample data

a <- read.table(header = T, stringsAsFactors = F,
  text = " SiteID PYear   Habitat num.1
000901W 2011    W   NA
001101W 2007    W   NA
001801W 2005    W   NA
002001W 2017    W   NA
002401F 2006    F   NA
002401F 2016    F   NA
004001F 2006    F   NA
004001W 2006    W   NA
004101W 2007    W   NA
004101W 2007    W   16
004701F 2017    F   NA
006201F 2008    F   NA
006501F 2009    F   NA
006601W 2007    W   2
006601W 2007    W   NA
006803F 2009    F   NA
007310F 2018    F   NA
007602W 2017    W   NA
008103W 2011    W   NA
008203F 2007    F   1")

Upvotes: 2

Related Questions