Jose
Jose

Reputation: 15

How to eliminate duplicate observations keeping information from both observations?

I have a large dataset that looks like this:

structure(list(species = structure(c(1L, 2L, 3L, 4L, 2L, 3L), .Label = c("Aa achalensis", 
"Aa argyrolepis", "Aa aurantiaca", "Aa calceata"), class = "factor"), 
    establishment = structure(c(1L, 1L, 1L, 1L, 2L, 2L), .Label = c("farm", 
    "field"), class = "factor"), region = structure(c(1L, 1L, 
    1L, 1L, 1L, 1L), .Label = "ABT", class = "factor"), leaf.area = c(0.348046463, 
    0.265755867, 0.382584479, 0.336147631, NA, NA), ses = c(-0.3906, 
    -0.6257, -0.2987, -0.423, NA, NA)), row.names = c(NA, 6L), class = "data.frame")

I would like to eliminate duplicate observations by using that strategy: change the establishment farm by field in the species that are duplicated, keeping all columns from the farm location (complete information), and eliminating the field one (only NA's in the last 2 columns).

I am pretty new in coding and maybe one of you could help me with it.

Upvotes: 0

Views: 33

Answers (1)

Ronak Shah
Ronak Shah

Reputation: 389055

You can try this approach -

library(dplyr)

df %>%
  group_by(species) %>%
  mutate(establishment = if(n() > 1) 'field' else establishment) %>%
  filter(!is.na(leaf.area) & !is.na(ses)) %>%
  ungroup

#  species        establishment region leaf.area    ses
#  <fct>          <chr>         <fct>      <dbl>  <dbl>
#1 Aa achalensis  farm          ABT        0.348 -0.391
#2 Aa argyrolepis field         ABT        0.266 -0.626
#3 Aa aurantiaca  field         ABT        0.383 -0.299
#4 Aa calceata    farm          ABT        0.336 -0.423

This replaces establishment value to 'field' if there are duplicates for the species and keep non-NA rows.

Upvotes: 1

Related Questions