OFW
OFW

Reputation: 19

How to consolidate duplicates

I have several duplicate entries in a data set. I want to combine and then add the end column together. See below.

What I have:

Main | FLOW22016| FLOW2| Forest Lakes| 2016| 2016-10-03| 0| Creek chub| Semotilus atromaculatus| 1|

Main | FLOW22016| FLOW2| Forest Lakes| 2016| 2016-10-03| 0| Creek chub| Semotilus atromaculatus| 1|

What I want

Main | FLOW22016| FLOW2| Forest Lakes| 2016| 2016-10-03| 0| Creek chub| Semotilus atromaculatus | 2|

Is this possible? I have many data points like this throughout and I would like to combine.

Upvotes: 1

Views: 87

Answers (4)

hello_friend
hello_friend

Reputation: 5788

Long form generic Base R to keep unique non-numeric vectors' values and sum the numeric vectors:

# Function to group data by key: .grouping_func => function
.grouping_func <- function(vec){
  # Calculate the run length encoding: r_l_e => rle
  r_l_e <- rle(vec)
  # Expand it out into the rle_id: rle_id => integer vector
  rle_id <- rep(
    seq_along(r_l_e$values), 
    times = r_l_e$lengths
  )
  # Explicitly define the return object rle_id => GlobalEnv
  return(rle_id)
}

# Function to resolve numeric vectors in a data.frame
# resolve_num_vecs => function
resolve_num_vecs <- function(df){
  # Resolve which vectors are numeric: 
  # num_cols => logical vector
  num_cols <- setNames(
    vapply(
      df, 
      is.numeric,
      logical(1)
    ),
    colnames(df)
  )
  # logical vector => Env
  return(num_cols)
}

# Function to combine list of data.frames into df: 
# df_list_2_df => function
df_list_2_df <- function(df_list, cmb_func = c(rbind, cbind)){
  # Resolve the desired combination function: 
  # cmb_func_resolved => character scalar
  cmb_func_resolved <- match.fun(cmb_func)
  # Combine list of data.frames into a data.frame 
  # using a given combination function: res => data.frame
  res <- data.frame(
    do.call(
      cmb_func_resolved,
      df_list
    ),
    row.names = NULL
  )
  # Explicitly define the returned object: 
  # data.frame => Env
  return(res)
}

# Function to bucket the data.frame 
# bucket_df => function
bucket_df <- function(df, logical_vector){
  # Apply the grouping function: res => character vector
  res <- .grouping_func(
    apply(
      df[,logical_vector],
      1,
      toString
    )
  )
  # Explicitly define returned object: 
  # character vector => Env
  return(res)
}

# Function to aggregate data.frame in specified manner:
# agg_func => function
agg_func <- function(df, logical_vector){
  # Column bind unique non-numeric columns with 
  # the sum of numeric columns into a data.frame:
  # res => data.frame
  res <- cbind(
    unique(df[,!logical_vector]),
    setNames(
      data.frame(
        t(
          colSums(df[,logical_vector])
        )
      ),
      names(logical_vector)[logical_vector]
    )
  )[,names(df)]
  # Explicitly define returned object: 
  # data.frame => Env
  return(res)
}

# Coerce the v6 vector to be of Date type:
# v6 => Date vector
dat$V6 <- as.Date(paste0(dat$V6, "-12-31"), "%Y-%m-%d")

# Resolve which vectors are numeric: 
# num_cols => logical vector
num_cols <- resolve_num_vecs(dat)

# Split-apply-combine by group using the 
# defined aggregate function: res => data.frame
res <- df_list_2_df( 
  lapply(
    split(
      dat, 
      bucket_df(
        dat, 
        !num_cols
      )
    ),
    function(x){
      agg_func(x, num_cols)
    }
  ),
  rbind
)

Upvotes: 1

jay.sf
jay.sf

Reputation: 72633

First, put the duplicates into bins with increasing numbers, count them and unique the data frame.

In R>4.1 you could do:

dat <- transform(dat, bin=cumsum(!duplicated(dat[-(ncol(dat))]))) |>
  transform(V13=ave(V13, bin, FUN=sum)) |>
  unique()
dat
#     V1        V2    V3     V4    V5   V6         V7 V8    V9  V10          V11           V12 V13 bin
# 1 Main FLOW22016 FLOW2 Forest Lakes 2016 2016-10-03  0 Creek chub    Semotilus atromaculatus   3   1
# 4 Main FLOW22016 FLOW2   Nile river 2016 2016-10-03  0 Creek chub Chelaethiops         bibie   2   2

Data

dat <- read.table(header=TRUE, text='
    V1        V2    V3     V4    V5   V6         V7 V8    V9  V10       V11           V12 V13
 Main FLOW22016 FLOW2 Forest Lakes 2016 2016-10-03  0 Creek chub Semotilus atromaculatus   1           
 Main FLOW22016 FLOW2 Forest Lakes 2016 2016-10-03  0 Creek chub Semotilus atromaculatus   1           
 Main FLOW22016 FLOW2 Forest Lakes 2016 2016-10-03  0 Creek chub Semotilus atromaculatus   1           
 Main FLOW22016 FLOW2 Nile river 2016 2016-10-03  0 Creek chub Chelaethiops bibie   1           
 Main FLOW22016 FLOW2 Nile river 2016 2016-10-03  0 Creek chub Chelaethiops bibie   1           
           ')

Upvotes: 1

U13-Forward
U13-Forward

Reputation: 71560

You mean:

library(dplyr)
df %>%
  group_by(lastcolumn) %>%
  mutate(lastcolumn = sum(lastcolumn))

Upvotes: 0

Kra.P
Kra.P

Reputation: 15123

Try,

library(dplyr)

df %>%
  group_by(across(c( - lastcolumn))) %>%
  summarise(lastcolumn = n())

where lastcolumn is the column that was 1 and needed to become two.

Upvotes: 1

Related Questions