kl-higgins
kl-higgins

Reputation: 313

Select rows based on multi-column attributes in R

I have to merge two datasets (spatial) for which I need to keep the row (polyline) with the most information (i.e. the longest line).

I can select the rows with the same ID as the other dataframe or not select rows with the same ID (see sample below). And reverse that operation. But I can't figure out how to choose the row of the two dataframes that has the bigger length value.

#set up sample data
sample = data.frame(x=c(1:5), length=c(1.2,1.3,1.5,7.2,36.1), ID_obj=c("a3", "4a", "5b", "8b", "a7"))
sample2 = data.frame(x=c(1:5), length=c(15.1,1.3,1.5,17.2,6.1), ID_obj=c("a3", "k6", "9c", "8b", "a7"))

#select the rows with identical values
sample[sample$ID_obj %in% sample2$ID_obj,]
sample2[sample2$ID_obj %in% sample$ID_obj,]

#select rows without duplicates in ID
'%!in%' <- function(x,y)!('%in%'(x,y))
sample[sample$ID_obj %!in% sample2$ID_obj,]
sample2[sample2$ID_obj %!in% sample$ID_obj,]

error<-rbind(sample, sample2[sample2$ID_obj %!in% sample$ID_obj,])
#   x length ID_obj
#1  1    1.2     a3
#2  2    1.3     4a
#3  3    1.5     5b
#4  4    7.2     8b#keep 8b from the first set should not have been kept because length is shorter
#5  5   36.1     a7
#21 2    1.3     k6
#31 3    1.5     9c

#this is the result I want to get automatically    
final<-rbind(sample[c(2, 3, 5),], sample2[c(1, 2, 3, 4),])#
#   x length ID_obj
#2  2    1.3     4a
#3  3    1.5     5b
#5  5   36.1     a7#keep a7 from the first set because length is longer
#1  1   15.1     a3
#21 2    1.3     k6
#31 3    1.5     9c
#4  4   17.2     8b#keep 8b from the second set because length is longer

Upvotes: 0

Views: 183

Answers (2)

dcarlson
dcarlson

Reputation: 11066

A bit more more cryptic with base functions, but just as an exercise:

x <- rbind(sample, sample2)
x <- x[order(x$length), ]
x <- do.call(rbind, lapply(split(x, x$ID_obj), tail, n=1))
x
#    x length ID_obj
# 4a 2    1.3     4a
# 5b 3    1.5     5b
# 8b 4   17.2     8b
# a3 1   15.1     a3
# a7 5   36.1     a7
# 9c 3    1.5     9c
# k6 2    1.3     k6

Add rownames(x) <- NULL if you don't want to use ID_obj as row names.

Upvotes: 1

R Yoda
R Yoda

Reputation: 8760

Use the data.table package for a simplified syntax (and better performance than data.frame):

sample = data.frame(x=c(1:5), length=c(1.2,1.3,1.5,7.2,36.1), ID_obj=c("a3", "4a", "5b", "8b", "a7"))
sample2 = data.frame(x=c(1:5), length=c(15.1,1.3,1.5,17.2,6.1), ID_obj=c("a3", "k6", "9c", "8b", "a7"))
library(data.table)

setDT(sample)  # convert data.frame to data.table "in-place"
setDT(sample2)

x <- rbind(sample, sample2)  # combine rows vertically
setorder(x, -length)         # order by length descending
x[, head(.SD, 1), by = ID_obj]  # output the first row ("head") per ID_obj group

To get the result (in a different order than your expected result):

   ID_obj x length
1:     a7 5   36.1
2:     8b 4   17.2
3:     a3 1   15.1
4:     5b 3    1.5
5:     9c 3    1.5
6:     4a 2    1.3
7:     k6 2    1.3

Upvotes: 1

Related Questions