Asteroid098
Asteroid098

Reputation: 2825

delete rows for duplicate variable in R

I have panel data with duplicate years, but I want to delete the row where job value is smaller:

id  name    year    job
1   Jane    1990    100
1   Jane    1992    200
1   Jane    1993    300
1   Jane    1993    1
1   Jane    1997    400
1   Jane    1997    2
2   Tom     1990    400
2   Tom     1992    500
2   Tom     1993    700
2   Tom     1993    1
2   Tom     1997    900
2   Tom     1997    3

I would want the following:

id  name    year    job
1   Jane    1990    100
1   Jane    1992    200
1   Jane    1993    1
1   Jane    1997    2
2   Tom     1990    400
2   Tom     1992    500
2   Tom     1993    1
2   Tom     1997    3

Would there be a way to do this?

Upvotes: 1

Views: 125

Answers (4)

user1981275
user1981275

Reputation: 13372

You could use base R with the function order, as suggested by James:

> tab[order(tab$job),][! duplicated(tab[order(tab$job), c('id', 'year')], fromLast=T), ]
   id name year job
1   1 Jane 1990 100
2   1 Jane 1992 200
3   1 Jane 1993 300
5   1 Jane 1997 400
7   2  Tom 1990 400
8   2  Tom 1992 500
9   2  Tom 1993 700
11  2  Tom 1997 900

Upvotes: 0

droopy
droopy

Reputation: 2818

you have different possibilities for instance with plyr and dplyr :

# plyr
ddply(tab, .(id, name, year), summarise, job=min(job))
# dplyr
tabg <- group_by(tab, id, name, year)
summarise(tabg, job=min(job))
# basic fonction
aggregate(tab[,"job", drop=FALSE], tab[,3:1], min)

Upvotes: 2

danas.zuokas
danas.zuokas

Reputation: 4643

If your data is data frame df

library(data.table)

dt <- as.data.table(df)
dt[, .SD[which.min(job)], by = list(id, name, year)]

Upvotes: 0

James
James

Reputation: 66834

You can use ddply for this:

x <- read.table(textConnection("id  name    year    job
 1   Jane    1990    100
 1   Jane    1992    200
 1   Jane    1993    300
 1   Jane    1993    1
 1   Jane    1997    400
 1   Jane    1997    2
 2   Tom     1990    400
 2   Tom     1992    500
 2   Tom     1993    700
 2   Tom     1993    1
 2   Tom     1997    900
 2   Tom     1997    3"),header=T)

library(plyr)
ddply(x,c("id","name","year"),summarise, job=max(job))
  id name year job
1  1 Jane 1990 100
2  1 Jane 1992 200
3  1 Jane 1993 300
4  1 Jane 1997 400
5  2  Tom 1990 400
6  2  Tom 1992 500
7  2  Tom 1993 700
8  2  Tom 1997 900

Note that I have obtained what you asked for in the description. Your example output contradicts this. If you do want your example output, use min instead of max.

Upvotes: 0

Related Questions