Marinka
Marinka

Reputation: 1269

Remove row based on two factor levels

I had a problem that is very similar to this question, however my data is grouped by two levels.

str(dt)
'data.frame':   202206 obs. of  4 variables:
$ cros : int  -205 -200 -195 -190 -185 -180 -175 -170 -165 -160 ...
$ along: Factor w/ 113 levels "100","101","102",..: 1 1 1 1 1 1 1 1 1 1 ...
$ alti : num  1.61 1.6 1.6 1.6 1.6 1.59 1.59 1.59 1.59 1.58 ...
$ year : Factor w/ 6 levels "1979","1983",..: 1 1 1 1 1 1 1 1 1 1 ...

head(dt)
cros along alti year
-205   100 1.61 1979
-200   100 1.60 1979
-195   100 1.60 1979
-190   100 1.60 1979
-185   100 1.60 1979
-180   100 1.59 1979

This data is information from different transects which is the variable along, over that transect they measured at every 5 meter which is the variable cros the altitude which is the variable alti. This they have done over multiple years, however sometimes the transect was longer at a particular year. So I want to remove the rows with a cros points that were not measured all years.

For my data set I have one factor (along) with 113 levels and within that factor I have the factor year with 6 levels. Within these to values I have x (along) and y (alti) which I want to do analysis over the year however for the years the x has to be the same values. I want for the factor cros to remove the values that do not occur at all the years for each factor of along.

The code I used was:

require(data.table)
dt <- as.data.table(total)
tt <- dt[,length(unique(along,year)),by=cros]
tt <- tt[V1==max(V1)]
test <-dt[cros %in% tt$cros]

But I do not get the right result. I can image that unique(along,year) is not the right way to work with grouped data. However I do not know how to do it right.

Oke here is a little bit more clearly what I want

> df <- data.frame(along =       c(10,10,10,10,10,10,10,10,11,11,11,11,11,11,11,11,12,12,12,12,12,12,12,12,12,12,12,12,12), year = c(20,20,20,25,25,25,21,21,20,20,25,25,25,21,21,21,20,20,20,20,25,25,25,25,25,21,21,21,21), cros = c(11,12,13,11,12,13,11,12,11,12,11,12,13,11,12,13,14,15,16,17,14,15,16,17,18,12,13,14,15), value = ceiling(rnorm(29)*10))
> df
   along year cros value
   10    20   11    -3
   10    20   12     5
   10    20   13   -22
   10    25   11    -9
   10    25   12    -3
   10    25   13    -8
   10    21   11    -8
   10    21   12    -8
   11    20   11     7
   11    20   12    -4
   11    25   11    -6
   11    25   12     9
   11    25   13    -5
   11    21   11     6
   11    21   12    17
   11    21   13    -5
   12    20   14   -16
   12    20   15   -17
   12    20   16   -18
   12    20   17    -3
   12    25   14   -18
   12    25   15   -11
   12    25   16    -1
   12    25   17     6
   12    25   18    14
   12    21   12    -3
   12    21   13    19
   12    21   14    16
   12    21   15     7

And this is how I want it to look like, so that the cros (x) values that do not occur for all the years for a give transect are removed.

   along year cros value
    10    20   11    -3
    10    20   12     5
    10    25   11    -9
    10    25   12    -3
    10    21   11    -8
    10    21   12    -8
    11    20   11     7
    11    20   12    -4
    11    25   11    -6
    11    25   12     9
    11    21   11     6
    11    21   12    17
    12    20   14   -16
    12    20   15   -17
    12    25   14   -18
    12    25   15   -11
    12    21   14    16
    12    21   15     7

Upvotes: 1

Views: 304

Answers (2)

eddi
eddi

Reputation: 49448

Here's one way of doing it. Find all the along,cros entries that you want to keep and then merge them back:

dt = data.table(df)

# find the intersections; run in pieces to see what's going on here
to.keep = dt[, list(list(unique(cros))), by = list(along, year)][,
               list(cros = Reduce(intersect, V1)), by = along]

# set the keys to merge together
setkey(to.keep, along, cros)
setkey(dt, along, cros)

# final result
res = to.keep[dt, nomatch = 0]

# optionally, you can order and rearrange columns
setkey(res, along, year, cros)[, .SD, .SDcols = names(dt)]
#    along year cros value
# 1:    10   20   11    11
# 2:    10   20   12     7
# 3:    10   21   11    -4
# 4:    10   21   12     9
# 5:    10   25   11   -16
# 6:    10   25   12     8
# 7:    11   20   11    17
# 8:    11   20   12     1
# 9:    11   21   11     8
#10:    11   21   12   -13
#11:    11   25   11    -7
#12:    11   25   12    17
#13:    12   20   14    12
#14:    12   20   15    -7
#15:    12   21   14     3
#16:    12   21   15     9
#17:    12   25   14     6
#18:    12   25   15    -2

Upvotes: 1

SamanthaDS
SamanthaDS

Reputation: 1143

Edited based on my updated understanding of the problem 1-along is transect ID 2-cros is sampling point on a given transect 3- all transects were sampled in all years but not all sampling points on each transect were sampled in all years 4-The issue is to remove all sampling points that were not sampled for all six years

Here is a way to remove those rows:

require(plyr)
count_var<-ddply(dt, ~along+cros, summarise, count = length(year))
str(count_var)
dt<-merge(dt, count_var, by = c("along", "cros"), all.x = T)
dt_all6<-subset(dt, count==6)

To make dt with all sampling points from all years:

along<-as.factor(rep(1:113, 54))
year<-as.factor(c(rep(1979, 1017), rep(1980, 1017), rep(1981, 1017), rep(1982, 1017), rep(1983, 1017), rep(1984, 1017)))
cros_A<-c(rep(5, 113), rep(10, 113), rep(15, 113), rep(20, 113), rep(25, 113), rep(30, 113), rep(35, 113), rep(40, 113), rep(45, 113))
cros<-as.factor(rep(cros_A, 6))
set.seed(2)
alti<-rnorm(6102, mean = 1.5, sd = 0.5)
dt<-cbind.data.frame(along, year, cros, alti)

Now remove some sampling points:

dt<-dt[c(1:100, 106:400, 406:1500, 1506:1600, 1606:2500, 2506:3000, 3006:3500, 3506:4000, 4006:5000, 5006:6102), ]

Upvotes: 0

Related Questions