Reputation: 1269
I had a problem that is very similar to this question, however my data is grouped by two levels.
str(dt)
'data.frame': 202206 obs. of 4 variables:
$ cros : int -205 -200 -195 -190 -185 -180 -175 -170 -165 -160 ...
$ along: Factor w/ 113 levels "100","101","102",..: 1 1 1 1 1 1 1 1 1 1 ...
$ alti : num 1.61 1.6 1.6 1.6 1.6 1.59 1.59 1.59 1.59 1.58 ...
$ year : Factor w/ 6 levels "1979","1983",..: 1 1 1 1 1 1 1 1 1 1 ...
head(dt)
cros along alti year
-205 100 1.61 1979
-200 100 1.60 1979
-195 100 1.60 1979
-190 100 1.60 1979
-185 100 1.60 1979
-180 100 1.59 1979
This data is information from different transects which is the variable along, over that transect they measured at every 5 meter which is the variable cros the altitude which is the variable alti. This they have done over multiple years, however sometimes the transect was longer at a particular year. So I want to remove the rows with a cros points that were not measured all years.
For my data set I have one factor (along
) with 113 levels and within that factor I have the factor year
with 6 levels. Within these to values I have x (along
) and y (alti
) which I want to do analysis over the year however for the years the x has to be the same values. I want for the factor cros
to remove the values that do not occur at all the years
for each factor of along
.
The code I used was:
require(data.table)
dt <- as.data.table(total)
tt <- dt[,length(unique(along,year)),by=cros]
tt <- tt[V1==max(V1)]
test <-dt[cros %in% tt$cros]
But I do not get the right result. I can image that unique(along,year) is not the right way to work with grouped data. However I do not know how to do it right.
Oke here is a little bit more clearly what I want
> df <- data.frame(along = c(10,10,10,10,10,10,10,10,11,11,11,11,11,11,11,11,12,12,12,12,12,12,12,12,12,12,12,12,12), year = c(20,20,20,25,25,25,21,21,20,20,25,25,25,21,21,21,20,20,20,20,25,25,25,25,25,21,21,21,21), cros = c(11,12,13,11,12,13,11,12,11,12,11,12,13,11,12,13,14,15,16,17,14,15,16,17,18,12,13,14,15), value = ceiling(rnorm(29)*10))
> df
along year cros value
10 20 11 -3
10 20 12 5
10 20 13 -22
10 25 11 -9
10 25 12 -3
10 25 13 -8
10 21 11 -8
10 21 12 -8
11 20 11 7
11 20 12 -4
11 25 11 -6
11 25 12 9
11 25 13 -5
11 21 11 6
11 21 12 17
11 21 13 -5
12 20 14 -16
12 20 15 -17
12 20 16 -18
12 20 17 -3
12 25 14 -18
12 25 15 -11
12 25 16 -1
12 25 17 6
12 25 18 14
12 21 12 -3
12 21 13 19
12 21 14 16
12 21 15 7
And this is how I want it to look like, so that the cros (x) values that do not occur for all the years for a give transect are removed.
along year cros value
10 20 11 -3
10 20 12 5
10 25 11 -9
10 25 12 -3
10 21 11 -8
10 21 12 -8
11 20 11 7
11 20 12 -4
11 25 11 -6
11 25 12 9
11 21 11 6
11 21 12 17
12 20 14 -16
12 20 15 -17
12 25 14 -18
12 25 15 -11
12 21 14 16
12 21 15 7
Upvotes: 1
Views: 304
Reputation: 49448
Here's one way of doing it. Find all the along,cros
entries that you want to keep and then merge them back:
dt = data.table(df)
# find the intersections; run in pieces to see what's going on here
to.keep = dt[, list(list(unique(cros))), by = list(along, year)][,
list(cros = Reduce(intersect, V1)), by = along]
# set the keys to merge together
setkey(to.keep, along, cros)
setkey(dt, along, cros)
# final result
res = to.keep[dt, nomatch = 0]
# optionally, you can order and rearrange columns
setkey(res, along, year, cros)[, .SD, .SDcols = names(dt)]
# along year cros value
# 1: 10 20 11 11
# 2: 10 20 12 7
# 3: 10 21 11 -4
# 4: 10 21 12 9
# 5: 10 25 11 -16
# 6: 10 25 12 8
# 7: 11 20 11 17
# 8: 11 20 12 1
# 9: 11 21 11 8
#10: 11 21 12 -13
#11: 11 25 11 -7
#12: 11 25 12 17
#13: 12 20 14 12
#14: 12 20 15 -7
#15: 12 21 14 3
#16: 12 21 15 9
#17: 12 25 14 6
#18: 12 25 15 -2
Upvotes: 1
Reputation: 1143
Edited based on my updated understanding of the problem 1-along is transect ID 2-cros is sampling point on a given transect 3- all transects were sampled in all years but not all sampling points on each transect were sampled in all years 4-The issue is to remove all sampling points that were not sampled for all six years
Here is a way to remove those rows:
require(plyr)
count_var<-ddply(dt, ~along+cros, summarise, count = length(year))
str(count_var)
dt<-merge(dt, count_var, by = c("along", "cros"), all.x = T)
dt_all6<-subset(dt, count==6)
To make dt with all sampling points from all years:
along<-as.factor(rep(1:113, 54))
year<-as.factor(c(rep(1979, 1017), rep(1980, 1017), rep(1981, 1017), rep(1982, 1017), rep(1983, 1017), rep(1984, 1017)))
cros_A<-c(rep(5, 113), rep(10, 113), rep(15, 113), rep(20, 113), rep(25, 113), rep(30, 113), rep(35, 113), rep(40, 113), rep(45, 113))
cros<-as.factor(rep(cros_A, 6))
set.seed(2)
alti<-rnorm(6102, mean = 1.5, sd = 0.5)
dt<-cbind.data.frame(along, year, cros, alti)
Now remove some sampling points:
dt<-dt[c(1:100, 106:400, 406:1500, 1506:1600, 1606:2500, 2506:3000, 3006:3500, 3506:4000, 4006:5000, 5006:6102), ]
Upvotes: 0