Reputation: 394
I have a data frame:
Title Date year lai biomass grain_wt wet_yield
1 HartogSowN 2014-07-31 2014 4.4 NA NA NA
2 HartogMild 2014-07-31 2014 3.7 NA NA NA
3 HartogSevere 2014-07-31 2014 2.3 NA NA NA
4 HartogSowN 2014-08-12 2014 6.1 NA NA NA
5 HartogMild 2014-08-12 2014 6.6 NA NA NA
6 HartogSevere 2014-08-12 2014 3.8 NA NA NA
7 HartogSowN 2014-11-10 2014 NA 16116 NA NA
8 HartogMild 2014-11-10 2014 NA 18224 NA NA
9 HartogSevere 2014-11-10 2014 NA 18184 NA NA
10 HartogSowN 2014-11-10 2014 NA NA 0.041 NA
11 HartogMild 2014-11-10 2014 NA NA 0.040 NA
12 HartogSevere 2014-11-10 2014 NA NA 0.038 NA
13 HartogSowN 2014-08-12 2014 NA 4511 NA NA
14 HartogMild 2014-08-12 2014 NA 4525 NA NA
15 HartogSevere 2014-08-12 2014 NA 3167 NA NA
16 HartogSowN 2014-07-31 2014 NA 2837 NA NA
17 HartogMild 2014-07-31 2014 NA 2444 NA NA
18 HartogSevere 2014-07-31 2014 NA 1940 NA NA
19 HartogSowN 2014-11-10 2014 NA NA NA 8457.4
20 HartogMild 2014-11-10 2014 NA NA NA 8662.4
21 HartogSevere 2014-11-10 2014 NA NA NA 8537.8
22 HartogSowN 2014-11-10 2014 NA NA NA NA
23 HartogMild 2014-11-10 2014 NA NA NA NA
24 HartogSevere 2014-11-10 2014 NA NA NA NA
structure(list(Title = c("HartogSowN", "HartogMild", "HartogSevere",
"HartogSowN", "HartogMild", "HartogSevere", "HartogSowN",
"HartogMild", "HartogSevere", "HartogSowN", "HartogMild",
"HartogSevere", "HartogSowN", "HartogMild", "HartogSevere",
"HartogSowN", "HartogMild", "HartogSevere", "HartogSowN",
"HartogMild", "HartogSevere", "HartogSowN", "HartogMild",
"HartogSevere"), Date = structure(c(16282, 16282, 16282, 16294,
16294, 16294, 16384, 16384, 16384, 16384, 16384, 16384, 16294, 16294,
16294, 16282, 16282, 16282, 16384, 16384, 16384, 16384, 16384,
16384), class = "Date"), year = c(2014, 2014, 2014, 2014, 2014, 2014,
2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014,
2014, 2014, 2014, 2014, 2014, 2014, 2014), lai = c(4.4,
3.7, 2.3, 6.1, 6.6, 3.8, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), biomass = c(NA, NA, NA, NA, NA, NA,
16116, 18224, 18184, NA, NA, NA, 4511, 4525, 3167, 2837, 2444, 1940,
NA, NA, NA, NA, NA, NA), grain_wt = c(NA, NA, NA, NA, NA, NA, NA, NA,
NA, 0.041, 0.04, 0.038, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA), wet_yield = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, 8457.4,
8662.4, 8537.8, NA, NA, NA)), .Names = c("Title", "Date", "year", "lai", "biomass", "grain_wt", "wet_yield"), row.names = c(NA, 24L),
class = "data.frame")
I want to collapse the rows such that all data for a given Title and Date combination are on one row with the extra rows removed. I have found answers to similar questions, but they all involve modifying the original data.
Desired output:
Title Date year lai biomass grain_wt wet_yield
1 HartogSowN 2014-07-31 2014 4.4 2837 NA NA
2 HartogMild 2014-07-31 2014 3.7 2444 NA NA
3 HartogSevere 2014-07-31 2014 2.3 1940 NA NA
4 HartogSowN 2014-08-12 2014 6.1 4511 NA NA
5 HartogMild 2014-08-12 2014 6.6 4525 NA NA
6 HartogSevere 2014-08-12 2014 3.8 3167 NA NA
7 HartogSowN 2014-11-10 2014 NA 16116 0.041 8457.4
8 HartogMild 2014-11-10 2014 NA 18224 0.040 8662.4
9 HartogSevere 2014-11-10 2014 NA 18184 0.038 8537.8
22 HartogSowN 2014-11-10 2014 NA NA NA NA
23 HartogMild 2014-11-10 2014 NA NA NA NA
24 HartogSevere 2014-11-10 2014 NA NA NA NA
with the extra rows that held the biomass, grain_wt and wet_yield being removed.
UPDATE: Thanks Pascal, yes the days should match, my mistake. I have updated the desired result.
UPDATE 2: Added full desired output for clarity.
Upvotes: 1
Views: 302
Reputation: 107567
Consider the following base R solution using aggregate(). Below uses median as the function but any aggregate should work (mean, min, max, etc.) but NAs will be handled differently.
# AGGREGATED DF
collapsedf <- aggregate(list(lai=df$lai,
biomass=df$biomass,
grain_wt=df$grain_wt,
wet_yield=df$wet_yield),
list(Title=df$Title, Date=df$Date, year=df$year),
FUN=median, na.rm=TRUE)
Or as @thelatemail simplifies:
collapsedf <- aggregate(df[c("lai","biomass","grain_wt","wet_yield")],
df[c("Title","Date","year")], FUN=median, na.rm=TRUE)
OUTPUT
Title Date year lai biomass grain_wt wet_yield
1 HartogMild 11/10/2014 2014 NA 18224 0.040 8662.4
2 HartogSevere 11/10/2014 2014 NA 18184 0.038 8537.8
3 HartogSowN 11/10/2014 2014 NA 16116 0.041 8457.4
4 HartogMild 7/31/2014 2014 3.7 2444 NA NA
5 HartogSevere 7/31/2014 2014 2.3 1940 NA NA
6 HartogSowN 7/31/2014 2014 4.4 2837 NA NA
7 HartogMild 8/12/2014 2014 6.6 4525 NA NA
8 HartogSevere 8/12/2014 2014 3.8 3167 NA NA
9 HartogSowN 8/12/2014 2014 6.1 4511 NA NA
Upvotes: 3
Reputation: 93803
Assuming that there is only one valid piece of data for each column for each Title/Date
combination, you can get your desired result with aggregate
:
aggregate(. ~ Title + Date + year, data=df,
FUN=function(x) x[!is.na(x)][1], na.action=na.pass)
# Title Date year lai biomass grain_wt wet_yield
#1 HartogMild 2014-07-31 2014 3.7 2444 NA NA
#2 HartogSevere 2014-07-31 2014 2.3 1940 NA NA
#3 HartogSowN 2014-07-31 2014 4.4 2837 NA NA
#4 HartogMild 2014-08-12 2014 6.6 4525 NA NA
#5 HartogSevere 2014-08-12 2014 3.8 3167 NA NA
#6 HartogSowN 2014-08-12 2014 6.1 4511 NA NA
#7 HartogMild 2014-11-10 2014 NA 18224 0.040 8662.4
#8 HartogSevere 2014-11-10 2014 NA 18184 0.038 8537.8
#9 HartogSowN 2014-11-10 2014 NA 16116 0.041 8457.4
This uses the Title + Date + year
as grouping variables, processing all remaining data columns .
The function simply returns the one non-missing piece of data - !is.na(x)
for each column, within each group.
The [1]
is needed to ensure that an NA
is returned if there are no non-missing pieces of data. E.g. - numeric(0)[1]
returns NA
.
The na.action=na.pass
is needed because aggregate
when used with a y ~ x
formula will throw out all rows with NA
values by default - na.action=na.omit
is the default.
Upvotes: 3