Justin
Justin

Reputation: 394

Merging similar rows in a data frame

I have a data frame:

       Title       Date    year lai biomass grain_wt wet_yield
1    HartogSowN 2014-07-31 2014 4.4      NA       NA        NA
2    HartogMild 2014-07-31 2014 3.7      NA       NA        NA
3  HartogSevere 2014-07-31 2014 2.3      NA       NA        NA
4    HartogSowN 2014-08-12 2014 6.1      NA       NA        NA
5    HartogMild 2014-08-12 2014 6.6      NA       NA        NA
6  HartogSevere 2014-08-12 2014 3.8      NA       NA        NA
7    HartogSowN 2014-11-10 2014  NA   16116       NA        NA
8    HartogMild 2014-11-10 2014  NA   18224       NA        NA
9  HartogSevere 2014-11-10 2014  NA   18184       NA        NA
10   HartogSowN 2014-11-10 2014  NA      NA    0.041        NA
11   HartogMild 2014-11-10 2014  NA      NA    0.040        NA
12 HartogSevere 2014-11-10 2014  NA      NA    0.038        NA
13   HartogSowN 2014-08-12 2014  NA    4511       NA        NA
14   HartogMild 2014-08-12 2014  NA    4525       NA        NA
15 HartogSevere 2014-08-12 2014  NA    3167       NA        NA
16   HartogSowN 2014-07-31 2014  NA    2837       NA        NA
17   HartogMild 2014-07-31 2014  NA    2444       NA        NA
18 HartogSevere 2014-07-31 2014  NA    1940       NA        NA
19   HartogSowN 2014-11-10 2014  NA      NA       NA    8457.4
20   HartogMild 2014-11-10 2014  NA      NA       NA    8662.4
21 HartogSevere 2014-11-10 2014  NA      NA       NA    8537.8
22   HartogSowN 2014-11-10 2014  NA      NA       NA        NA
23   HartogMild 2014-11-10 2014  NA      NA       NA        NA
24 HartogSevere 2014-11-10 2014  NA      NA       NA        NA

structure(list(Title = c("HartogSowN", "HartogMild", "HartogSevere", 
"HartogSowN", "HartogMild", "HartogSevere", "HartogSowN",
"HartogMild",  "HartogSevere", "HartogSowN", "HartogMild",
"HartogSevere", "HartogSowN",  "HartogMild", "HartogSevere",
"HartogSowN", "HartogMild", "HartogSevere",  "HartogSowN",
"HartogMild", "HartogSevere", "HartogSowN", "HartogMild", 
"HartogSevere"), Date = structure(c(16282, 16282, 16282, 16294, 
16294, 16294, 16384, 16384, 16384, 16384, 16384, 16384, 16294,  16294,
16294, 16282, 16282, 16282, 16384, 16384, 16384, 16384,  16384,
16384), class = "Date"), year = c(2014, 2014, 2014, 2014,  2014, 2014,
2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014,  2014, 2014,
2014, 2014, 2014, 2014, 2014, 2014, 2014), lai = c(4.4, 
3.7, 2.3, 6.1, 6.6, 3.8, NA, NA, NA, NA, NA, NA, NA, NA, NA,  NA, NA, NA, NA, NA, NA, NA, NA, NA), biomass = c(NA, NA, NA,  NA, NA, NA,
16116, 18224, 18184, NA, NA, NA, 4511, 4525, 3167,  2837, 2444, 1940,
NA, NA, NA, NA, NA, NA), grain_wt = c(NA, NA,  NA, NA, NA, NA, NA, NA,
NA, 0.041, 0.04, 0.038, NA, NA, NA, NA,  NA, NA, NA, NA, NA, NA, NA,
NA), wet_yield = c(NA, NA, NA, NA,  NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, 8457.4, 
8662.4, 8537.8, NA, NA, NA)), .Names = c("Title", "Date", "year",  "lai", "biomass", "grain_wt", "wet_yield"), row.names = c(NA,  24L),
class = "data.frame")

I want to collapse the rows such that all data for a given Title and Date combination are on one row with the extra rows removed. I have found answers to similar questions, but they all involve modifying the original data.

Desired output:

       Title       Date    year lai biomass grain_wt wet_yield
1    HartogSowN 2014-07-31 2014 4.4    2837       NA        NA
2    HartogMild 2014-07-31 2014 3.7    2444       NA        NA
3  HartogSevere 2014-07-31 2014 2.3    1940       NA        NA
4    HartogSowN 2014-08-12 2014 6.1    4511       NA        NA
5    HartogMild 2014-08-12 2014 6.6    4525       NA        NA
6  HartogSevere 2014-08-12 2014 3.8    3167       NA        NA
7    HartogSowN 2014-11-10 2014  NA   16116    0.041    8457.4
8    HartogMild 2014-11-10 2014  NA   18224    0.040    8662.4
9  HartogSevere 2014-11-10 2014  NA   18184    0.038    8537.8
22   HartogSowN 2014-11-10 2014  NA      NA       NA        NA
23   HartogMild 2014-11-10 2014  NA      NA       NA        NA
24 HartogSevere 2014-11-10 2014  NA      NA       NA        NA

with the extra rows that held the biomass, grain_wt and wet_yield being removed.

UPDATE: Thanks Pascal, yes the days should match, my mistake. I have updated the desired result.

UPDATE 2: Added full desired output for clarity.

Upvotes: 1

Views: 302

Answers (2)

Parfait
Parfait

Reputation: 107567

Consider the following base R solution using aggregate(). Below uses median as the function but any aggregate should work (mean, min, max, etc.) but NAs will be handled differently.

# AGGREGATED DF 
collapsedf <- aggregate(list(lai=df$lai,
                             biomass=df$biomass, 
                             grain_wt=df$grain_wt, 
                             wet_yield=df$wet_yield), 
                        list(Title=df$Title, Date=df$Date, year=df$year), 
                        FUN=median, na.rm=TRUE)

Or as @thelatemail simplifies:

collapsedf <- aggregate(df[c("lai","biomass","grain_wt","wet_yield")], 
                        df[c("Title","Date","year")], FUN=median, na.rm=TRUE)

OUTPUT

    Title           Date        year    lai   biomass    grain_wt   wet_yield
1   HartogMild      11/10/2014  2014    NA    18224      0.040      8662.4
2   HartogSevere    11/10/2014  2014    NA    18184      0.038      8537.8
3   HartogSowN      11/10/2014  2014    NA    16116      0.041      8457.4
4   HartogMild       7/31/2014  2014    3.7   2444       NA         NA
5   HartogSevere     7/31/2014  2014    2.3   1940       NA         NA
6   HartogSowN       7/31/2014  2014    4.4   2837       NA         NA
7   HartogMild       8/12/2014  2014    6.6   4525       NA         NA
8   HartogSevere     8/12/2014  2014    3.8   3167       NA         NA
9   HartogSowN       8/12/2014  2014    6.1   4511       NA         NA

Upvotes: 3

thelatemail
thelatemail

Reputation: 93803

Assuming that there is only one valid piece of data for each column for each Title/Date combination, you can get your desired result with aggregate:

aggregate(. ~ Title + Date + year, data=df,
          FUN=function(x) x[!is.na(x)][1], na.action=na.pass)

#         Title       Date year lai biomass grain_wt wet_yield
#1   HartogMild 2014-07-31 2014 3.7    2444       NA        NA
#2 HartogSevere 2014-07-31 2014 2.3    1940       NA        NA
#3   HartogSowN 2014-07-31 2014 4.4    2837       NA        NA
#4   HartogMild 2014-08-12 2014 6.6    4525       NA        NA
#5 HartogSevere 2014-08-12 2014 3.8    3167       NA        NA
#6   HartogSowN 2014-08-12 2014 6.1    4511       NA        NA
#7   HartogMild 2014-11-10 2014  NA   18224    0.040    8662.4
#8 HartogSevere 2014-11-10 2014  NA   18184    0.038    8537.8
#9   HartogSowN 2014-11-10 2014  NA   16116    0.041    8457.4

This uses the Title + Date + year as grouping variables, processing all remaining data columns .

The function simply returns the one non-missing piece of data - !is.na(x) for each column, within each group.

The [1] is needed to ensure that an NA is returned if there are no non-missing pieces of data. E.g. - numeric(0)[1] returns NA.

The na.action=na.pass is needed because aggregate when used with a y ~ x formula will throw out all rows with NA values by default - na.action=na.omit is the default.

Upvotes: 3

Related Questions