Mal_a
Mal_a

Reputation: 3760

R: median of specific value rows

I need to get a median value of rows based on another column (SEQ). Let me explain You on base of the sample data:

data <- structure(list(DATUM = structure(c(1335558400, 1335558400, 1335558400, 
                                           1335558400, 1335562429, 1335562429, 1335562429, 1335562429, 1335562429, 
                                           1335562429, 1335562429, 1335562429, 1335562429, 1335562429, 1335562429, 
                                           1335567274, 1335567274, 1335567274, 1335567274, 1335567274, 1335567274, 
                                           1335567274, 1335567274, 1335567274, 1335567274, 1335567274, 1335681543, 
                                           1335681543, 1335681543, 1335681543), class = c("POSIXct", "POSIXt"
                                           )), CHGNR = c(200028, 200028, 200028, 200028, 200029, 200029, 
                                                         200029, 200029, 200029, 200029, 200029, 200029, 200029, 200029, 
                                                         200029, 200029, 200029, 200029, 200029, 200029, 200029, 200029, 
                                                         200029, 200029, 200029, 200029, 200057, 200057, 200057, 200057
                                           ), SEQ = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
                                                      2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 2L, 2L, 
                                                      2L
                                          ), PROBE = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 
                                                       2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 1, 2, 2, 
                                                     2
                                          ), METHODE = c("a", "a", "b", "b", "a", "a", "a", "a", "a", "b", "b", "b", "b", "a", "a",
                                                         "a", "a", "a", "a", "b", "b", "b", "b", "b", "b", "a", "a", "a", "b", "b"
                                          ),MITTELWERT = c(2.5,1.67, 0, 0, 0.5, 0.333333333333333, 0.5, 0, 0, 0, 0, 
                                                             0, 5, 0, 0.833333333333333, 0, 0, 0, 0, 0, 0.5, 1, 0, 0, 
                                                             0, 0.5, 0.666666666666667,1, 0, 0)
                       ), .Names = c("DATUM", "CHGNR", "SEQ","PROBE","METHODE", "MITTELWERT"), 
                  row.names = c(NA,30L), class = "data.frame")

So the data looks like that:

                 DATUM  CHGNR SEQ PROBE METHODE MITTELWERT
1  2012-04-27 22:26:40 200028   1     1       a  2.5000000
2  2012-04-27 22:26:40 200028   1     1       a  1.6700000
3  2012-04-27 22:26:40 200028   1     1       b  0.0000000
4  2012-04-27 22:26:40 200028   1     1       b  0.0000000
5  2012-04-27 23:33:49 200029   1     1       a  0.5000000
6  2012-04-27 23:33:49 200029   1     1       a  0.3333333
7  2012-04-27 23:33:49 200029   2     2       a  0.5000000
8  2012-04-27 23:33:49 200029   2     2       a  0.0000000
9  2012-04-27 23:33:49 200029   2     2       a  0.0000000
10 2012-04-27 23:33:49 200029   2     2       b  0.0000000
11 2012-04-27 23:33:49 200029   2     2       b  0.0000000
12 2012-04-27 23:33:49 200029   2     2       b  0.0000000
13 2012-04-27 23:33:49 200029   2     2       b  5.0000000
14 2012-04-27 23:33:49 200029   2     2       a  0.0000000
15 2012-04-27 23:33:49 200029   2     2       a  0.8333333
16 2012-04-28 00:54:34 200029   3     2       a  0.0000000
17 2012-04-28 00:54:34 200029   3     2       a  0.0000000
18 2012-04-28 00:54:34 200029   3     2       a  0.0000000
19 2012-04-28 00:54:34 200029   3     2       a  0.0000000
20 2012-04-28 00:54:34 200029   3     2       b  0.0000000
21 2012-04-28 00:54:34 200029   3     3       b  0.5000000
22 2012-04-28 00:54:34 200029   3     3       b  1.0000000
23 2012-04-28 00:54:34 200029   3     3       b  0.0000000
24 2012-04-28 00:54:34 200029   3     3       b  0.0000000
25 2012-04-28 00:54:34 200029   3     3       b  0.0000000
26 2012-04-28 00:54:34 200029   3     3       a  0.5000000
27 2012-04-29 08:39:03 200057   1     1       a  0.6666667
28 2012-04-29 08:39:03 200057   2     2       a  1.0000000
29 2012-04-29 08:39:03 200057   2     2       b  0.0000000
30 2012-04-29 08:39:03 200057   2     2       b  0.0000000

I would like to get a new row with median of column MITTELWERT for each unique CHGNR and METHODE and ONLY for SEQ > 1 (+ the first value of the columns DATUM and PROBE). Therefore in this sample data, it would look like that:

                 DATUM  CHGNR SEQ PROBE METHODE MITTELWERT
1  2012-04-27 22:26:40 200028   1     1       a  2.5000000
2  2012-04-27 22:26:40 200028   1     1       a  1.6700000
3  2012-04-27 22:26:40 200028   1     1       b  0.0000000
4  2012-04-27 22:26:40 200028   1     1       b  0.0000000
5  2012-04-27 23:33:49 200029   1     1       a  0.5000000
6  2012-04-27 23:33:49 200029   1     1       a  0.3333333
7  2012-04-27 23:33:49 200029   2     2       a  0.5000000
8  2012-04-27 23:33:49 200029   2     2       a  0.0000000
9  2012-04-27 23:33:49 200029   2     2       a  0.0000000
10 2012-04-27 23:33:49 200029   2     2       b  0.0000000
11 2012-04-27 23:33:49 200029   2     2       b  0.0000000
12 2012-04-27 23:33:49 200029   2     2       b  0.0000000
13 2012-04-27 23:33:49 200029   2     2       b  5.0000000
14 2012-04-27 23:33:49 200029   2     2       a  0.0000000
15 2012-04-27 23:33:49 200029   2     2       a  0.8333333
16 2012-04-28 00:54:34 200029   3     2       a  0.0000000
17 2012-04-28 00:54:34 200029   3     2       a  0.0000000
18 2012-04-28 00:54:34 200029   3     2       a  0.0000000
19 2012-04-28 00:54:34 200029   3     2       a  0.0000000
20 2012-04-28 00:54:34 200029   3     2       b  0.0000000
21 2012-04-28 00:54:34 200029   3     3       b  0.5000000
22 2012-04-28 00:54:34 200029   3     3       b  1.0000000
23 2012-04-28 00:54:34 200029   3     3       b  0.0000000
24 2012-04-28 00:54:34 200029   3     3       b  0.0000000
25 2012-04-28 00:54:34 200029   3     3       b  0.0000000
26 2012-04-28 00:54:34 200029   3     3       a  0.5000000
27 2012-04-27 23:33:49 200029   >1    2       a  0.0000000 #new calculated row with median value for unique CHGNR and METHODE "a"
28 2012-04-27 23:33:49 200029   >1    2       b  0.0000000 #new calculated row with median value for unique CHGNR and METHODE "b"
...

Thanks for any tips!

Upvotes: 2

Views: 425

Answers (3)

aichao
aichao

Reputation: 7455

You can use dplyr:

library(dplyr)
out <- data %>% group_by(CHGNR,METHODE) %>% 
                summarise(DATUM=DATUM[1],PROBE=PROBE[1],MITTELWERT=median(MITTELWERT[SEQ>1]),SEQ=">1") %>%
                ungroup %>% filter(!is.na(MITTELWERT)) %>% 
                select(DATUM,CHGNR,SEQ,PROBE,METHODE,MITTELWERT) #%>%

We group_by both CHGNR and METHODE to split the rows into groups for each CHGNR and METHODE. Then summarise with median to compute the median MITTELWERT. Note that MITTELWERT is subset by SEQ>1. In summarise, we also summarize DATUM and PROBE with their first elements and we set SEQ to ">1". Finally, we ungroup, remove those rows with NA and reorder the columns to match the original data.

To add the result to data as additional rows, we need to change the SEQ column to character. Then, we can use bind_rows.

data$SEQ <- as.character(data$SEQ)
out <- bind_rows(data,out)
                 DATUM  CHGNR SEQ PROBE METHODE MITTELWERT
1  2012-04-27 16:26:40 200028   1     1       a  2.5000000
2  2012-04-27 16:26:40 200028   1     1       a  1.6700000
3  2012-04-27 16:26:40 200028   1     1       b  0.0000000
4  2012-04-27 16:26:40 200028   1     1       b  0.0000000
5  2012-04-27 17:33:49 200029   1     1       a  0.5000000
6  2012-04-27 17:33:49 200029   1     1       a  0.3333333
7  2012-04-27 17:33:49 200029   2     2       a  0.5000000
8  2012-04-27 17:33:49 200029   2     2       a  0.0000000
9  2012-04-27 17:33:49 200029   2     2       a  0.0000000
10 2012-04-27 17:33:49 200029   2     2       b  0.0000000
11 2012-04-27 17:33:49 200029   2     2       b  0.0000000
12 2012-04-27 17:33:49 200029   2     2       b  0.0000000
13 2012-04-27 17:33:49 200029   2     2       b  5.0000000
14 2012-04-27 17:33:49 200029   2     2       a  0.0000000
15 2012-04-27 17:33:49 200029   2     2       a  0.8333333
16 2012-04-27 18:54:34 200029   3     2       a  0.0000000
17 2012-04-27 18:54:34 200029   3     2       a  0.0000000
18 2012-04-27 18:54:34 200029   3     2       a  0.0000000
19 2012-04-27 18:54:34 200029   3     2       a  0.0000000
20 2012-04-27 18:54:34 200029   3     2       b  0.0000000
21 2012-04-27 18:54:34 200029   3     3       b  0.5000000
22 2012-04-27 18:54:34 200029   3     3       b  1.0000000
23 2012-04-27 18:54:34 200029   3     3       b  0.0000000
24 2012-04-27 18:54:34 200029   3     3       b  0.0000000
25 2012-04-27 18:54:34 200029   3     3       b  0.0000000
26 2012-04-27 18:54:34 200029   3     3       a  0.5000000
27 2012-04-29 02:39:03 200057   1     1       a  0.6666667
28 2012-04-29 02:39:03 200057   2     2       a  1.0000000
29 2012-04-29 02:39:03 200057   2     2       b  0.0000000
30 2012-04-29 02:39:03 200057   2     2       b  0.0000000
31 2012-04-27 17:33:49 200029  >1     1       a  0.0000000
32 2012-04-27 17:33:49 200029  >1     2       b  0.0000000
33 2012-04-29 02:39:03 200057  >1     1       a  1.0000000
34 2012-04-29 02:39:03 200057  >1     2       b  0.0000000

Upvotes: 2

T.Holme
T.Holme

Reputation: 538

This can be achieved with library(data.table).

library(data.table)    
# Create data here ----      
data <- data.table(data) # Converts to data set

crunchedData <- data[SEQ > 1,
               j = .(SEQ = ">1",MITTELWERT = median(MITTELWERT) ),
              by = .(CHGNR,METHODE)]

From here you can rbind crunchedData onto data using whatever values of DATUM and PROBE you like.

If you're not familiar with data.table, the first parameter (i) takes logic (similar to data frames) which subsets by row. j is filled with columnwise operations (where each column can be referred to by its column name) and by (or k, if you like) defines the groups to group rows by.

Upvotes: 9

Chirayu Chamoli
Chirayu Chamoli

Reputation: 2076

You can compute group median using:

library(sqldf)
df1=sqldf("select DATUM,CHGNR,SEQ,PROBE,METHODE, median(MITTELWERT) as MITTELWERT from df where SEQ>1 group by CHGNR,METHODE")

You can then do further modification to the dataframe and then do a rbind.

tail(rbind(df,df1))

               DATUM  CHGNR SEQ PROBE METHODE MITTELWERT
29 2012-04-29 12:09:03 200057   2     2       b          0
30 2012-04-29 12:09:03 200057   2     2       b          0
31 2012-04-28 04:24:34 200029   3     3       a          0
32 2012-04-28 04:24:34 200029   3     3       b          0
33 2012-04-29 12:09:03 200057   2     2       a          1
34 2012-04-29 12:09:03 200057   2     2       b          0

Upvotes: 3

Related Questions