Reputation: 5897

Calculating the Medians and Means of Rows (in R)

I am using the R programming language. Suppose I have the following data ("my_data"):

   student first_run second_run third_run fourth_run fifth_run sixth_run seventh_run eight_run ninth_run tenth_run
1   student1  19.70847   21.79771  16.49083   19.51691  13.97987  14.60733    13.89703  15.24651  20.75679  18.44020
2   student2  11.22369   15.36253  16.90215   20.20724  15.90227  15.14539    13.74945  18.30090  19.55124  17.24132
3   student3  15.93649   17.03599  14.20214   13.17548  14.70327  15.49697    13.08945  19.94142  22.41674  17.37958
4   student4  16.18733   15.13197  14.79481   16.75177  14.51287  17.71816    13.45054  14.25553  19.89091  18.88981
5   student5  18.71084   18.85453  17.15864   19.38880  15.68862  18.39169    15.26428  16.04526  18.92532  16.62409
6   student6  19.75246   12.74605  18.52214   17.92626  14.48501  17.20780    13.10512  12.46502  20.68583  15.87711
7   student7  14.75144   23.82376  18.51366   20.77424  14.22155  16.08186    12.95981  12.67820  20.12166  15.66006
8   student8  17.06516   15.63075  13.72026   15.02068  14.21098  15.99414    14.64818  16.15603  21.74607  17.07382
9   student9  20.27611   12.44592  12.26502   15.13456  14.61552  18.72192    15.11129  17.60746  18.83831  17.55257
10 student10  17.70736   16.21620  14.10861   17.20014  16.59376  19.50027    13.05073  15.80002  18.09781  18.34313

I want to add 2 columns to this data:

my_mean : the mean of each row
my_median: the median of each row

I tried the following code in R:

my_data$median = apply(my_data, 1, median, na.rm=T)

my_data$mean = apply(my_data, 1, mean, na.rm=T)

But I don't think this code is correct. For instance, when using this code, the median of the second row of data is returned as "16.90215"

But when I manually take the median of this row:

median(11.22369  , 15.36253 , 16.90215 ,  20.20724,  15.90227 , 15.14539   , 13.74945 , 18.30090 , 19.55124 , 17.24132)

I get an answer of

11.22

Can someone please show me what I am doing wrong?

Thanks

Upvotes: 1

Answers (4)

jay.sf

Reputation: 72758

You could benefit from the speed of the matrixStats package (d is the data frame given under "Data" below).

matrixStats::rowMedians(as.matrix(d[-1]))
# [1] 17.46551 16.40221 15.71673 15.65965 17.77517 16.54246 15.87096 15.81245 16.34356 16.89695
matrixStats::rowMeans2(as.matrix(d[-1]))
# [1] 17.44417 16.35862 16.33775 16.15837 17.50521 16.27728 16.95862 16.12661 16.25687 16.66180

# Check that matrixStats and apply() give the same results
stopifnot(all.equal(matrixStats::rowMedians(as.matrix(d[-1])),
                    as.numeric(apply(d[-1], 1, median, na.rm = TRUE))))
stopifnot(all.equal(matrixStats::rowMeans2(as.matrix(d[-1])),
                    as.numeric(apply(d[-1], 1, mean, na.rm = TRUE))))
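For comparison, the same two columns can be sketched in base R without extra packages: rowMeans covers the mean directly, and apply on the numeric columns covers the median. The toy data frame and the column names my_mean and my_median are made up for illustration; they are not the OP's data.

```r
# Hypothetical toy data: one label column, three numeric columns
toy <- data.frame(student = c("s1", "s2"),
                  run1 = c(1, 4), run2 = c(2, 5), run3 = c(9, 6))

# Drop the character column before doing any arithmetic
num <- as.matrix(toy[-1])

toy$my_mean   <- rowMeans(num, na.rm = TRUE)            # 4 and 5
toy$my_median <- apply(num, 1, median, na.rm = TRUE)    # 2 and 5
toy
```

Building num once and computing both statistics from it keeps the newly added my_mean column out of the median calculation.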

Data:

d <- structure(list(student = c("student1", "student2", "student3", 
"student4", "student5", "student6", "student7", "student8", "student9", 
"student10"), first_run = c(19.70847, 11.22369, 15.93649, 16.18733, 
18.71084, 19.75246, 14.75144, 17.06516, 20.27611, 17.70736), 
    second_run = c(21.79771, 15.36253, 17.03599, 15.13197, 18.85453, 
    12.74605, 23.82376, 15.63075, 12.44592, 16.2162), third_run = c(16.49083, 
    16.90215, 14.20214, 14.79481, 17.15864, 18.52214, 18.51366, 
    13.72026, 12.26502, 14.10861), fourth_run = c(19.51691, 20.20724, 
    13.17548, 16.75177, 19.3888, 17.92626, 20.77424, 15.02068, 
    15.13456, 17.20014), fifth_run = c(13.97987, 15.90227, 14.70327, 
    14.51287, 15.68862, 14.48501, 14.22155, 14.21098, 14.61552, 
    16.59376), sixth_run = c(14.60733, 15.14539, 15.49697, 17.71816, 
    18.39169, 17.2078, 16.08186, 15.99414, 18.72192, 19.50027
    ), seventh_run = c(13.89703, 13.74945, 13.08945, 13.45054, 
    15.26428, 13.10512, 12.95981, 14.64818, 15.11129, 13.05073
    ), eight_run = c(15.24651, 18.3009, 19.94142, 14.25553, 16.04526, 
    12.46502, 12.6782, 16.15603, 17.60746, 15.80002), ninth_run = c(20.75679, 
    19.55124, 22.41674, 19.89091, 18.92532, 20.68583, 20.12166, 
    21.74607, 18.83831, 18.09781), tenth_run = c(18.4402, 17.24132, 
    17.37958, 18.88981, 16.62409, 15.87711, 15.66006, 17.07382, 
    17.55257, 18.34313)), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10"))

Upvotes: 1

TarJae

Reputation: 78927

Here is an alternative using pmap from purrr. It passes all numeric columns of a row to the function at once, and the function collects them through the ellipsis (...). The resulting list column then needs to be unnested with unnest_wider from tidyr:

library(tidyr)
library(dplyr)
library(purrr)
df %>% 
  mutate(res = pmap(across(where(is.numeric)),
                    ~ list(median = median(c(...)),
                           avg = mean(c(...))))) %>%
  unnest_wider(res)

output:

  student   first_run second_run third_run fourth_run fifth_run sixth_run seventh_run eight_run ninth_run tenth_run median   avg
   <chr>         <dbl>      <dbl>     <dbl>      <dbl>     <dbl>     <dbl>       <dbl>     <dbl>     <dbl>     <dbl>  <dbl> <dbl>
 1 student1       19.7       21.8      16.5       19.5      14.0      14.6        13.9      15.2      20.8      18.4   17.5  17.4
 2 student2       11.2       15.4      16.9       20.2      15.9      15.1        13.7      18.3      19.6      17.2   16.4  16.4
 3 student3       15.9       17.0      14.2       13.2      14.7      15.5        13.1      19.9      22.4      17.4   15.7  16.3
 4 student4       16.2       15.1      14.8       16.8      14.5      17.7        13.5      14.3      19.9      18.9   15.7  16.2
 5 student5       18.7       18.9      17.2       19.4      15.7      18.4        15.3      16.0      18.9      16.6   17.8  17.5
 6 student6       19.8       12.7      18.5       17.9      14.5      17.2        13.1      12.5      20.7      15.9   16.5  16.3
 7 student7       14.8       23.8      18.5       20.8      14.2      16.1        13.0      12.7      20.1      15.7   15.9  17.0
 8 student8       17.1       15.6      13.7       15.0      14.2      16.0        14.6      16.2      21.7      17.1   15.8  16.1
 9 student9       20.3       12.4      12.3       15.1      14.6      18.7        15.1      17.6      18.8      17.6   16.3  16.3
10 student10      17.7       16.2      14.1       17.2      16.6      19.5        13.1      15.8      18.1      18.3   16.9  16.7
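To see what the ellipsis does in isolation, here is a minimal sketch on a made-up toy data frame (the names toy, r1, r2 and r3 are hypothetical). It uses pmap_dbl, the variant of pmap that returns a numeric vector, so no unnesting is needed.

```r
library(purrr)

# Hypothetical toy data: each row is passed to the function as separate arguments
toy <- data.frame(r1 = c(1, 4), r2 = c(2, 5), r3 = c(9, 6))

# c(...) gathers the per-row arguments back into one numeric vector
row_medians <- pmap_dbl(toy, ~ median(c(...)))
row_medians   # medians of (1, 2, 9) and (4, 5, 6): 2 and 5
```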

Upvotes: 1

LMc

Reputation: 18632

library(dplyr)

df %>% 
  rowwise() %>% 
  mutate(median = median(c_across(where(is.numeric))),
         mean = mean(c_across(where(is.numeric))))

c_across and rowwise were created for this type of situation. Most dplyr verbs work column-wise; to change this behavior, pipe to rowwise first.

c_across then combines all numeric values in a row (hence where(is.numeric)) into a numeric vector, to which mean or median can be applied.

Note: You will likely want to pipe the output to ungroup, since rowwise creates a row-wise grouped data frame.
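A minimal sketch of the full pipeline including the ungroup step, on a made-up toy data frame (toy, run1 to run3, my_median and my_mean are hypothetical names). It selects the columns by name with starts_with instead of by type.

```r
library(dplyr)

# Hypothetical toy data: one label column, three numeric columns
toy <- data.frame(student = c("s1", "s2"),
                  run1 = c(1, 4), run2 = c(2, 5), run3 = c(9, 6))

res <- toy %>%
  rowwise() %>%
  mutate(my_median = median(c_across(starts_with("run"))),
         my_mean   = mean(c_across(starts_with("run")))) %>%
  ungroup()   # drop the row-wise grouping again

res
```

Selecting by name is a deliberate choice in this sketch: a column created earlier in the same mutate call is numeric too, so where(is.numeric) may pick it up in later expressions, whereas a name-based selection keeps it out.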

Upvotes: 1

akrun

Reputation: 887078

The manual calculation is incorrect. The first argument of median is x, which should be a vector; the second argument is na.rm, followed by the variadic arguments (...). So when you write median(11.22369, 15.36253, ...), only 11.22369 is taken as x, and that is the value returned. Instead, the values should be combined into a single vector with c:

median(c(11.22369  , 15.36253 , 16.90215 ,  20.20724,  15.90227 , 15.14539   , 13.74945 , 18.30090 , 19.55124 , 17.24132))
[1] 16.40221

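Also, with the OP's data, the first column (student), which is character or factor, should be dropped. Otherwise apply converts the whole data frame to a character matrix, and median and mean are no longer computed on numbers: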

 apply(my_data[-1], 1, median, na.rm=TRUE)
       1        2        3        4        5        6        7        8        9       10 
17.46551 16.40221 15.71673 15.65965 17.77517 16.54246 15.87096 15.81245 16.34356 16.89695

The second value, 16.40221, matches the corrected manual calculation for the second row.
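Both points can be checked on a small made-up example (the toy data frame below is hypothetical, not the OP's data). This is only a sketch of the behaviour described above.

```r
# Hypothetical toy data: one label column, three numeric columns
toy <- data.frame(student = c("s1", "s2"),
                  run1 = c(1, 4), run2 = c(2, 5), run3 = c(9, 6))

# With the character column included, the matrix that apply() works on is character
is.character(as.matrix(toy))       # TRUE
is.numeric(as.matrix(toy[-1]))     # TRUE

# median() matches arguments by position: x = 1, na.rm = 2, the rest goes to `...`
median(1, 2, 9)                    # 1
median(c(1, 2, 9))                 # 2
```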

Upvotes: 2
