Sander Van der Zeeuw

Reputation: 1092

How to divide all columns by the sum of each column

I have a data set that needs some simple normalization. I want to calculate the column sums with colSums(DF) and then divide every value in a column by that column's sum. This is what I did, and it seems to work, but I cannot tell whether the correct column sum was used for each column. My data frame looks like this:

structure(list(`2E` = c(28L, 9736L, 20L, 221L, 349L, 21L), `2I` = c(42L, 
8254L, 0L, 292L, 106L, 0L), `6E` = c(49L, 4303L, 0L, 1L, 258L, 
0L), `6I` = c(0L, 3409L, 0L, 70L, 92L, 0L), `15E` = c(0L, 4178L, 
0L, 121L, 106L, 12L), `15I` = c(0L, 3L, 0L, 0L, 0L, 0L), `16E` = c(25L, 
9715L, 4L, 167L, 533L, 30L), `16I` = c(0L, 5082L, 12L, 112L, 
35L, 0L), `18E` = c(0L, 7425L, 0L, 134L, 324L, 0L), `18I` = c(0L, 
15822L, 0L, 565L, 78L, 0L), `20E` = c(0L, 69881L, 0L, 2240L, 
3764L, 189L), `20I` = c(0L, 27718L, 0L, 837L, 312L, 239L), `21E` = c(0L, 
8841L, 5L, 241L, 458L, 12L), `21I` = c(0L, 308L, 0L, 9L, 14L, 
0L), `22E` = c(52L, 34347L, 0L, 523L, 1861L, 44L), `22I` = c(0L, 
4202L, 0L, 152L, 58L, 0L), `23E` = c(0L, 3742L, 0L, 30L, 185L, 
0L), `23I` = c(31L, 3766L, 0L, 108L, 38L, 12L), `25E` = c(0L, 
3647L, 0L, 26L, 189L, 0L), `25I` = c(0L, 11243L, 0L, 903L, 85L, 
168L), `26E` = c(0L, 8162L, 0L, 56L, 753L, 0L), `26I` = c(0L, 
6325L, 3L, 229L, 85L, 0L), `27E` = c(22L, 7548L, 0L, 119L, 213L, 
0L), `27I` = c(4L, 8949L, 0L, 1009L, 114L, 0L), `28E` = c(0L, 
6103L, 0L, 100L, 319L, 68L), `28I` = c(0L, 13306L, 0L, 582L, 
57L, 0L), `29E` = c(0L, 3608L, 9L, 54L, 142L, 27L), `29I` = c(0L, 
5035L, 0L, 138L, 84L, 0L), `30E` = c(0L, 27795L, 0L, 593L, 1680L, 
35L), `30I` = c(0L, 5506L, 0L, 146L, 75L, 0L), `32E` = c(13L, 
12516L, 22L, 230L, 745L, 17L), `32I` = c(0L, 1271L, 0L, 29L, 
13L, 0L), `33E` = c(0L, 3551L, 0L, 0L, 148L, 0L), `33I` = c(0L, 
15957L, 0L, 550L, 1L, 0L), `34E` = c(0L, 1852L, 0L, 18L, 138L, 
0L), `34I` = c(0L, 10469L, 0L, 243L, 119L, 0L), `35E` = c(0L, 
9570L, 0L, 362L, 671L, 0L), `35I` = c(19L, 4953L, 0L, 25L, 32L, 
23L), `36E` = c(0L, 2497L, 15L, 55L, 125L, 4L), `36I` = c(0L, 
1839L, 11L, 39L, 0L, 0L), `38E` = c(0L, 940L, 0L, 38L, 50L, 0L
), `38I` = c(0L, 2301L, 0L, 60L, 14L, 8L), `39E` = c(0L, 5324L, 
0L, 107L, 92L, 41L), `39I` = c(0L, 8360L, 0L, 262L, 13L, 0L), 
    `40E` = c(15L, 6107L, 10L, 183L, 173L, 13L), `40I` = c(8L, 
    1517L, 0L, 16L, 10L, 0L), `42E` = c(0L, 14681L, 35L, 312L, 
    282L, 54L), `42I` = c(0L, 7385L, 1L, 138L, 48L, 0L)), .Names = c("2E", 
"2I", "6E", "6I", "15E", "15I", "16E", "16I", "18E", "18I", "20E", 
"20I", "21E", "21I", "22E", "22I", "23E", "23I", "25E", "25I", 
"26E", "26I", "27E", "27I", "28E", "28I", "29E", "29I", "30E", 
"30I", "32E", "32I", "33E", "33I", "34E", "34I", "35E", "35I", 
"36E", "36I", "38E", "38I", "39E", "39I", "40E", "40I", "42E", 
"42I"), row.names = c("DQ459412", "DQ459413", "DQ459415", "DQ459418", 
"DQ459419", "DQ459420"), class = "data.frame")

So I have my data frame and calculate the colSums. Then I simply did counts / colSums. Will this use all the values in colSums, or just the first one?

It is also important that each column sum is matched to the column with the same name in the count data frame: the sum of a column should be used to divide that column only.

Upvotes: 0

Views: 2250

Answers (2)

Daniel Falbel

Reputation: 1713

Look at what R does when you divide a data.frame by a vector:

> x  <-  data.frame(x = rep(1, 5), y = rep(1, 5))
> x/c(1,2)
    x   y
1 1.0 0.5
2 0.5 1.0
3 1.0 0.5
4 0.5 1.0
5 1.0 0.5

It's the same when you compute data.frame / colSums(data.frame): the column sums are recycled down the rows, not matched to the columns.
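To make the effect visible with column sums that differ, here is a minimal sketch. The data frame `df` and the names `sums` and `wrong` are made up for illustration and are not part of the question's data.

```r
# Hypothetical 4 x 2 data frame; the column sums are 4 and 8.
df <- data.frame(a = rep(1, 4), b = rep(2, 4))
sums <- colSums(df)

# The divisor is recycled down each column (4, 8, 4, 8, ...),
# so it alternates by row instead of staying fixed per column.
wrong <- df / sums
print(wrong)

# Had each column been divided by its own sum, every column of the
# result would add up to 1. Here they do not: a = 0.75, b = 1.5.
print(colSums(wrong))
```

A quick check like `colSums(result)` is one way to see whether the intended sum was used for each column.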

Upvotes: 1

nicola

Reputation: 24480

Two things you need to know to properly understand what's going on when you try to divide DF by colSums(DF).

  1. R stores its arrays in column-major order. That means that, if you have an NxM matrix, the second element of the array will be element [2,1] (and not [1,2]).

  2. Arithmetic operations in R are vectorized. When you divide a vector by another vector, R divides the first element of the first vector by the first element of the second, then the second by the second, the third by the third and so on, recycling the shorter vector if necessary.
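Both points can be checked in a few lines. This is only an illustrative sketch; the matrix `m` and the vector `recycled` are hypothetical names chosen for the example.

```r
# Column-major storage: a 2 x 3 matrix filled with 1:6.
m <- matrix(1:6, nrow = 2)
m[2] == m[2, 1]   # TRUE: the second stored element is row 2, column 1
m[3] == m[1, 2]   # TRUE: the third stored element starts column 2

# Vectorized division, recycling the shorter vector (1, 2, 1, 2).
recycled <- c(10, 20, 30, 40) / c(1, 2)
print(recycled)   # 10 10 30 20
```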

So, what happens when you try DF/colSums(DF)? The first operand is handled element by element in column-major order, like a matrix, and the second is a vector. The first element of the result will be DF[1,1]/colSums(DF)[1]. So far so good. But the second will be DF[2,1]/colSums(DF)[2], and that's not what we want! We wanted DF[2,1]/colSums(DF)[1] instead, since we are still in the first column.

If you understood what happens here, you should be able to find a way to achieve what you want.
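For readers who want to compare their own attempt, here is one possible sketch, not the only way to do it. It uses base R's `sweep()` and a double transpose; the small data frame `counts` is a hypothetical stand-in for DF.

```r
# Hypothetical 3 x 2 count table standing in for DF.
counts <- data.frame(s1 = c(1, 3, 6), s2 = c(10, 30, 60))

# sweep() with MARGIN = 2 applies one statistic per column,
# so each column is divided by its own sum.
norm_sweep <- sweep(counts, 2, colSums(counts), "/")

# Alternative: transpose so that recycling runs along the original
# columns, divide, then transpose back (this returns a matrix).
norm_t <- t(t(counts) / colSums(counts))

# Check: every normalized column should now sum to 1.
print(colSums(norm_sweep))
```

Note that a column whose sum is 0 would give NaN after the division, so such columns may need separate handling.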

Upvotes: 2
