Workhorse

Reputation: 1560

Row Means based on Column Substring

I have a dataframe that looks like this:

df <- data.frame("CB_1.1"=c(0,5,6,2), "CB_1.16"=c(1,5,3,6), "HC_2.11"=c(3,3,4,5), "HC_1.12"=c(2,3,4,5), "HC_1.13"=c(1,0,0,5))

> df
  CB_1.1 CB_1.16 HC_2.11 HC_1.12 HC_1.13
1      0       1       3       2       1
2      5       5       3       3       0
3      6       3       4       4       0
4      2       6       5       5       5

I would like to take the row-wise mean across columns that share the same column-name prefix, i.e. the part of the name before the ".". The result should be a dataframe like this:

  CB_1 HC_2 HC_1
1  0.5    3  1.5
2  5.0    3  1.5
3  4.5    4  2.0
4  4.0    5  5.0

You'll notice that the column HC_2.11 values remain the same, because no other column has HC_2 in this dataframe.

Any help would be appreciated!

Upvotes: 4

Views: 285

Answers (4)

G. Grothendieck

Reputation: 269854

1) apply/tapply For each row, call tapply with the name prefixes as INDEX and mean as the function, then transpose the result. No packages are used.

prefix <- sub("\\..*", "", names(df))
t(apply(df, 1, tapply, prefix, mean))

giving this matrix (wrap it in data.frame(...) if you need a data frame result):

     CB_1 HC_1 HC_2
[1,]  0.5  1.5    3
[2,]  5.0  1.5    3
[3,]  4.5  2.0    4
[4,]  4.0  5.0    5
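
If the column order of the question's expected output (CB_1, HC_2, HC_1) matters, one possible variation is to pass tapply a factor whose levels follow the order of first appearance. This is only a sketch: the names f and res are illustrative and not part of the answer above.

```r
df <- data.frame("CB_1.1" = c(0, 5, 6, 2), "CB_1.16" = c(1, 5, 3, 6),
                 "HC_2.11" = c(3, 3, 4, 5), "HC_1.12" = c(2, 3, 4, 5),
                 "HC_1.13" = c(1, 0, 0, 5))

prefix <- sub("\\..*", "", names(df))

# factor levels in order of first appearance instead of alphabetical order
f <- factor(prefix, levels = unique(prefix))

# same apply/tapply idea as above, wrapped in as.data.frame
res <- as.data.frame(t(apply(df, 1, tapply, f, mean)))
res
```

The only change from the code above is the grouping variable; tapply orders its result by the factor levels, so the columns come out in the order the prefixes first occur.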

2) lm Regress t(df) on prefix (defined in (1)). The +0 in the formula omits the intercept. The transpose of the coefficients is the required matrix, m; the second line strips the leading "prefix" from its column names. No packages are used.

m <- t(coef(lm(t(df) ~ prefix + 0)))
colnames(m) <- sub("prefix", "", colnames(m))
m

giving this matrix

     CB_1 HC_1 HC_2
[1,]  0.5  1.5    3
[2,]  5.0  1.5    3
[3,]  4.5  2.0    4
[4,]  4.0  5.0    5

This follows from the facts that (1) the model matrix X contains only ones and zeros and (2) distinct columns of it are orthogonal. The model matrix is shown here:

X <- model.matrix(~ prefix + 0) # model matrix
X

giving:

  prefixCB_1 prefixHC_1 prefixHC_2
1          1          0          0
2          1          0          0
3          0          0          1
4          0          1          0
5          0          1          0
attr(,"assign")
[1] 1 1 1
attr(,"contrasts")
attr(,"contrasts")$prefix
[1] "contr.treatment"

Because the columns of the model matrix X are orthogonal, the coefficient for any column x of X and any particular row y of df (a column of t(df)) is just sum(x * y) / sum(x * x). Since x is a 0/1 vector, that equals the mean of the values of y corresponding to the 1's in x.
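
The argument can be checked numerically. The following is a small sketch for the first row of df and the HC_1 column of X; the variable names y, x, by_formula, by_mean and by_lm are chosen here for illustration.

```r
df <- data.frame("CB_1.1" = c(0, 5, 6, 2), "CB_1.16" = c(1, 5, 3, 6),
                 "HC_2.11" = c(3, 3, 4, 5), "HC_1.12" = c(2, 3, 4, 5),
                 "HC_1.13" = c(1, 0, 0, 5))
prefix <- sub("\\..*", "", names(df))
X <- model.matrix(~ prefix + 0)

# off-diagonal entries of X'X are zero, i.e. distinct columns are orthogonal
XtX <- crossprod(X)

y <- unlist(df[1, ])        # first row of df as a numeric vector
x <- X[, "prefixHC_1"]      # 0/1 indicator of the HC_1 columns

by_formula <- sum(x * y) / sum(x * x)   # projection coefficient
by_mean    <- mean(y[x == 1])           # plain group mean
by_lm      <- unname(coef(lm(y ~ prefix + 0))["prefixHC_1"])

c(by_formula, by_mean, by_lm)
```

All three values should agree for any row and any column of X, which is why the regression reproduces the group means.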

3) stack/tapply Convert to long form, adding an id column at the same time, then use tapply with mean to convert back to wide form. No packages are used.

long <- transform(stack(df), ind = sub("\\..*", "", ind), id = c(row(df)))
with(long, tapply(values, long[c("id", "ind")], mean))

giving this table. Wrap it in as.data.frame.matrix if you want a data.frame.

   ind
id  CB_1 HC_1 HC_2
  1  0.5  1.5    3
  2  5.0  1.5    3
  3  4.5  2.0    4
  4  4.0  5.0    5
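
For completeness, here is a sketch of the same round trip run end to end, including the conversion to a data frame mentioned above. The names wide and res are illustrative only.

```r
df <- data.frame("CB_1.1" = c(0, 5, 6, 2), "CB_1.16" = c(1, 5, 3, 6),
                 "HC_2.11" = c(3, 3, 4, 5), "HC_1.12" = c(2, 3, 4, 5),
                 "HC_1.13" = c(1, 0, 0, 5))

# long form: one row per cell, with the column prefix and the original row id
long <- transform(stack(df), ind = sub("\\..*", "", ind), id = c(row(df)))

# back to wide form: mean of values for each (id, prefix) pair
wide <- with(long, tapply(values, long[c("id", "ind")], mean))

# convert the resulting table to an ordinary data frame
res <- as.data.frame.matrix(wide)
res
```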

Upvotes: 3

ThomasIsCoding

Reputation: 102241

  • Here is a base R solution using rowMeans + split.default, i.e.,
dfout <- as.data.frame(Map(rowMeans, split.default(df,factor(s <- gsub("\\..*$","",names(df)), levels = unique(s)))))

such that

> dfout
  CB_1 HC_2 HC_1
1  0.5    3  1.5
2  5.0    3  1.5
3  4.5    4  2.0
4  4.0    5  5.0

  • If the order of the columns does not matter, you can use the shorter code below
dfout <- as.data.frame(Map(rowMeans,split.default(df,gsub("\\..*$","",names(df)))))

such that

> dfout
  CB_1 HC_1 HC_2
1  0.5  1.5    3
2  5.0  1.5    3
3  4.5  2.0    4
4  4.0  5.0    5
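
The one-liners are dense, so here is the first variant unpacked into separate steps. This is only a sketch of the same idea; the intermediate names prefix, groups and pieces are illustrative.

```r
df <- data.frame("CB_1.1" = c(0, 5, 6, 2), "CB_1.16" = c(1, 5, 3, 6),
                 "HC_2.11" = c(3, 3, 4, 5), "HC_1.12" = c(2, 3, 4, 5),
                 "HC_1.13" = c(1, 0, 0, 5))

# column-name prefix: everything before the first "."
prefix <- sub("\\..*$", "", names(df))

# factor with levels in order of first appearance, to keep the column order
groups <- factor(prefix, levels = unique(prefix))

# split.default treats the data frame as a list of columns,
# so this returns one smaller data frame per prefix
pieces <- split.default(df, groups)

# row means within each piece, reassembled into a data frame
dfout <- as.data.frame(lapply(pieces, rowMeans))
dfout
```

Calling split.default rather than split matters here: split on a data frame dispatches to a method that splits rows, whereas the default method splits the columns.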

Upvotes: 2

Matt

Reputation: 2987

A base option could be:

# find the unique column-name prefixes by splitting on "."

cols <- unique(sapply(strsplit(names(df), ".", fixed = TRUE), `[`, 1))

# for each prefix, take the row means of the columns whose names start with that prefix followed by "."

as.data.frame(sapply(cols, function(x) rowMeans(df[startsWith(names(df), paste0(x, "."))])))

  CB_1 HC_2 HC_1
1  0.5    3  1.5
2  5.0    3  1.5
3  4.5    4  2.0
4  4.0    5  5.0

Upvotes: 1

tmfmnk

Reputation: 40051

One option involving dplyr and purrr could be:

library(dplyr)
library(purrr)

map_dfc(.x = unique(sub("\\..*$", "", names(df))),
        ~ df %>%
         transmute(!!.x := rowMeans(select(., starts_with(.x)))))

  CB_1 HC_2 HC_1
1  0.5    3  1.5
2  5.0    3  1.5
3  4.5    4  2.0
4  4.0    5  5.0

Upvotes: 1
