user2113499

Reputation: 1011

Applying a function to every two columns in R

Is there a way to use the apply function to every two columns in a data frame? If I have the data frame

dat <- data.frame(A=rnorm(100), B=rnorm(100),C=rnorm(100), D=rnorm(100))

A           B            C          D
0.1511642 -0.44930197  1.821832535  2.0145395
-1.1639599  0.42685832 -0.763015835 -0.7785278
0.8430158  0.26827386 -0.004560031  0.8823789
0.7103298  0.78512673 -0.968510541  0.5172418
0.8508458  0.05809655  0.391845531  0.7452540
0.2217195 -0.06988857  0.714890499 -1.1536502

and I want the sum of each column, I can use

apply(dat,2,sum)

but what if I want to apply a function over every two columns? For example

coefficients(lm(dat$A~dat$B))
coefficients(lm(dat$C~dat$D))

I have 400 columns and don't want to write this out 200 times, once for each pair of columns. I thought a for loop using columns j and j+1 could work, but I want the relationship between columns A and B, then C and D, then E and F, and so on; not A and B, then B and C, then C and D. Is there a way to do this with apply() or another function in the apply family?

Upvotes: 1

Views: 1305

Answers (4)

user10917479

Reputation:

This is something completely different. You can create a list of formulas for each pairing based on the names. Then just iterate over each formula on the same data set.

dat <- data.frame(ID1.score1=rnorm(100), ID1.score2=rnorm(100),ID2.score1=rnorm(100), ID2.score2=rnorm(100))

ids <- unique(sub("\\..*", "", names(dat)))
f <- lapply(paste0(ids, ".score2 ~ ", ids, ".score1"), as.formula)

models <- lapply(f, function(f) lm(f, dat))

Then you can just extract or do what you want with the list of models.

model_coef <- sapply(models, coef)
colnames(model_coef) <- ids

model_coef

                    ID1         ID2
(Intercept) -0.07592376 -0.02472962
ID1.score1  -0.02284805  0.09144416
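Because `models` is an ordinary list, other statistics can be pulled out the same way as the coefficients. A minimal sketch, assuming the same `ID<n>.score1` / `ID<n>.score2` naming as above; the seed and the name `model_r2` are made up for illustration:

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
dat <- data.frame(ID1.score1 = rnorm(100), ID1.score2 = rnorm(100),
                  ID2.score1 = rnorm(100), ID2.score2 = rnorm(100))

# Same steps as above: one formula and one fitted model per ID
ids <- unique(sub("\\..*", "", names(dat)))
f <- lapply(paste0(ids, ".score2 ~ ", ids, ".score1"), as.formula)
models <- lapply(f, function(fm) lm(fm, data = dat))

# One R^2 value per model, labelled by ID (model_r2 is an illustrative name)
model_r2 <- setNames(sapply(models, function(m) summary(m)$r.squared), ids)
model_r2
```

Any function that accepts a fitted `lm` object can be substituted for the R^2 extraction.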

Upvotes: 0

G. Grothendieck

Reputation: 269471

Create a grouping vector g, split on it and lapply lm over it.

Note that if d = data.frame(y, x) for response y and predictor x, then lm(d) is the regression lm(y ~ x, d).

n <- ncol(dat)
g <- rep(1:n, each = 2, length = n) # 1 1 2 2 
L <- lapply(split.default(dat, g), lm)

sapply(L, coef) # coefficients
sapply(L, function(x) summary(x)$r.squared) # R^2
# etc.
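The equivalence between passing a two-column data frame to `lm` and writing the formula out can be checked directly. A small sketch on made-up data (the names `d`, `fit_df` and `fit_formula` are illustrative only):

```r
set.seed(1)  # arbitrary seed
d <- data.frame(y = rnorm(20), x = rnorm(20))

fit_df      <- lm(d)               # first column is taken as the response
fit_formula <- lm(y ~ x, data = d) # explicit formula for comparison

# Compare the two sets of coefficients
same_coef <- all.equal(coef(fit_df), coef(fit_formula))
same_coef
```

This is why the column order within each pair matters: the first column of each split piece becomes the response.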

It could also be done over the names:

L2 <- lapply(split.default(names(dat), g), function(nms) lm(dat[nms]))
sapply(L2, coef)

or, if you want a nicer Call: line in the output:

reg <- function(nms, dat) do.call("lm", list(reformulate(nms[2], nms[1]), quote(dat)))
L2 <- lapply(split.default(names(dat), g), reg, dat = dat)
sapply(L2, coef)

Note that variables in lm formulas cannot start with a digit so you may need to rename your columns if this requirement is violated. If you use the lm(dat) form then this is not a requirement but if you use a formula it is. See Note for examples.
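If the columns do start with a digit, one option besides adding a prefix by hand is base R's `make.names()`, which converts strings into syntactically valid names. A sketch with hypothetical column names (`nms`, `valid` and `fm` are illustrative):

```r
# Hypothetical column names that start with a digit
nms <- c("1234.score1", "1234.score2", "5678.score1", "5678.score2")

# make.names() prepends "X" where the first character is not allowed
valid <- make.names(nms)
valid

# The cleaned names can now be used to build a formula for the first pair
fm <- reformulate(valid[2], response = valid[1])
fm
```

The data frame itself would need the same renaming, for example `names(dat) <- make.names(names(dat))`, before the formula is passed to `lm`.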

Note

Regarding the comment under the question about the form of the names: if the names were as shown below, we could alternatively form g using this code:

# modify test example
s <- c("1234.score1", "1234.score2", "5678.score1", "5678.score2")
dat2 <- setNames(dat, s)

g <- cumsum(sub(".*\\D", "", names(dat2)) == 1)  # 1 1 2 2
L <- lapply(split.default(dat2, g), lm)
sapply(L, coef)

or we could use this (however, this will cause the output to be sorted by g):

# modify column names
dat3 <- dat2
names(dat3) <- paste0("x", names(dat3))

g <- sub("\\..*", "", names(dat3)) # x1234 x1234 x5678 x5678
reg <- function(nms, dat) do.call("lm", list(reformulate(nms[2], nms[1]), quote(dat)))
L2 <- lapply(split.default(names(dat3), g), reg, dat = dat3)
sapply(L2, coef)
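The reordering happens because `split` converts a character `g` into a factor, and factor levels are sorted by default. If the original column order matters, one possible workaround (not part of the answer above) is to pass a factor whose levels follow the order of first appearance. A sketch on made-up data in which the columns are deliberately not in sorted order:

```r
set.seed(1)  # arbitrary seed
dat3 <- data.frame(x5678.score1 = rnorm(50), x5678.score2 = rnorm(50),
                   x1234.score1 = rnorm(50), x1234.score2 = rnorm(50))

g <- sub("\\..*", "", names(dat3))          # "x5678" "x5678" "x1234" "x1234"
g_ordered <- factor(g, levels = unique(g))  # levels in order of first appearance

sorted_names   <- names(split.default(dat3, g))          # groups come out sorted
original_names <- names(split.default(dat3, g_ordered))  # groups keep column order
sorted_names
original_names
```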

Upvotes: 4

Andrew

Reputation: 5138

You could use mapply / Map to repeat a function every two columns by subsetting your dataframe every two columns. Hope this helps!

Using lm

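# TRUE/FALSE are spelled out because T and F are ordinary variables that can be reassigned
lm_list <- Map(function(y, x) summary(lm(y~x))$coefficients, dat[c(TRUE,FALSE)], dat[c(FALSE,TRUE)])
names(lm_list) <- paste0(names(dat[c(TRUE,FALSE)]), " ~ ", names(dat[c(FALSE,TRUE)]))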
lm_list

$`A ~ B`
              Estimate Std. Error   t value  Pr(>|t|)
(Intercept) 0.03566648  0.1051079 0.3393320 0.7350857
x           0.03602569  0.1162846 0.3098062 0.7573662

$`C ~ D`
                Estimate Std. Error     t value  Pr(>|t|)
(Intercept) -0.008610382  0.1021835 -0.08426389 0.9330185
x           -0.053369101  0.1171255 -0.45565742 0.6496444

Data:

set.seed(42)
dat <- data.frame(A=rnorm(100), B=rnorm(100),C=rnorm(100), D=rnorm(100))
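The alternating logical index used in this answer works because R recycles a length-two logical vector across all columns, so one index picks columns 1, 3, 5, ... and the other picks 2, 4, 6, .... A small sketch with a toy data frame (`toy`, `odd_cols` and `even_cols` are illustrative names), assuming an even number of columns:

```r
toy <- data.frame(A = 1:3, B = 4:6, C = 7:9, D = 10:12)

# The two-element index is recycled over the four columns
odd_cols  <- names(toy[c(TRUE, FALSE)])  # columns 1 and 3
even_cols <- names(toy[c(FALSE, TRUE)])  # columns 2 and 4
odd_cols
even_cols
```

With an odd number of columns the two subsets have different lengths, so the pairs passed to `Map` would no longer line up as intended.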

Upvotes: 2

user10917479

Reputation:

You can take advantage of the naming convention to first stack the data and then operate on the groups of common IDs. This may make things easier for future analysis.

I modified the column names per the comment.

dat <- data.frame(ID1.score1=rnorm(100), ID1.score2=rnorm(100),ID2.score1=rnorm(100), ID2.score2=rnorm(100))

library(dplyr)
library(stringr)
library(purrr)

Split the column names at ".". The first half gives the IDs; the second half specifies score1 or score2 (i.e., X or Y).

cols <- str_split(names(dat), "\\.", simplify = TRUE)
ids <- unique(cols[,1])
scores <- unique(cols[,2])

Using purrr, iterate through the IDs and select the column pair that starts with that. Add another column to this new data.frame to store the ID. Then stack all of these by rows. Now we have a "tidy" formatted dataset.

stacked_dat <- ids %>%
  map_dfr(~ {
    # match on "ID." so that e.g. ID1 does not also pick up ID10, ID11, ...
    select(dat, starts_with(paste0(.x, "."))) %>%
      set_names(scores) %>%
      mutate(id = .x)})

Now just group on the ID column and fit the model for each ID.

fits <- stacked_dat %>%
  group_by(id) %>%
  do(model = lm(score1 ~ score2, data = .))

The fitted models are stored in a list column, which you can access as shown below. The broom package might help stack and clean up the model statistics, with the help of purrr.

fits$model
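As a rough illustration of the broom suggestion, the sketch below tidies each fitted model into a data frame of coefficients and stacks the results. It assumes broom is installed, builds its own small stacked data set, and uses base R for the per-ID fitting so that it is self-contained; `stacked`, `fits_by_id` and `coef_table` are made-up names.

```r
library(broom)  # assumed to be installed

set.seed(1)  # arbitrary seed
# Made-up stacked data in the same shape as stacked_dat above
stacked <- data.frame(id = rep(c("ID1", "ID2"), each = 50),
                      score1 = rnorm(100), score2 = rnorm(100))

# Fit one model per id
fits_by_id <- lapply(split(stacked, stacked$id),
                     function(d) lm(score1 ~ score2, data = d))

# tidy() returns one row per coefficient; tag each block with its id and stack
coef_table <- do.call(rbind, lapply(names(fits_by_id), function(id) {
  out <- as.data.frame(tidy(fits_by_id[[id]]))
  out$id <- id
  out
}))
coef_table
```

The same `tidy()` call should apply to each element of `fits$model`, since those are ordinary `lm` objects.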

Upvotes: 0
