Reputation: 1011
Is there a way to use the apply function on every two columns in a data frame? If I have the data frame
dat <- data.frame(A=rnorm(100), B=rnorm(100),C=rnorm(100), D=rnorm(100))
A B C D
0.1511642 -0.44930197 1.821832535 2.0145395
-1.1639599 0.42685832 -0.763015835 -0.7785278
0.8430158 0.26827386 -0.004560031 0.8823789
0.7103298 0.78512673 -0.968510541 0.5172418
0.8508458 0.05809655 0.391845531 0.7452540
0.2217195 -0.06988857 0.714890499 -1.1536502
and I want the sum of each column, I can use
apply(dat,2,sum)
But what if I want to apply a function to every pair of columns? For example:
coefficients(lm(dat$A~dat$B))
coefficients(lm(dat$C~dat$D))
I have 400 columns and don't want to write this out 200 times, once for each pair of columns. I thought a for loop using columns j and j+1 could work, but I want the relationship between columns A and B, then C and D, then E and F, and so on; not A and B, then B and C, then C and D. Is there a way to do this with apply() or another function in the apply family?
Upvotes: 1
Views: 1305
Reputation:
This is something completely different. You can create a list of formulas for each pairing based on the names. Then just iterate over each formula on the same data set.
dat <- data.frame(ID1.score1=rnorm(100), ID1.score2=rnorm(100),ID2.score1=rnorm(100), ID2.score2=rnorm(100))
ids <- unique(sub("\\..*", "", names(dat)))
f <- lapply(paste0(ids, ".score2 ~ ", ids, ".score1"), as.formula)
models <- lapply(f, function(f) lm(f, dat))
Then you can just extract or do what you want with the list of models.
model_coef <- sapply(models, coef)
colnames(model_coef) <- ids
model_coef
ID1 ID2
(Intercept) -0.07592376 -0.02472962
ID1.score1 -0.02284805 0.09144416
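As a rough sketch of "doing what you want with the list of models", the block below pulls one R^2 and one slope per ID into a small table. It rebuilds the same kind of data so that it runs on its own; the seed and the names r2, slope and fm are only illustrative choices, not part of the answer above.

```r
# Self-contained sketch: column names follow the ID<n>.score<k> pattern used above.
set.seed(1)  # arbitrary seed, only for reproducibility
dat <- data.frame(ID1.score1 = rnorm(100), ID1.score2 = rnorm(100),
                  ID2.score1 = rnorm(100), ID2.score2 = rnorm(100))

ids <- unique(sub("\\..*", "", names(dat)))
f <- lapply(paste0(ids, ".score2 ~ ", ids, ".score1"), as.formula)

# Name the list by ID so the results are labelled automatically
models <- setNames(lapply(f, function(fm) lm(fm, dat)), ids)

# vapply() checks that each model yields exactly one number
r2    <- vapply(models, function(m) summary(m)$r.squared, numeric(1))
slope <- vapply(models, function(m) unname(coef(m)[2]), numeric(1))

data.frame(id = ids, r.squared = r2, slope = slope, row.names = NULL)
```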
Upvotes: 0
Reputation: 269471
Create a grouping vector g, split on it and lapply lm over it.
Note that if d = data.frame(y, x) for response y and predictor x then lm(d) is the regression lm(y ~ x, d).
n <- ncol(dat)
g <- rep(1:n, each = 2, length = n) # 1 1 2 2
L <- lapply(split.default(dat, g), lm)
sapply(L, coef) # coefficients
sapply(L, function(x) summary(x)$r.squared) # R^2
# etc.
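To make the note about lm(d) concrete, here is a small check, on made-up data, that passing a two-column data frame gives the same coefficients as writing the formula out. The names fit_df and fit_formula are only illustrative.

```r
set.seed(1)  # arbitrary seed
d <- data.frame(y = rnorm(20), x = rnorm(20))

fit_df      <- lm(d)          # formula is derived from the data frame: first column ~ the rest
fit_formula <- lm(y ~ x, d)   # the same regression written explicitly

# Compare the two sets of coefficients
all.equal(coef(fit_df), coef(fit_formula))
```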
It could also be done over the names:
L2 <- lapply(split.default(names(dat), g), function(nms) lm(dat[nms]))
sapply(L2, coef)
or, if you want a nicer Call: line in the output:
reg <- function(nms, dat) do.call("lm", list(reformulate(nms[2], nms[1]), quote(dat)))
L2 <- lapply(split.default(names(dat), g), reg, dat = dat)
sapply(L2, coef)
Note that variables in lm formulas cannot start with a digit, so you may need to rename your columns if this requirement is violated. If you use the lm(dat) form then this is not a requirement, but if you use a formula it is. See Note for examples.
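If renaming is needed, one possible way (an assumption on my part, not something the answer prescribes) is base R's make.names(), which prepends "X" to names that begin with a digit. A minimal sketch on made-up data:

```r
set.seed(1)  # arbitrary seed
dat2 <- data.frame(matrix(rnorm(40), ncol = 4))
names(dat2) <- c("1234.score1", "1234.score2", "5678.score1", "5678.score2")

# Turn the names into syntactically valid ones, e.g. "1234.score1" -> "X1234.score1"
names(dat2) <- make.names(names(dat2))
names(dat2)

# The renamed columns can now be used directly in a formula
fit <- lm(X1234.score1 ~ X1234.score2, data = dat2)
coef(fit)
```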
Regarding the comment under the question about the form of the names: if the names were as shown below, we could alternatively form g using this code:
# modify test example
s <- c("1234.score1", "1234.score2", "5678.score1", "5678.score2")
dat2 <- setNames(dat, s)
g <- cumsum(sub(".*\\D", "", names(dat2)) == 1) # 1 1 2 2
L <- lapply(split.default(dat2, g), lm)
sapply(L, coef)
or we could use this (however, this will cause the output to be sorted by g):
# modify column names
dat3 <- dat2
names(dat3) <- paste0("x", names(dat3))
g <- sub("\\..*", "", names(dat3)) # x1234 x1234 x5678 x5678
reg <- function(nms, dat) do.call("lm", list(reformulate(nms[2], nms[1]), quote(dat)))
L2 <- lapply(split.default(names(dat3), g), reg, dat = dat3)
sapply(L2, coef)
Upvotes: 4
Reputation: 5138
You could use mapply / Map to repeat a function every two columns by subsetting your dataframe every two columns. Hope this helps!
Using lm
lm_list <- Map(function(y, x) summary(lm(y ~ x))$coefficients, dat[c(TRUE, FALSE)], dat[c(FALSE, TRUE)])
names(lm_list) <- paste0(names(dat[c(TRUE, FALSE)]), " ~ ", names(dat[c(FALSE, TRUE)]))
lm_list
$`A ~ B`
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.03566648 0.1051079 0.3393320 0.7350857
x 0.03602569 0.1162846 0.3098062 0.7573662
$`C ~ D`
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.008610382 0.1021835 -0.08426389 0.9330185
x -0.053369101 0.1171255 -0.45565742 0.6496444
Data:
set.seed(42)
dat <- data.frame(A=rnorm(100), B=rnorm(100),C=rnorm(100), D=rnorm(100))
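If only the coefficients are needed, a variant of the same idea is to step through the odd column positions and pair each with its right-hand neighbour, which keeps the A/B, C/D pairing without overlapping pairs. This is just a sketch; the names odd and coefs and the "slope" row label are my own choices.

```r
set.seed(42)
dat <- data.frame(A = rnorm(100), B = rnorm(100), C = rnorm(100), D = rnorm(100))

# Positions 1, 3, 5, ...: the response column of each pair
odd <- seq(1, ncol(dat), by = 2)

# Regress column j on column j + 1; sapply() binds the results into a matrix
coefs <- sapply(odd, function(j) coef(lm(dat[[j]] ~ dat[[j + 1]])))

# Label rows and columns so each pair is identifiable
dimnames(coefs) <- list(c("(Intercept)", "slope"),
                        paste(names(dat)[odd], names(dat)[odd + 1], sep = " ~ "))
coefs
```

With 400 columns this gives a 2 x 200 matrix, one column per pair.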
Upvotes: 2
Reputation:
You can take advantage of the naming convention to first stack the data and then operate on the groups of common IDs. This may make things easier for future analysis.
I modified the column names per the comment.
dat <- data.frame(ID1.score1=rnorm(100), ID1.score2=rnorm(100),ID2.score1=rnorm(100), ID2.score2=rnorm(100))
library(dplyr)
library(stringr)
library(purrr)
Split the column names at ".". The first part of each name is the ID; the second part specifies score1 or score2 (i.e., X or Y).
cols <- str_split(names(dat), "\\.", simplify = TRUE)
ids <- unique(cols[,1])
scores <- unique(cols[,2])
Using purrr, iterate through the IDs and select the column pair that starts with each ID. Add another column to this new data.frame to store the ID. Then stack all of these by rows. Now we have a "tidy" formatted dataset.
stacked_dat <- ids %>%
  map_dfr(~ {
    # Match "ID1." rather than "ID1" so that ID1 does not also pick up ID10, ID11, ...
    select(dat, starts_with(paste0(.x, "."))) %>%
      set_names(scores) %>%
      mutate(id = .x)})
Now just group on the ID column and fit the model for each ID.
fits <- stacked_dat %>%
group_by(id) %>%
do(model = lm(score1 ~ score2, data = .))
The fitted models are stored in a list column, which you can access like this. The package broom might help stack and clean things up, with the help of purrr.
fits$model
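For comparison, a rough base R counterpart of the grouped fit is sketched below. It assumes a stacked data frame with columns id, score1 and score2 like the one built above; the small data set, the seed and the name coef_table are made up for illustration.

```r
set.seed(1)  # arbitrary seed
stacked <- data.frame(id = rep(c("ID1", "ID2"), each = 50),
                      score1 = rnorm(100), score2 = rnorm(100))

# One model per ID: split the rows by id, then fit within each piece
fits <- lapply(split(stacked, stacked$id),
               function(d) lm(score1 ~ score2, data = d))

# One row of coefficients per ID
coef_table <- t(sapply(fits, coef))
coef_table
```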
Upvotes: 0