Reputation: 447
I have a dataframe that looks like this:
W01 0.750000 0.916667 0.642857 1.000000 0.619565
W02 0.880000 0.944444 0.500000 0.991228 0.675439
W03 0.729167 0.900000 0.444444 1.000000 0.611111
W04 0.809524 0.869565 0.500000 1.000000 0.709091
W05 0.625000 0.925926 0.653846 1.000000 0.589286
Variation 1_941119_A/G 1_942335_C/G 1_942451_T/C 1_942934_G/C \
W01 0.967391 0.965909 1 0.130435
W02 0.929825 0.937500 1 0.184211
W03 0.925926 0.880000 1 0.138889
W04 0.918182 0.907407 1 0.200000
W05 0.901786 0.858491 1 0.178571
Variation 1_944296_G/A ... X_155545046_C/T X_155774775_G/T \
W01 0.978261 ... 0.652174 0.641304
W02 0.938596 ... 0.728070 0.736842
W03 0.944444 ... 0.675926 0.685185
W04 0.927273 ... 0.800000 0.690909
W05 0.901786 ... 0.794643 0.705357
Variation Y_5100327_G/T Y_5100614_T/G Y_12786160_G/A Y_12914512_C/A \
W01 0.807692 0.800000 0.730769 0.807692
W02 0.655172 0.653846 0.551724 0.666667
W03 0.880000 0.909091 0.833333 0.916667
W04 0.666667 0.642857 0.580645 0.678571
W05 0.730769 0.720000 0.692308 0.720000
Variation Y_13470103_G/A Y_19705901_A/G Y_20587967_A/C mean_age
W01 0.807692 0.666667 0.333333 56.3
W02 0.678571 0.520000 0.250000 66.3
W03 0.916667 0.764706 0.291667 69.7
W04 0.666667 0.560000 0.322581 71.6
W05 0.703704 0.600000 0.346154 72.5
[5 rows x 67000 columns]
I would like to fit a simple Least squares linear regression and Thiel-Sen linear regression for each column as an independent variable and mean-age as the response variable and gather summary statistics including the slope
, intercept
, r value
, p value
and std err
for each fit and preferably gathers the outputs as a datafarme!
So far, I have been slicing my 'df' and carrying out regression analysis for each column separately:
from scipy import stats
import time
# Start timer
start_time = time.time()
# Select only 'Variation of interest' and 'mean_age' columns
r1 = tdf [['1_944296_G/A', 'mean_age']]
# Use scipy lingress function to perform linear regression
slope, intercept, r_value, p_value, std_err = stats.linregress(tdf['mean_age'], \
tdf['1_69270_A/G'])
print('The p-value between the 2 variables is measured as ' + str(p_value) + '\n')
print('Least squares linear model coefficients, intercept = ' + str(intercept) + \
'. Slope = ' + str(slope)+'\n')
# Create regression line
regressLine = intercept + tdf['mean_age']*slope
# Regression using Theil-Sen with 95% confidence intervals
res = stats.theilslopes(tdf['1_69270_A/G'], tdf['mean_age'], 0.95)
print('Thiel-Sen linear model coefficients, intercept = ' + str(res[1]) + '. Slope = ' + \
str(res[0]) +'\n')
# Scatter plot the temperature
plt.clf()
plt.scatter(tdf['mean_age'], tdf['1_69270_A/G'], s = 3, label = 'Allele frequency')
# Add least squares regression line
plt.plot(tdf['mean_age'], regressLine, label = 'Least squares regression line');
# Add Theil-Sen regression line
plt.plot(tdf['mean_age'], res[1] + res[0] * tdf['mean_age'], 'r-', label = 'Theil-Sen regression line')
# Add Theil-Sen confidence intervals
plt.plot(tdf['mean_age'], res[1] + res[2] * tdf['mean_age'], 'r--', label = 'Theil-Sen 95% confidence interval')
plt.plot(tdf['mean_age'], res[1] + res[3] * tdf['mean_age'], 'r--')
# Add legend, axis limits and save to png
plt.legend(loc = 'upper left')
#plt.ylim(7,14); plt.xlim(1755, 2016)
plt.xlabel('Year'); plt.ylabel('Temperature (C)')
plt.savefig('pythonRegress.png')
# End timer
end_time = time.time()
print('Elapsed time = ' + str(end_time - start_time) + ' seconds')
I was wondering how I could carry out this analysis in an iterative loop for each column and gather the final results in a comprehensive dataframe.
I have seen [this](Looping regression and obtaining summary statistics in matrix form"Looping regression and obtaining summary statistics in matrix form ")! but not quite the output I expect. Any solution in Python or R is appreciated!
Upvotes: 1
Views: 1287
Reputation: 1111
I do not know the exact detail and complexity of your data and analysis, but this is the approach I would take.
data <- data.frame(mean_age=rnorm(5),
Column_1=rnorm(5),
Column_2=rnorm(5),
Column_3=rnorm(5),
Column_4=rnorm(5),
Column_5=rnorm(5)
)
data
looped <- list()
for(each_col in names(data)[-1]){
looped[[each_col]] <- lm(get(each_col) ~ mean_age, data)
}
looped
Upvotes: 1
Reputation: 33782
I think you will find this guide useful: Running a model on separate groups.
Let's generate some example data similar to yours, with values for two variants and mean age. We also need a few packages:
library(dplyr)
library(tidyr)
library(purrr)
library(broom)
set.seed(1001)
data1 <- data.frame(mean_age = sample(40:80, 50, replace = TRUE),
snp01 = rnorm(50),
snp02 = rnorm(50))
The first step is to transform from "wide" to "long" format using gather
, so as variant names are in one column and values in another. Then we can nest
by variant name.
data1 %>%
gather(snp, value, -mean_age) %>%
nest(-snp)
This creates a tibble (a special data frame) where the second column, data
is a "list column" - it contains mean ages and the values for the variant in that row:
# A tibble: 2 x 2
snp data
<chr> <list>
1 snp01 <tibble [50 x 2]>
2 snp02 <tibble [50 x 2]>
Now we use purrr::map
to create a third column with the linear model for each row:
data1 %>%
gather(snp, value, -mean_age) %>%
nest(-snp) %>%
mutate(model = map(data, ~lm(mean_age ~ value, data = .)))
Result:
# A tibble: 2 x 3
snp data model
<chr> <list> <list>
1 snp01 <tibble [50 x 2]> <lm>
2 snp02 <tibble [50 x 2]> <lm>
The last step is to summarise the models as desired, then unnest
the data structure. I'm using broom::glance()
. The full procedure:
data1 %>%
gather(snp, value, -mean_age) %>%
nest(-snp) %>%
mutate(model = map(data, ~lm(mean_age ~ value, data = .)),
summary = map(model, glance)) %>%
select(-data, -model) %>%
unnest(summary)
Result:
# A tibble: 2 x 12
snp r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int>
1 snp01 0.00732 -0.0134 12.0 0.354 0.555 2 -194. 394. 400. 6901. 48
2 snp02 0.0108 -0.00981 12.0 0.524 0.473 2 -194. 394. 400. 6877. 48
Upvotes: 2