gkohrell

Reputation: 42

Performing multiple two sample t-tests on two lists of data frames

I have two lists with four data frames each. The data frames in the first list ("loc_list_OBS") have only two columns, "Year" and "Mean_Precip", while the data frames in the second list ("loc_list_future") have 33 columns: "Year" followed by mean precipitation values for 32 different models.

So the data frames in loc_list_OBS look like this, with the data running through Year 2005:

Year     Mean_Precip
1950    799.1309
1951    748.0239
1952    619.7572
1953    799.9263
1954    680.9194
1955    766.2304
1956    599.5365
1957    717.8912
1958    739.4901
1959    707.1130
...     ....
2005    ....

And the data frames in loc_list_future look like this, but with 32 Model columns in total and data running through Year 2059:

Year   Model 1      Model 2      Model 3    ...... Model 32
2020    714.1101    686.5888    1048.4274
2021    1018.0095    766.9161     514.2700
2022    756.7066    902.2542     906.2877
2023    906.9675    919.5234     647.6630
2024    767.4008    861.1275     700.2612
2025    876.1538    738.8370     664.3342
2026    781.5092    801.2387     743.8965
2027    876.3522    819.4323     675.3022
2028    626.9468    927.0774     696.1884
2029    752.4084    824.7682     835.1566
....    .....       .....         .....
2059    .....       .....         .....

Each data frame represents a geographic location, and the two lists have the same four locations but one list is for observed values and the other is for predicted future values.

I would like to run two sample t-tests that compare the observed values with the predicted future values for each model at each location. Put another way, I want to compare the first data frame in each list, then the second data frame in each list, and the same with the third and fourth data frames.

Here is the code I have used:

t_stat = NULL
mapply(FUN = function(f, o) {
 t_stat <- t.test(o$Mean_Precip, f, alternative = "two.sided")  
}, f = loc_list_ttest, o = loc_list_OBS, SIMPLIFY = FALSE)
t_stat

This code gives me only four t-test outputs, each comparing the "Mean_Precip" column in the observed data with what appears to be a combination of all the models in the future data. However, I need a t-test for each model at each location. Can anyone figure out how to do this?

Upvotes: 1

Views: 810

Answers (2)

dcarlson

Reputation: 11056

Here is a way of doing what you want. Note, however, that if the projections were based on the observations, the validity of the p-values is suspect because the two "samples" are not independent.

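# For each location y, test the observed Mean_Precip against every model column (column 1 is Year)
results <- lapply(1:4, function(y) lapply(loc_list_future[[y]][, -1],
      function(x) t.test(loc_list_OBS[[y]]$Mean_Precip, x)))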
names(results) <- c("Region 1", "Region 2", "Region 3", "Region 4")

results will be a list containing four lists, one for each region. Within each region list will be a list for each model. results[[1]] gives you the results for all models in region 1 and results[[1]][[1]] gives you the results for region 1 model 1.
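
As a minimal sketch of how that nested list can be indexed, the snippet below rebuilds the same structure on toy data and pulls out p-values. The toy values, the two-region setup, and the names Model.1, Model.2 and p_region1 are assumptions made for illustration only; they are not the real data.

```r
# Toy stand-ins for the question's lists (hypothetical values, two regions only)
set.seed(1)
loc_list_OBS <- replicate(2, data.frame(Year = 1950:1959,
                                        Mean_Precip = rnorm(10, 720, 60)),
                          simplify = FALSE)
loc_list_future <- replicate(2, data.frame(Year = 2020:2029,
                                           Model.1 = rnorm(10, 800, 90),
                                           Model.2 = rnorm(10, 820, 70)),
                             simplify = FALSE)

# Same nested lapply pattern, comparing Mean_Precip with each model column
results <- lapply(seq_along(loc_list_future), function(y)
  lapply(loc_list_future[[y]][, -1],
         function(x) t.test(loc_list_OBS[[y]]$Mean_Precip, x)))
names(results) <- c("Region 1", "Region 2")

# A single test: region 1, model 1
results[["Region 1"]][["Model.1"]]$p.value

# All p-values for one region, as a named numeric vector
p_region1 <- sapply(results[["Region 1"]], function(tt) tt$p.value)
p_region1
```

Each element is an ordinary "htest" object, so components such as $statistic, $conf.int and $estimate can be extracted the same way.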

Upvotes: 0

Duck

Reputation: 39595

You can tackle the issue with an approach like this. As I understand it, you want to compare each data frame in the first list with its counterpart in the second list and obtain a t-test for each model column of the second data frame. One approach is to write a function that loops over the variables in the second data frame and saves the results in a list. You will end up with four lists, one per location, each containing all the t-tests for that location. I have created dummy data based on what you shared:

#Data
df <- structure(list(Year = c(1950L, 1951L, 1952L, 1953L, 1954L, 1955L, 
1956L, 1957L, 1958L, 1959L, 2005L), Mean_Precip = c(799.1309, 
748.0239, 619.7572, 799.9263, 680.9194, 766.2304, 599.5365, 717.8912, 
739.4901, 707.113, 707.113)), class = "data.frame", row.names = c(NA, 
-11L))
#Data2
df1 <- structure(list(Year = c(2020L, 2021L, 2022L, 2023L, 2024L, 2025L, 
2026L, 2027L, 2028L, 2029L, 2059L), Model.1 = c(714.1101, 1018.0095, 
756.7066, 906.9675, 767.4008, 876.1538, 781.5092, 876.3522, 626.9468, 
752.4084, 752.4084), Model.2 = c(686.5888, 766.9161, 902.2542, 
919.5234, 861.1275, 738.837, 801.2387, 819.4323, 927.0774, 824.7682, 
824.7682), Model.3 = c(1048.4274, 514.27, 906.2877, 647.663, 
700.2612, 664.3342, 743.8965, 675.3022, 696.1884, 835.1566, 835.1566
)), class = "data.frame", row.names = c(NA, -11L))

Now we create the lists (you should already have these):

#Lists
List1 <- list(df1=df,df2=df,df3=df,df4=df)
List2 <- list(df1=df1,df2=df1,df3=df1,df4=df1)

Here is the function:

#Function
myfun <- function(x,y)
{
  #Observed values from the first data frame
  l <- x$Mean_Precip
  #Empty list to collect one test per model
  List <- list()
  #Loop over the model columns (column 1 is Year)
  for(i in 2:ncol(y))
  {
    #Label
    val <- names(y)[i]
    r <- y[,i]
    #Test
    test <- t.test(l, r, alternative = "two.sided")
    #Save under the model's name
    List[[val]] <- test
  }
  return(List)
}

Finally, we apply:

#Apply
t.stat <- mapply(FUN = myfun,x=List1,y=List2,SIMPLIFY = FALSE)

The output is a list of lists, and you can explore each element as follows:

t.stat[[1]]

There you will find the results of comparing the first data frame against each model column of the second data frame:

Output:

$Model.1

    Welch Two Sample t-test

data:  l and r
t = -2.2645, df = 16.448, p-value = 0.03738
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -165.949710   -5.657818
sample estimates:
mean of x mean of y 
 716.8302  802.6339 


$Model.2

    Welch Two Sample t-test

data:  l and r
t = -3.5901, df = 19.56, p-value = 0.001881
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -170.75516  -45.13574
sample estimates:
mean of x mean of y 
 716.8302  824.7756 


$Model.3

    Welch Two Sample t-test

data:  l and r
t = -0.72149, df = 13.829, p-value = 0.4826
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -138.01368   68.59334
sample estimates:
mean of x mean of y 
 716.8302  751.5403 
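
With 32 models at four locations, printing every test becomes unwieldy, so it may help to collapse the nested list into one table. The following is only a sketch under assumptions: it builds a small toy t.stat with hypothetical values so that it runs on its own, and the names summary_tab, obs and the column labels are illustrative. In practice you would reuse the t.stat object created above.

```r
# Toy stand-in for t.stat: a named list of locations, each holding a named
# list of "htest" objects (hypothetical values, two locations and two models)
set.seed(42)
obs <- rnorm(11, 720, 60)
t.stat <- lapply(c(df1 = 1, df2 = 2), function(k)
  lapply(list(Model.1 = rnorm(11, 800, 90), Model.2 = rnorm(11, 820, 70)),
         function(m) t.test(obs, m, alternative = "two.sided")))

# Build one row per location/model pair and stack the pieces
summary_tab <- do.call(rbind, lapply(names(t.stat), function(loc) {
  tests <- t.stat[[loc]]
  data.frame(
    location = loc,
    model    = names(tests),
    t        = unname(sapply(tests, function(tt) tt$statistic)),
    df       = unname(sapply(tests, function(tt) tt$parameter)),
    p_value  = unname(sapply(tests, function(tt) tt$p.value)),
    row.names = NULL
  )
}))
summary_tab
```

The resulting data frame can then be sorted or filtered by p_value like any other table.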

Upvotes: 3
