How to select and apply a function to multiple columns within a list of dataframes

Question

I am using a list of two dataframes that share several similar columns, and I want to be able to convert the class of several columns in each dataframe in one shot using their column names and not the column position.

I’ve searched Stack Overflow and found similar questions here: Using lists to change columns in multiple dataframes in R and here: applying a function for a list of dataframes. However, I am stuck trying to use multiple column names to convert the dates. Here is some sample data to illustrate my problem:

df1 <- data.frame("t1" = c(20070103, 20070104, 20070105, 20070108, 20070109), "t2" = c(20070110,20070111, 20070112, 20070113, 20070114), A = 1:5)
df2 <- data.frame("t1" = c(20080103, 20080104, 20080105, 20080108, 20080109), "t2" = c(20080110,20080111, 20080112, 20080113, 20080114), B = 1:5)
l <- list(df1 = df1, df2=df2)

So far I’ve found two solutions which I can repeat for every column I want to convert to a date:

#1
l2 <-lapply(l, function(x) transform(x, t1 = as.Date(as.character(t1), "%Y%m%d")))

#2
f <- function(df){
    within(df, t1 <- as.Date(as.character(t1), "%Y%m%d"))
}
l2 <- lapply(l, f)

However, is there a way I can use either method to convert multiple columns (not the entire dataframe or list) in one shot, using column names? I’ve tried the following code to no avail:

periods <- c( "t1", "t2" )
ls2 <-lapply(ls, function(x) transform(x, periods = as.Date(as.character(periods), "%Y%m%d")) 

f <- function(df) {
     within(df, t1 <- as.Date(as.character(t1), "%Y%m%d"))
     within(df, t2 <- as.Date(as.character(t2), "%Y%m%d"))
         }
l2 <- lapply(l, f)

for (i in periods)
    l2 <-lapply(l, function(x) transform(x, i = as.Date(as.character(i), "%Y%m%d")))

r2evans · Accepted Answer

Suggestion #1, simple:
```
lapply(l, function(dfrm, periods, fmt) {
    for (ff in which(colnames(dfrm) %in% periods))
        dfrm[,ff] <- as.Date(as.character(dfrm[,ff]), fmt)
    dfrm
}, periods=c('t1', 't2'), fmt='%Y%m%d')
```
Using ff in which(...) lets us list column names that may or may not be present: if some or all of them are missing from a particular data.frame, those columns are simply skipped and nothing else changes.

The extra arguments to lapply after the function, periods=c('t1','t2') and fmt='%Y%m%d', are passed through to the function. This lets you specify the column names and format and bring them (cleanly) into the function body, rather than having the inside of the loop reach outside for data, something that will bite you if/when you copy/paste the code into a different project.
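
As a quick check of suggestion #1, here is a minimal sketch on a shortened version of the question's data. The helper name convert_dates is made up for illustration, and df2 deliberately lacks t2 (an assumption, not the question's data) to show that missing columns are skipped:

```r
# Shortened sample data; df2 has no t2 column (hypothetical variation)
df1 <- data.frame(t1 = c(20070103, 20070104), t2 = c(20070110, 20070111), A = 1:2)
df2 <- data.frame(t1 = c(20080103, 20080104), B = 1:2)
l <- list(df1 = df1, df2 = df2)

# Same loop as suggestion #1, given a (hypothetical) name so it can be reused
convert_dates <- function(dfrm, periods, fmt) {
    for (ff in which(colnames(dfrm) %in% periods))
        dfrm[, ff] <- as.Date(as.character(dfrm[, ff]), format = fmt)
    dfrm
}

l2 <- lapply(l, convert_dates, periods = c("t1", "t2"), fmt = "%Y%m%d")

# Inspect the resulting column classes
sapply(l2$df1, class)
sapply(l2$df2, class)
```

The t1 and t2 columns should come back as Date, while A and B keep their original class.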

Suggestion #2, try to convert all columns:
```
lapply(l, function(dfrm, fmt) {
    for (cc in seq_len(ncol(dfrm)))
        if (! is.na(as.Date(as.character(dfrm[1,cc]), format=fmt)))
            dfrm[,cc] <- as.Date(as.character(dfrm[,cc]), format=fmt)
    dfrm
}, fmt='%Y%m%d')
```
This can fail if you have other columns that could be inferred as dates (using these heuristics) but aren't intended as such.

I limit the check to the first row so that the test itself does not become a bottleneck on large data.
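
To make the false-positive risk concrete, here is a small sketch with made-up data in which the id column holds identifiers that only happen to look like yyyymmdd dates (the column and its values are assumptions for illustration):

```r
# Hypothetical data: `id` is an identifier, not a date
df <- data.frame(t1 = c(20070103, 20070104),
                 id = c(20010101, 20021231),
                 A  = 1:2)

out <- lapply(list(df = df), function(dfrm, fmt) {
    for (cc in seq_len(ncol(dfrm)))
        # Only the first row decides whether the column is treated as dates
        if (!is.na(as.Date(as.character(dfrm[1, cc]), format = fmt)))
            dfrm[, cc] <- as.Date(as.character(dfrm[, cc]), format = fmt)
    dfrm
}, fmt = "%Y%m%d")

sapply(out$df, class)
# t1 and id are both converted to "Date"; A stays "integer"
```

Here id is converted even though it was never meant to be a date, which is exactly the case to watch for.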

Suggestion #3, same thing, but more robust to false-alarms:
```
lapply(l, function(dfrm, fmt) {
    for (cc in seq_len(ncol(dfrm))) {
        tmp <- as.Date(as.character(dfrm[,cc]), format=fmt)
        if (! any(is.na(tmp))) dfrm[,cc] <- tmp
    }
    dfrm
}, fmt='%Y%m%d')
```
Alright, we've reduced the number of false-alarms by checking that every value converts to a date, but this means that if any one cell fails in an otherwise valid column of dates, the whole column is left unconverted. You can perhaps get around this by checking the number or percentage of failures, but now we're getting a bit ridiculous ...
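
For what it's worth, the "percentage of failures" idea could be sketched roughly as below. The function name convert_if_mostly_dates and the max_fail threshold are hypothetical, and the sample data (with one deliberately invalid value, month 13) is made up:

```r
# Convert a column only if the share of failed conversions is at most `max_fail`
# (`max_fail` is an illustrative parameter, not part of the suggestions above)
convert_if_mostly_dates <- function(dfrm, fmt, max_fail = 0.25) {
    for (cc in seq_len(ncol(dfrm))) {
        tmp <- as.Date(as.character(dfrm[, cc]), format = fmt)
        # Count only failures caused by the conversion, not pre-existing NAs
        failed <- is.na(tmp) & !is.na(dfrm[, cc])
        if (length(failed) > 0 && mean(failed) <= max_fail)
            dfrm[, cc] <- tmp
    }
    dfrm
}

# Hypothetical data: the last t1 value has month 13 and cannot be parsed
df <- data.frame(t1 = c(20070103, 20070104, 20070105, 20070108, 20071345),
                 A  = 1:5)
out <- convert_if_mostly_dates(df, fmt = "%Y%m%d")
sapply(out, class)
```

With one failure in five rows (20%), t1 is still converted and the bad cell becomes NA, while A fails everywhere and is left alone. Picking a sensible threshold is a judgment call that depends on the data.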

Suggestion #4, using regular expressions on the column names:

```
lapply(l, function(dfrm, regex, fmt) {
    for (cc in grep(regex, colnames(dfrm)))
        dfrm[,cc] <- as.Date(as.character(dfrm[,cc]), format=fmt)
    dfrm
}, regex='^t[0-9]+$', fmt='%Y%m%d')
```

This may spark other questions if you aren't comfortable with regular expressions.
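
If the pattern is unfamiliar: '^t[0-9]+$' matches names that consist of a t followed by one or more digits and nothing else. A small sketch with made-up column names shows which ones it would pick up:

```r
# Hypothetical column names to test the pattern against
nms <- c("t1", "t2", "t10", "A", "total", "t1_raw")
regex <- "^t[0-9]+$"

matched <- nms[grepl(regex, nms)]
matched
# "t1" "t2" "t10"
```

Names such as total or t1_raw are rejected because of the ^ and $ anchors, which require the whole name to match.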

These could have been done with a nested *apply instead of a for loop, but since R is now performing quite well with loops like this, I don't think it's a big concern. (It will depend on the size of your data.)
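
For reference, a nested lapply version of suggestion #1 might look something like this sketch, again on a shortened copy of the question's data. The use of intersect() and the length guard are my own choices here, not something the suggestions above require:

```r
# Shortened sample data from the question
df1 <- data.frame(t1 = c(20070103, 20070104), t2 = c(20070110, 20070111), A = 1:2)
df2 <- data.frame(t1 = c(20080103, 20080104), t2 = c(20080110, 20080111), B = 1:2)
l <- list(df1 = df1, df2 = df2)

l2 <- lapply(l, function(dfrm, periods, fmt) {
    # Keep only the requested columns that actually exist in this data.frame
    cols <- intersect(periods, colnames(dfrm))
    if (length(cols))
        # Inner lapply converts each selected column; assigning back keeps the names
        dfrm[cols] <- lapply(dfrm[cols], function(x) as.Date(as.character(x), format = fmt))
    dfrm
}, periods = c("t1", "t2"), fmt = "%Y%m%d")

str(l2$df1)
```

Whether this reads better than the for loop is largely a matter of taste.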

If you're confident in the naming convention for the column headers, then #4 might be your answer. If not (or you aren't comfortable with regular expressions) but you are confident that the non-date columns will not be misconstrued as dates, then #2 or #3 work well, too.
