Reputation: 146
I am using a list of two dataframes that share several similar columns, and I want to be able to convert the class of several columns in each dataframe in one shot using their column names and not the column position.
I’ve searched Stack Overflow and found similar questions: "Using lists to change columns in multiple dataframes in R" and "applying a function for a list of dataframes". However, I am stuck trying to use multiple column names to convert the dates. Here is some sample data to illustrate my problem:
df1 <- data.frame("t1" = c(20070103, 20070104, 20070105, 20070108, 20070109), "t2" = c(20070110,20070111, 20070112, 20070113, 20070114), A = 1:5)
df2 <- data.frame("t1" = c(20080103, 20080104, 20080105, 20080108, 20080109), "t2" = c(20080110,20080111, 20080112, 20080113, 20080114), B = 1:5)
l <- list(df1 = df1, df2=df2)
So far I’ve found two solutions which I can repeat for every column I want to convert to a date:
#1
l2 <-lapply(l, function(x) transform(x, t1 = as.Date(as.character(t1), "%Y%m%d")))
#2
f <- function(df) {
  # as.Date (capital D) is the base R function; convert via character first
  within(df, t1 <- as.Date(as.character(t1), "%Y%m%d"))
}
l2 <- lapply(l, f)
However, is there a way to use either method to convert multiple columns (not the entire dataframe or list) in one shot, selecting them by column name? I’ve tried the following attempts to no avail:
periods <- c( "t1", "t2" )
ls2 <-lapply(ls, function(x) transform(x, periods = as.Date(as.character(periods), "%Y%m%d"))
f <- function(df) {
within(df, t1 <- as.Date(as.character(t1), "%Y%m%d"))
within(df, t2 <- as.Date(as.character(t2), "%Y%m%d"))
}
l2 <- lapply(l, f)
for (i in periods)
l2 <-lapply(l, function(x) transform(x, i = as.Date(as.character(i), "%Y%m%d")))
Upvotes: 2
Views: 2337
Reputation: 52637
l.new <- lapply(l, function(x) {x[periods] <- lapply(x[periods], as.character); x})
str(l.new)
produces
List of 2
$ df1:'data.frame': 5 obs. of 3 variables:
..$ t1: chr [1:5] "20070103" "20070104" "20070105" "20070108" ...
..$ t2: chr [1:5] "20070110" "20070111" "20070112" "20070113" ...
..$ A : int [1:5] 1 2 3 4 5
$ df2:'data.frame': 5 obs. of 3 variables:
..$ t1: chr [1:5] "20080103" "20080104" "20080105" "20080108" ...
..$ t2: chr [1:5] "20080110" "20080111" "20080112" "20080113" ...
..$ B : int [1:5] 1 2 3 4 5
UPDATE: In order to get the dates, you can use:
lapply(
l,
function(x) {
x[periods] <- lapply(x[periods], function(x) as.Date(as.character(x), format="%Y%m%d"));
x
} )
Upvotes: 1
Reputation: 160407
Suggestion #1, simple:
lapply(l, function(dfrm, periods, fmt) {
for (ff in which(colnames(dfrm) %in% periods))
dfrm[,ff] <- as.Date(as.character(dfrm[,ff]), fmt)
dfrm
}, periods=c('t1', 't2'), fmt='%Y%m%d')
Using ff in which(...) allows us to specify column headers that may or may not be present: no change is made if some or all of them are missing from a specific data.frame.
The extra arguments passed to lapply, periods=c('t1', 't2') and fmt='%Y%m%d', let you specify the column names and the format and (cleanly) bring them into the function, without having the inside of the loop reach outside for data, something that will bite you if/when you copy/paste code into a different project.
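To illustrate the point about missing columns, here is a minimal sketch of suggestion #1 run on made-up data. The names convert_cols, df_a, df_b, and dfs are hypothetical and only there for the example; df_b deliberately has no t2 column.

```r
# Hypothetical example data: df_b has no t2 column.
df_a <- data.frame(t1 = c(20070103, 20070104),
                   t2 = c(20070110, 20070111),
                   A  = 1:2)
df_b <- data.frame(t1 = c(20080103, 20080104),
                   B  = 1:2)
dfs <- list(df_a = df_a, df_b = df_b)

# Same logic as suggestion #1, pulled out into a named function.
convert_cols <- function(dfrm, periods, fmt) {
  # Only the columns that actually exist in this data.frame are touched.
  for (ff in which(colnames(dfrm) %in% periods))
    dfrm[, ff] <- as.Date(as.character(dfrm[, ff]), format = fmt)
  dfrm
}

converted <- lapply(dfs, convert_cols,
                    periods = c("t1", "t2"), fmt = "%Y%m%d")

# Inspect the resulting column classes of each data.frame.
lapply(converted, function(d) sapply(d, class))
```

In this sketch, t1 and t2 in df_a and t1 in df_b end up as Date, while df_b is otherwise returned unchanged even though t2 was requested.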
Suggestion #2, try to convert all columns:
lapply(l, function(dfrm, fmt) {
for (cc in seq.int(ncol(dfrm)))
if (! is.na(as.Date(as.character(dfrm[1,cc]), format=fmt)))
dfrm[,cc] <- as.Date(as.character(dfrm[,cc]), format=fmt)
dfrm
}, fmt='%Y%m%d')
This can fail if you have other columns that could be inferred as dates (using these heuristics) but aren't intended as such.
I limit the check to the first row for performance, in case testing every value would become a bottleneck on large amounts of data. The trade-off is that a column is judged by its first value alone.
Suggestion #3, same thing, but more robust to false-alarms:
lapply(l, function(dfrm, fmt) {
for (cc in seq.int(ncol(dfrm))) {
tmp <- as.Date(as.character(dfrm[,cc]), format=fmt)
if (! any(is.na(tmp))) dfrm[,cc] <- tmp
}
dfrm
}, fmt='%Y%m%d')
Alright, we've reduced the number of false-alarms by checking that all values convert to a date, but this means that if any one cell fails in an otherwise valid column of dates, the whole column is left unconverted. You can perhaps get around this by checking the number or percentage of failures, but now we're getting a bit ridiculous ...
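For what it's worth, the percentage-of-failures idea could look something like the sketch below. This is only an illustration under my own assumptions: the function name convert_if_mostly_dates, the max_fail argument, and the sample data are all made up, and values that were already NA count as failures here.

```r
# Hypothetical helper: convert a column when at most `max_fail`
# (a proportion between 0 and 1) of its values fail to parse.
convert_if_mostly_dates <- function(dfrm, fmt, max_fail = 0.1) {
  # Nothing to check in an empty data.frame.
  if (nrow(dfrm) == 0) return(dfrm)
  for (cc in seq_len(ncol(dfrm))) {
    tmp <- as.Date(as.character(dfrm[, cc]), format = fmt)
    # mean() of a logical vector is the proportion of TRUE values.
    if (mean(is.na(tmp)) <= max_fail) dfrm[, cc] <- tmp
  }
  dfrm
}

# Made-up data: one bad cell in an otherwise valid date column.
df <- data.frame(
  t1 = c("20070103", "20070104", "20070105", "20070108", "oops"),
  id = letters[1:5],
  stringsAsFactors = FALSE
)

# 1 failure out of 5 is 20%, which is within the 25% tolerance.
out <- convert_if_mostly_dates(df, fmt = "%Y%m%d", max_fail = 0.25)

# With zero tolerance this behaves like suggestion #3.
strict <- convert_if_mostly_dates(df, fmt = "%Y%m%d", max_fail = 0)
```

In out, t1 is converted and the bad cell becomes NA; in strict, t1 is left as character. Picking a sensible threshold is the hard part, which is why I wouldn't reach for this first.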
Suggestion #4, using regular expressions on the column names:
lapply(l, function(dfrm, regex, fmt) {
for (cc in grep(regex, colnames(dfrm)))
dfrm[,cc] <- as.Date(as.character(dfrm[,cc]), format=fmt)
dfrm
}, regex='^t[0-9]+$', fmt='%Y%m%d')
This may spark other questions if you aren't comfortable with regular expressions.
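If the pattern is unfamiliar: '^t[0-9]+$' matches names that consist of a t followed by one or more digits and nothing else. A quick way to check what a regex will pick up is to try it on a vector of names first. The names below are made up for illustration.

```r
# Hypothetical column names to test the pattern against.
cols <- c("t1", "t2", "t10", "total", "t", "A", "date_t1")

# value = TRUE returns the matching names instead of their positions.
matched <- grep("^t[0-9]+$", cols, value = TRUE)
matched
# [1] "t1"  "t2"  "t10"
```

Here "total" and "t" are skipped because at least one digit must follow the t, and "date_t1" is skipped because ^ anchors the match to the start of the name.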
These could have been done with a nested *apply instead of a for loop, but since R now performs quite well with loops like this, I don't think it's a big concern. (It will depend on the size of your data.)
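For comparison, a nested lapply version of suggestion #1 might look like the sketch below. The function name convert_dates is hypothetical, and the data is the sample from the question repeated so the block runs on its own.

```r
# Sample data from the question.
df1 <- data.frame(t1 = c(20070103, 20070104, 20070105, 20070108, 20070109),
                  t2 = c(20070110, 20070111, 20070112, 20070113, 20070114),
                  A  = 1:5)
df2 <- data.frame(t1 = c(20080103, 20080104, 20080105, 20080108, 20080109),
                  t2 = c(20080110, 20080111, 20080112, 20080113, 20080114),
                  B  = 1:5)
l <- list(df1 = df1, df2 = df2)

# Hypothetical helper: the inner lapply replaces the for loop.
convert_dates <- function(dfrm, periods, fmt) {
  # Keep only the requested columns that exist in this data.frame.
  cols <- intersect(periods, colnames(dfrm))
  if (length(cols) > 0) {
    dfrm[cols] <- lapply(dfrm[cols], function(col)
      as.Date(as.character(col), format = fmt))
  }
  dfrm
}

# The outer lapply walks the list of data.frames.
l2 <- lapply(l, convert_dates, periods = c("t1", "t2"), fmt = "%Y%m%d")
str(l2)
```

Whether this or the for loop reads better is mostly a matter of taste.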
If you're comfortable relying on the naming convention for the column headers, then #4 might be your answer. If not (or you aren't comfortable with regular expressions) but you are confident that the non-date columns will not be misconstrued as dates, then #2 or #3 will work well, too.
Upvotes: 3