user2943039
user2943039

Reputation: 73

Remove NA columns in a list of dataframes

I am having some trouble cleaning data that I imported from Excel with readxl. readxl created a large list of objects with classes = c('data.frame', tbl_df, tbl) (I would also like to know about why/how it has multiple classes assigned to it). Each of those objects is one of the sheets in the original Excel workbook. The problem is that each of those objects (sheets) may have many columns entirely filled with NAs. I have scanned through stackoverflow and found some similar problems and tried to apply the given solutions like here and here (the first one is the most like my problem). However when I try this:

lapply(x, function(y) y[, !is.na(y)])

I get the following error:

Error in `[.data.frame`(y, , !is.na(y)) : undefined columns selected

I've also tried this:

lapply(x, function(y) y[!is.na(y)]

but it reduces all of my dataframes to only the first column. I think I know it's something to do with my dataframe-within-list syntax. I've experimented with different iterations of y[[]][] and even recently found this interesting pattern in lapply: lapply(x, "[[", y), but couldn't make it work.

Here are the first two objects in my list of dataframes (any hints on how to be more efficient in dput-ing this data are also appreciated). As you can see, the first object has no NA columns, whereas the second has 5 NA columns. I would like to remove those 5 NA columns, but do so for all objects in my list.

Any help is greatly appreciated!

dput(head(x[[1]]))
structure(list(Date = structure(c(1305504000, 1305504000, 1305504000, 
1305504000, 1305504000, 1305504000), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), Time = structure(c(-2209121912, -2209121612, 
-2209121312, -2209121012, -2209120712, -2209120412), class = c("POSIXct", 
"POSIXt"), tzone = "UTC"), Level = c(106.9038, 106.9059, 106.89, 
106.9121, 106.8522, 106.8813), Temperature = c(6.176, 6.173, 
6.172, 6.168, 6.166, 6.165)), .Names = c("Date", "Time", "Level", 
"Temperature"), row.names = c(NA, 6L), class = c("tbl_df", "tbl", 
"data.frame"))

dput(head(x[[2]]))
structure(list(Date = structure(c(1305504000, 1305504000, 1305504000, 
1305504000, 1305504000, 1305504000), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), Time = structure(c(-2209121988, -2209121688, 
-2209121388, -2209121088, -2209120788, -2209120488), class = c("POSIXct", 
"POSIXt"), tzone = "UTC"), LEVEL = c(117.5149, 117.511, 117.5031, 
117.5272, 117.4523, 117.4524), TEMPERATURE = c(5.661, 5.651, 
5.645, 5.644, 5.644, 5.645), `NA` = c(NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_), `NA` = c(NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_), `NA` = c(NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_), `NA` = c(NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_), `NA` = c(NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_)), .Names = c("Date", "Time", "LEVEL", 
"TEMPERATURE", NA, NA, NA, NA, NA), row.names = c(NA, 6L), class =    
c("tbl_df", "tbl", "data.frame"))

Upvotes: 1

Views: 1472

Answers (1)

DunderChief
DunderChief

Reputation: 726

How about this:

lapply(df_list, function(df) df[, colSums(is.na(df)) == 0])

Or maybe:

lapply(df_list, function(df) df[, colSums(is.na(df)) < nrow(df)])

if you want to allow some, but not all rows to be NA

Upvotes: 1

Related Questions