naeum

Reputation: 45

Subsetting columns in different positions and with different names in a large list of lists with purrr

I have a large list of lists. There are 46 elements in "output". Each element is a tibble with a different number of rows and columns. My immediate goal is to subset a specific column from each element.

This is str(output) of the first two lists to give you an idea of the data.

> str(output)
List of 46
 $ Brain                          :Classes ‘tbl_df’, ‘tbl’ and 'data.frame':    6108 obs. of  8 variables:
 ..$ p_val    : chr [1:6108] "0" "1.60383253411205E-274" "0" "0" ...
 ..$ avg_diff : num [1:6108] 1.71 1.7 1.68 1.6 1.58 ...
 ..$ pct.1    : num [1:6108] 0.998 0.808 0.879 0.885 0.923 0.905 0.951 0.957 0.619 0.985 ...
 ..$ pct.2    : num [1:6108] 0.677 0.227 0.273 0.323 0.36 0.384 0.401 0.444 0.152 0.539 ...
 ..$ cluster  : num [1:6108] 1 1 1 1 1 1 1 1 1 1 ...
 ..$ gene     : chr [1:6108] "Plp1" "Mal" "Ermn" "Stmn4" ...
 ..$ X__1     : logi [1:6108] NA NA NA NA NA NA ...
 ..$ Cell Type: chr [1:6108] "Myelinating oligodendrocyte" NA NA NA ...
$ Bladder                        :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 4656 obs. of  8 variables:
 ..$ p_val    : num [1:4656] 0.00 1.17e-233 2.85e-276 0.00 0.00 ...
 ..$ avg_diff : num [1:4656] 2.41 2.23 2.04 2.01 1.98 ...
 ..$ pct.1    : num [1:4656] 0.833 0.612 0.855 0.987 1 0.951 0.711 0.544 0.683 0.516 ...
 ..$ pct.2    : num [1:4656] 0.074 0.048 0.191 0.373 0.906 0.217 0.105 0.044 0.177 0.106 ...
 ..$ cluster  : num [1:4656] 1 1 1 1 1 1 1 1 1 1 ...
 ..$ gene     : chr [1:4656] "Dpt" "Gas1" "Cxcl12" "Lum" ...
 ..$ X__1     : logi [1:4656] NA NA NA NA NA NA ...
 ..$ Cell Type: chr [1:4656] "Stromal cell_Dpt high" NA NA NA ...

Since the list is made up of a large number of lists, I have been trying to write code that iterates over them instead of handling each one by hand. This hasn't been successful.

  1. I can achieve this manually, or list by list, but I haven't been successful in finding an iterative way of doing this.

    x <- data.frame(output$Brain, stringsAsFactors = FALSE)
    tmp.list <- x$Cell.Type
    tmp.output <- purrr::discard(tmp.list, is.na)
    x <- subset(x, Cell.Type %in% tmp.output)
    

This gives me the output that I want: the rows where the "Cell.Type" column has a non-NA value.

  2. I got as far as the code below, which gets the 8th column of each list (the "Cell.Type" column).

    lapply(output, "[", , 8)
    

But here I found that neither the name nor the position of the "Cell.Type" column is consistent across the lists. This means I cannot use lapply to subset the 8th column, because in some lists the column sits elsewhere, for example in the 9th position.
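One way to see how inconsistent the column is, sketched here in base R: look up its position by a case-insensitive name search instead of assuming position 8. The list `output_toy`, its column layout, and the search string "cell" are made up for illustration and stand in for the real `output`.

```r
# Toy stand-in for `output` (hypothetical data; the real list has 46 tibbles).
output_toy <- list(
  Brain   = data.frame(gene = c("Plp1", "Mal"),
                       Cell.Type = c("Myelinating oligodendrocyte", NA),
                       stringsAsFactors = FALSE),
  Bladder = data.frame(celltyppe = c("Stromal cell_Dpt high", NA),
                       gene = c("Dpt", "Gas1"),
                       stringsAsFactors = FALSE)
)

# Position of the column whose name contains "cell", ignoring case.
positions <- sapply(output_toy, function(df) {
  grep("cell", names(df), ignore.case = TRUE)
})

# The same search with value = TRUE returns the names instead of positions.
col_names <- sapply(output_toy, function(df) {
  grep("cell", names(df), ignore.case = TRUE, value = TRUE)
})

print(positions)
print(col_names)
```

If any element has zero or several matching columns, `sapply()` returns a list rather than a vector, which is itself a quick signal that the pattern needs tightening.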

  3. I tried the code below, but it fails with an error.

    lapply(output, "[", , c('Cell.Type', 'celltyppe'))
    #Error: Column `celltyppe` not found
    #Call `rlang::last_error()` to see a backtrace
    

Essentially, from my "output" list, I want to subset either columns "Cell.Type" or "celltyppe" from each of the 46 lists to create a new list with 46 lists of just a single column of values. Then I want to drop all rows with NA.
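The goal described above could be sketched in base R roughly as follows. This is only an illustration: `output_toy` is made-up data, and the vector `candidates` assumes that every element contains exactly one of the listed spellings.

```r
# Toy stand-in for `output` (hypothetical data).
output_toy <- list(
  Brain   = data.frame(gene = c("Plp1", "Mal", "Ermn"),
                       Cell.Type = c("Myelinating oligodendrocyte", NA, NA),
                       stringsAsFactors = FALSE),
  Bladder = data.frame(celltyppe = c("Stromal cell_Dpt high", NA),
                       gene = c("Dpt", "Gas1"),
                       stringsAsFactors = FALSE)
)

# Assumed spellings of the cell-type column.
candidates <- c("Cell.Type", "celltyppe")

cell_types <- lapply(output_toy, function(df) {
  # intersect() keeps only the candidate names this data frame actually has,
  # so a spelling that is absent no longer triggers "column not found".
  hit <- intersect(candidates, names(df))
  col <- df[, hit, drop = FALSE]                    # one-column data frame
  col[stats::complete.cases(col), , drop = FALSE]   # keep rows without NA
})

str(cell_types)
```

Selecting by `intersect()` rather than by a fixed name vector is what avoids the error from step 3: each data frame is only asked for names it actually contains.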

I would like to perform this using some sort of loop.

At the moment I have not had much success. lapply seems able to extract columns from each list iteratively, but I am having difficulty subsetting columns by name.

Once I can do this, I then want to create a loop that can subset only rows without NA.


FINAL CODE

This is the final code I used, and it produces exactly what I had hoped for. The first line maps over each element of the large list. The second line selects the columns of each element whose names contain "ell" (Cell type, Cell Type, or celltyppe). The last line removes any rows containing NA.

    purrr::map(output, ~ .x %>%          # %>% needs dplyr or magrittr attached
        dplyr::select(dplyr::matches("ell")) %>%
        na.omit())
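Because "ell" is a short pattern, it may be worth checking that it picks up exactly one column per element before relying on it. The sketch below does this in base R on column names only. The `Brain` names are taken from the `str()` output above; the `Other` entry is a made-up example of the alternative spelling.

```r
# Column names per element: Brain as shown by str(output), Other is hypothetical.
col_names <- list(
  Brain = c("p_val", "avg_diff", "pct.1", "pct.2",
            "cluster", "gene", "X__1", "Cell Type"),
  Other = c("gene", "cluster", "celltyppe")
)

# Names containing "ell", ignoring case (matches() also ignores case by default).
hits <- lapply(col_names, function(nm) {
  nm[grepl("ell", nm, ignore.case = TRUE)]
})

# Number of matching columns per element; anything other than 1 needs a look.
n_hits <- lengths(hits)
print(hits)
print(n_hits)
```

On the real data, the same check could be run with `lapply(output, names)` in place of the hand-written `col_names`.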

Upvotes: 1

Views: 63

Answers (1)

akrun

Reputation: 887531

We can use an anonymous function:

    lapply(output, function(x) na.omit(x[grep("(?i)Cell\\.?(?i)Typp?e", names(x))]))
    #[[1]]
    #  Cell.Type
    #1         1
    #2         2
    #3         3
    #4         4
    #5         5

    #[[2]]
    #  celltyppe
    #1         7
    #2         8
    #3         9
    #4        10
    #5        11

Also with purrr

    library(tidyverse)
    map(output, ~ .x %>%
                   select(matches("(?i)Cell\\.?(?i)Typp?e")) %>%
                   na.omit())

data

    output <- list(data.frame(Cell.Type = 1:5, col1 = 6:10, col2 = 11:15),
                   data.frame(coln = 1:5, celltyppe = 7:11))

Upvotes: 1
