Steve
Steve

Reputation: 333

R data.table, select columns with no NA

I have a table of stock prices here:

https://drive.google.com/file/d/1S666wiCzf-8MfgugN3IZOqCiM7tNPFh9/view?usp=sharing

Some columns have NA's because the company does not exist (until later dates), or the company folded.

What I want to do is: select columns that has no NA's. I use data.table because it is faster. Here are my working codes:

example <- fread(file = "example.csv", key = "date")
example_select <- example[, 
                          lapply(.SD, 
                                 function(x) not(sum(is.na(x) > 0)))
                          ] %>%
  as.logical(.)
example[, ..example_select]

Is there better (less lines) code to do the same? Thank you!

Upvotes: 6

Views: 1581

Answers (5)

Marek
Marek

Reputation: 50773

In the ?data.table help page, about the .SDcols argument you can find that you can use the predicate function or its negation, so simplest solution would be:

example[, .SD, .SDcols = !anyNA]

Upvotes: 5

M&#229;nsT
M&#229;nsT

Reputation: 974

There are lots of ways you could do this. Here's how I usually do it - a data.table approach without lapply:

example[, .SD, .SDcols = colSums(is.na(example)) == 0]

Upvotes: 4

Ronak Shah
Ronak Shah

Reputation: 389275

Using Filter :

library(data.table)
Filter(function(x) all(!is.na(x)), fread('example.csv'))

#            date   ACU   ACY   AE  AEF     AIM   AIRI  AMS   APT
#   1: 2001-01-02  2.75  4.75 14.4 8.44 2376.00 250.00 2.50  1.06
#   2: 2001-01-03  2.75  4.50 14.5 9.00 2409.00 250.00 2.50  1.12
#   3: 2001-01-04  2.75  4.50 14.1 8.88 2508.00 250.00 2.50  1.06
#   4: 2001-01-05  2.38  4.50 14.1 8.88 2475.00 250.00 2.25  1.12
#   5: 2001-01-08  2.56  4.75 14.3 8.75 2376.00 250.00 2.38  1.06
#  ---                                                           
#5072: 2021-03-02 36.95 10.59 28.1 8.77    2.34   1.61 2.48 14.33
#5073: 2021-03-03 38.40 10.00 30.1 8.78    2.26   1.57 2.47 12.92
#5074: 2021-03-04 37.90  8.03 30.8 8.63    2.09   1.44 2.27 12.44
#5075: 2021-03-05 35.68  8.13 31.5 8.70    2.05   1.48 2.35 12.45
#5076: 2021-03-08 37.87  8.22 31.9 8.59    2.01   1.52 2.47 12.15

#          ARMP   ASXC
#   1: 4.90e+05 178.75
#   2: 4.72e+05 192.97
#   3: 5.42e+05 300.62
#   4: 5.86e+05 300.62
#   5: 6.39e+05 276.25
#  ---                
#5072: 5.67e+00   3.92
#5073: 5.58e+00   4.54
#5074: 5.15e+00   4.08
#5075: 4.49e+00   3.81
#5076: 4.73e+00   4.15

Upvotes: 0

Sinh Nguyen
Sinh Nguyen

Reputation: 4497

An answer using tidyverse packages

library(readr)
library(dplyr)
library(purrr)

data <- read_csv("~/Downloads/example.csv")
map2_dfc(data, names(data), .f = function(x, y) { 
  column <- tibble("{y}" := x)
  if(any(is.na(column)))
    return(NULL)
  else
    return(column)
  })

Output

# A tibble: 5,076 x 11
   date         ACU   ACY    AE   AEF   AIM  AIRI   AMS   APT   ARMP  ASXC
   <date>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>
 1 2001-01-02  2.75  4.75  14.4  8.44  2376   250  2.5   1.06 490000  179.
 2 2001-01-03  2.75  4.5   14.5  9     2409   250  2.5   1.12 472500  193.
 3 2001-01-04  2.75  4.5   14.1  8.88  2508   250  2.5   1.06 542500  301.
 4 2001-01-05  2.38  4.5   14.1  8.88  2475   250  2.25  1.12 586250  301.
 5 2001-01-08  2.56  4.75  14.3  8.75  2376   250  2.38  1.06 638750  276.
 6 2001-01-09  2.56  4.75  14.3  8.88  2409   250  2.38  1.06 568750  264.
 7 2001-01-10  2.56  5.5   14.5  8.69  2310   300  2.12  1.12 586250  274.
 8 2001-01-11  2.69  5.25  14.4  8.69  2310   300  2.25  1.19 564375  333.
 9 2001-01-12  2.75  4.81  14.6  8.75  2541   275  2     1.38 564375  370.
10 2001-01-16  2.75  4.88  14.9  8.94  2772   300  2.12  1.62 595000  358.
# … with 5,066 more rows

Upvotes: -1

Waldi
Waldi

Reputation: 41260

Try:

example[,lapply(.SD, function(x) {if(anyNA(x)) {NULL} else {x}} )]

Upvotes: 2

Related Questions