nicholas

Reputation: 983

How to specify column types with abbreviations when skipping columns with read_csv

I would like to read in selected columns from a CSV file, using abbreviations supported by the cols function in the readr package. However, when I skip columns, readr tries to guess the column type, rather than using my specification, unless I specify the columns by name or set a default.

Here's a reproducible example:

library(tidyverse)

out <- tibble(a = c(1234, 5678),
       b = c(9876, 5432),
       c = c(4321, 8901))

write_csv(out, "test.csv")

test <- read_csv("test.csv",
                 col_select = c(a, c),
                 col_types = "cc")

typeof(test$c)
#> [1] "double"

I can get the correct specification by explicitly indicating the column name:

test2 <- read_csv("test.csv",
                 col_select = c(a, c),
                 col_types = c(a = "c", c = "c"))
typeof(test2$c)
#> [1] "character"

I can also get the correct specification by setting character as the default, as suggested in this Q&A. But I'm wondering if there is a way to get the correct specification using the abbreviation "cc" or -- alternatively -- how to generate an abbreviation string based on the columns that were skipped. My real use case involves a large number of skipped columns, so I don't want to use - or _ to specify the skipped columns.

Upvotes: 0

Views: 709

Answers (2)

alejandro_hagan

Reputation: 1003

Sorry, I've rewritten my earlier answer to be clearer, based on my understanding of what you are asking.

If you want to see the col_types for the columns in your CSV file before any skipping or manual changes, the easiest approach is to call spec_csv() on the file. It generates a column specification showing how read_csv() would classify each column.

From there you can copy, paste and edit that specification into your col_types argument to bring in only the columns and column types that you want. That can be done using the cols_only() function instead of cols().

spec_csv("test.csv")

This prints the following to the console:

cols(
  a = col_double(),
  b = col_double(),
  c = col_double()
)

The output tells you what the default readr column types would be. (PS: spec_csv() takes the same arguments as read_csv(), so you can, for example, set guess_max to increase the number of rows used to guess the column types.)

# Manually copied and pasted the above output, changed the default types to the desired type, and deleted the columns I didn't want

read_csv("test.csv",
         col_types=cols_only(a = col_character(),
                             c = col_character())
  )

I used the long form (col_character()), but you can use the abbreviation instead, as you indicated earlier.
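As a minimal sketch of the abbreviated form, assuming the same test.csv as in the question (the name test3 is only illustrative), cols_only() accepts the one-letter codes per column name:

```r
library(readr)

# Recreate the example file from the question
write_csv(
  data.frame(a = c(1234, 5678), b = c(9876, 5432), c = c(4321, 8901)),
  "test.csv"
)

# cols_only() reads just the named columns; "c" is the abbreviation for character
test3 <- read_csv("test.csv",
                  col_types = cols_only(a = "c", c = "c"))

typeof(test3$c)
#> [1] "character"
```

Because cols_only() drops every column that is not named, no col_select argument is needed here.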

Please let me know if this is what you were asking or if there is any clarity that I can provide.

Upvotes: 1

neilfws

Reputation: 33782

See the documentation for col_types in ?read_csv. You can use _ or - to specify a skipped column:

read_csv("test.csv",
         col_select = c(a, c),
         col_types = "c-c")

Result:

# A tibble: 2 x 2
  a     c    
  <chr> <chr>
1 1234  4321 
2 5678  8901
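If there are too many skipped columns to type the - characters by hand, the compact string could be built from the file header. This is only a sketch under a few assumptions: the names keep, all_cols and spec_string are hypothetical, every kept column should be read as character, and the header is read with n_max = 0 so no data rows are parsed.

```r
library(readr)

# Recreate the example file from the question
write_csv(
  data.frame(a = c(1234, 5678), b = c(9876, 5432), c = c(4321, 8901)),
  "test.csv"
)

# Columns to keep (hypothetical selection; everything else is skipped)
keep <- c("a", "c")

# Read only the header to get all column names in file order
all_cols <- names(read_csv("test.csv", n_max = 0,
                           col_types = cols(.default = "c")))

# One character per column: "c" for kept columns, "-" for skipped ones
spec_string <- paste(ifelse(all_cols %in% keep, "c", "-"), collapse = "")

# The generated string already skips unwanted columns, so col_select is not needed
test4 <- read_csv("test.csv", col_types = spec_string)
```

For the example file, spec_string should come out as "c-c", matching the string typed by hand above. To mix types, the "c" in ifelse() could be replaced by a lookup of per-column codes.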

Upvotes: 1
