Reputation: 983
I would like to read in selected columns from a CSV file, using abbreviations supported by the cols() function in the readr package. However, when I skip columns, readr tries to guess the column type rather than using my specification, unless I specify the columns by name or set a default.
Here's a reproducible example:
library(tidyverse)
out <- tibble(a = c(1234, 5678),
              b = c(9876, 5432),
              c = c(4321, 8901))
write_csv(out, "test.csv")
test <- read_csv("test.csv",
                 col_select = c(a, c),
                 col_types = "cc")
typeof(test$c)
#> [1] "double"
I can get the correct specification by explicitly indicating the column name:
test2 <- read_csv("test.csv",
                  col_select = c(a, c),
                  col_types = c(a = "c", c = "c"))
typeof(test2$c)
#> [1] "character"
I can also get the correct specification by setting character as the default, as suggested in this Q&A. But I'm wondering if there is a way to get the correct specification using the abbreviation "cc" or, alternatively, how to generate an abbreviation string based on the columns that were skipped. My real use case involves a large number of skipped columns, so I don't want to use - or _ to specify the skipped columns.
Upvotes: 0
Views: 709
Reputation: 1003
Sorry, I've rewritten my earlier answer to be clearer, based on my understanding of what you are asking.
If you want to get the col_types for the columns in your CSV file prior to any skipping or manual changes, the easiest thing to do is to call spec_csv() on your file. It generates a column specification that shows how read_csv() will classify each column.
From there you can copy, paste and edit that specification into your col_types argument to bring in only the columns and column types that you want. That can be done using cols_only() instead of cols().
spec_csv("test.csv")
This prints the following to the console:
cols(
a = col_double(),
b = col_double(),
c = col_double()
)
The output tells you what the default column types would be. (PS: spec_csv() accepts the same arguments as read_csv(), so you can, for example, raise guess_max to base the guessed column types on more rows.)
# Manually copied and pasted the above output, changed the default types
# to the desired type, and deleted the columns I didn't want
read_csv("test.csv",
         col_types = cols_only(a = col_character(),
                               c = col_character()))
I used the long form (col_character()), but you can instead use the abbreviation ("c"), as you already indicated earlier.
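As a brief sketch of the abbreviated form mentioned above: cols_only() accepts the same one-letter codes as the compact string, supplied by column name. The file is recreated here only so the snippet runs on its own, and the object name test_abbrev is just for illustration.

```r
library(readr)

# Recreate the example file from the question so the snippet is self-contained
write_csv(data.frame(a = c(1234, 5678),
                     b = c(9876, 5432),
                     c = c(4321, 8901)),
          "test.csv")

# "c" is the abbreviation for col_character();
# columns not named in cols_only() are dropped
test_abbrev <- read_csv("test.csv",
                        col_types = cols_only(a = "c", c = "c"))

names(test_abbrev)
#> [1] "a" "c"
typeof(test_abbrev$c)
#> [1] "character"
```

Because the columns are matched by name here, the number of skipped columns in between does not matter.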
Please let me know if this is what you were asking, or if there is anything I can clarify.
Upvotes: 1
Reputation: 33782
See the documentation for col_types in ?read_csv. You can use _ or - to specify a skipped column:
read_csv("test.csv",
         col_select = c(a, c),
         col_types = "c-c")
Result:
# A tibble: 2 x 2
  a     c
  <chr> <chr>
1 1234  4321
2 5678  8901
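With many skipped columns, typing the - characters by hand gets tedious. The compact string appears to be matched by position against the columns in the file, which would explain why "cc" in the question covered a and b and left c to be guessed. The following is a minimal sketch of building that string from the header instead. make_col_spec() is a hypothetical helper, not part of readr, and it assumes a readr version (2.0 or later) that supports show_col_types and that every kept column should get the same type.

```r
library(readr)

# Recreate the example file from the question so the snippet is self-contained
write_csv(data.frame(a = c(1234, 5678),
                     b = c(9876, 5432),
                     c = c(4321, 8901)),
          "test.csv")

# Hypothetical helper (not a readr function): returns a compact col_types
# string with `type` for each column in `keep` and "-" for every other column
make_col_spec <- function(file, keep, type = "c") {
  # n_max = 0 reads the header only, so this stays cheap for large files
  header <- names(read_csv(file, n_max = 0, show_col_types = FALSE))
  paste(ifelse(header %in% keep, type, "-"), collapse = "")
}

spec <- make_col_spec("test.csv", keep = c("a", "c"))
spec
#> [1] "c-c"

# The "-" entries already drop column b, so col_select is not needed here
test3 <- read_csv("test.csv", col_types = spec)
typeof(test3$c)
#> [1] "character"
```

If the kept columns need different types, the same idea could be extended by passing a named vector of abbreviations and looking each header name up in it.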
Upvotes: 1