Reputation: 1071
I have a .csv datafile with many columns. Unfortunately, the string values are not quoted (i.e., apples instead of "apples"). When I use read_csv from the readr package, the string values are imported as characters:
library(readr)
mydat = data.frame(first = letters, numbers = 1:26, second = sample(letters, 26))
write.csv(mydat, "mydat.csv", quote = FALSE, row.names = FALSE)
read_csv("mydat.csv")
results in:
Parsed with column specification:
cols(
first = col_character(),
numbers = col_integer(),
second = col_character()
)
# A tibble: 26 x 3
first numbers second
<chr> <int> <chr>
1 a 1 r
2 b 2 n
3 c 3 m
4 d 4 z
5 e 5 p
6 f 6 j
7 g 7 u
8 h 8 l
9 i 9 e
10 j 10 h
# ... with 16 more rows
Is there a way to force read_csv to import the string values as factors instead of characters?
Importantly, my datafile has so many columns (string and numeric variables) that, AFAIK, there is no way to make this work by providing column specifications with the col_types argument.
Alternative solutions (e.g. using read.csv to import the data, or dplyr code to change all character variables in a dataframe to factors) are appreciated too.
Update: I learned that whether the values in the csv file are quoted makes no difference for read.csv or read_csv. read.csv will import these values as factors; read_csv will import them as characters. I prefer to use read_csv because it's considerably faster than read.csv.
Upvotes: 8
Views: 8093
Reputation: 10955
Unfortunately, read_csv has no equivalent of stringsAsFactors = TRUE, and I think col_types= requires naming specific columns unless you resort to more trickery.
A straightforward solution is to convert strings to factors, using across in dplyr instead of the superseded mutate_if:
df %>% mutate(across(where(is.character), factor))
By default, base R's factor() infers the levels from the data, in sorted order, unless you specify them. where() can also handle more complicated predicates, and you can use tidyselect for a lot more control.
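For reference, here is a minimal end-to-end sketch of the read-then-convert approach. The temporary file and its three rows are made up for illustration, and col_types = cols() is used only to silence the column-specification message while still letting readr guess the types.

```r
library(readr)
library(dplyr)

# Hypothetical sample data written to a temporary file (unquoted strings)
path <- tempfile(fileext = ".csv")
writeLines(c("first,numbers,second", "a,1,r", "b,2,n", "c,3,m"), path)

# Let readr guess the column types; strings come back as character
dat <- read_csv(path, col_types = cols())

# Convert every character column to a factor, leaving other columns alone
dat_fct <- dat %>% mutate(across(where(is.character), factor))

# Inspect the resulting column classes
sapply(dat_fct, class)
```

This keeps the speed of read_csv for the import itself; the conversion is a separate pass over the character columns only.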
Feature request: registration of custom column types and parsers
Feature request: flexible col_types specification
Upvotes: 1
Reputation: 19413
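I like alistaire's mutate_if() solution in the comments above, but for completeness, there is another solution worth mentioning. You can call unclass() and pass the result to data.frame(), which forces every column to be processed again and converts strings to factors. Since R 4.0.0 the default is stringsAsFactors = FALSE, so set it explicitly. You'll see this in a lot of code that uses readr.
df <- data.frame(unclass(df), stringsAsFactors = TRUE)
or
df <- df %>% unclass %>% data.frame(stringsAsFactors = TRUE)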
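A small base R sketch of the unclass() idea follows. The data frame is a made-up stand-in for the imported data. Note that data.frame() only converts strings to factors by default in R versions before 4.0.0, so the sketch passes stringsAsFactors = TRUE explicitly.

```r
# Hypothetical character/numeric data standing in for the imported data
df <- data.frame(first = c("a", "b", "c"),
                 numbers = 1:3,
                 second = c("r", "n", "m"),
                 stringsAsFactors = FALSE)

# unclass() drops the data.frame class, leaving a plain list of columns;
# data.frame() then rebuilds the object and converts strings when asked to
df_fct <- data.frame(unclass(df), stringsAsFactors = TRUE)

# Inspect the rebuilt structure
str(df_fct)
```

The result is a plain data.frame rather than a tibble, which may or may not matter for downstream code.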
Upvotes: 3
Reputation: 71
This function uses dplyr to convert all character columns in a tbl_df or data frame to factors:
char.to.factors <- function(df){
  # Takes a tbl_df or data frame and returns it with every character column converted to a factor.
  # mutate_each_() and funs() are deprecated in current dplyr, so across() with all_of() is used instead.
  char.cols = names(df)[vapply(df, is.character, logical(1))]
  dplyr::mutate(df, dplyr::across(dplyr::all_of(char.cols), as.factor))
}
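If you would rather not depend on dplyr, the same idea can be sketched in base R. The function name chars_to_factors_base and the example data are hypothetical, introduced here only for illustration.

```r
# Hypothetical base R counterpart: convert character columns to factors
chars_to_factors_base <- function(df) {
  # Logical vector marking which columns are character
  is_chr <- vapply(df, is.character, logical(1))
  # Only assign when there is something to convert
  if (any(is_chr)) {
    df[is_chr] <- lapply(df[is_chr], as.factor)
  }
  df
}

# Made-up example data
example <- data.frame(first = c("a", "b"),
                      numbers = 1:2,
                      stringsAsFactors = FALSE)
converted <- chars_to_factors_base(example)

# Inspect the resulting column classes
sapply(converted, class)
```

Because it only touches the columns flagged by is.character, numeric columns pass through unchanged.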
Upvotes: 2