djhurio

Reputation: 5536

Read a UTF-8 text file with BOM

I have a text file with a byte order mark (U+FEFF) at the beginning. I am trying to read the file in R. Is it possible to skip the byte order mark when reading?

The function fread (from the data.table package) reads the file, but adds ļ»æ at the beginning of the first variable name:

> names(frame_pers)[1]
[1] "ļ»æreg_date"

The same happens with the read.csv function.

For now I have written a function that removes the BOM from the first column name, but I believe there should be a way to strip the BOM automatically.

remove.BOM <- function(x) setnames(x, 1, substring(names(x)[1], 4))

> names(frame_pers)[1]
[1] "ļ»æreg_date"
> remove.BOM(frame_pers)
> names(frame_pers)[1]
[1] "reg_date"
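
A slightly more defensive variant of the helper above, as a sketch only: it strips the first three bytes only when they really are the UTF-8 BOM (EF BB BF), so a name without a BOM is left alone. The name strip_bom_name is hypothetical, and the sketch assumes the BOM bytes are still present unchanged in the name, as in the fread output shown above.

```r
# Hypothetical helper: drop a leading UTF-8 BOM (bytes EF BB BF) from one string.
strip_bom_name <- function(name) {
  bom   <- as.raw(c(0xEF, 0xBB, 0xBF))
  bytes <- charToRaw(name)
  if (length(bytes) >= 3 && identical(bytes[1:3], bom)) {
    out <- rawToChar(bytes[-(1:3)])
    Encoding(out) <- Encoding(name)  # keep the original encoding mark
    return(out)
  }
  name  # no BOM found, return the name unchanged
}

# Build a name that starts with the raw BOM bytes, then clean it.
with_bom <- rawToChar(c(as.raw(c(0xEF, 0xBB, 0xBF)), charToRaw("reg_date")))
cleaned  <- strip_bom_name(with_bom)
plain    <- strip_bom_name("reg_date")
```

With data.table loaded, it could be applied as setnames(frame_pers, 1, strip_bom_name(names(frame_pers)[1])).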

I am using the native encoding for the R session:

> options("encoding" = "")
> options("encoding")
$encoding
[1] ""

Upvotes: 21

Views: 24447

Answers (3)

Lyan Porto

Reputation: 11

I know it has been 8 years, but I just ran into this problem and came across this question, so this might help. An important detail (mentioned in hadley's answer) is that the argument must be fileEncoding = "UTF-8-BOM", not just encoding = "UTF-8-BOM". The encoding argument works for a few options, but not UTF-8-BOM. Go figure. I found this out here: https://www.johndcook.com/blog/2019/09/07/excel-r-bom/
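
To illustrate the difference, here is a minimal sketch. As far as I understand, fileEncoding is applied to the connection that reads the file (so the input is re-encoded and the BOM can be dropped), whereas encoding only marks the strings after they have been read. The sample file below is made up for the demonstration; the connection is opened by hand with the same encoding to show where the BOM gets removed.

```r
# Create a small sample file that starts with the UTF-8 BOM (EF BB BF).
path <- tempfile(fileext = ".csv")
writeBin(
  c(as.raw(c(0xEF, 0xBB, 0xBF)), charToRaw("reg_date,value\n2014-01-01,1\n")),
  path
)

# Declare the encoding on the connection itself; the BOM is skipped on read.
con <- file(path, open = "rt", encoding = "UTF-8-BOM")
first_line <- readLines(con, n = 1)
close(con)

print(first_line)
```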

Upvotes: 1

MichaelChirico

Reputation: 34763

This was handled between versions 1.9.6 and 1.9.8 with this commit; update your data.table installation to fix this.

Once done, you can just use fread:

fread("file_name.csv")
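
A quick way to check this on your own machine, as a sketch: the sample file is made up, and the guard only assumes that data.table may or may not be installed in a recent enough version.

```r
# Sample CSV with a leading UTF-8 BOM (EF BB BF); contents are illustrative.
path <- tempfile(fileext = ".csv")
writeBin(
  c(as.raw(c(0xEF, 0xBB, 0xBF)),
    charToRaw("reg_date,value\n2014-01-01,1\n2014-01-02,2\n")),
  path
)

has_new_dt <- requireNamespace("data.table", quietly = TRUE) &&
  packageVersion("data.table") >= "1.9.8"

if (has_new_dt) {
  frame_pers <- data.table::fread(path)
  print(names(frame_pers)[1])  # should no longer carry the BOM characters
} else {
  message("data.table >= 1.9.8 is not available; update the package first")
}
```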

Upvotes: 7

hadley

Reputation: 103948

Have you tried read.csv(..., fileEncoding = "UTF-8-BOM")? The help page ?file says:

As from R 3.0.0 the encoding ‘"UTF-8-BOM"’ is accepted and will remove a Byte Order Mark if present (which it often is for files and webpages generated by Microsoft applications).
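
A small self-contained sketch of this, using a made-up sample file (the column names follow the question, the data rows are invented):

```r
# Write a tiny CSV that begins with the UTF-8 BOM bytes EF BB BF.
path <- tempfile(fileext = ".csv")
writeBin(
  c(as.raw(c(0xEF, 0xBB, 0xBF)),
    charToRaw("reg_date,value\n2014-01-01,1\n2014-01-02,2\n")),
  path
)

# Declaring the file encoding as "UTF-8-BOM" removes the mark while reading.
frame_pers <- read.csv(path, fileEncoding = "UTF-8-BOM")
print(names(frame_pers))
```

The first column name should then come back as plain reg_date, with no extra characters in front.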

Upvotes: 30
