I am using R (run through Emacs) to parse text that was downloaded from the internet to my disk. The downloads were made by different machines using different character encodings, and I am having trouble getting regular expressions to match a list of names in the text because of those encoding differences (they are Spanish names, with accented vowels). Any help understanding character encoding in general, and its handling in R in particular, will be much appreciated.
I found good tutorials on encoding for Python here, but nothing helpful yet for R. I also have Jeffrey Friedl's Mastering Regular Expressions book, but haven't found the answer there.
This is a summary of my locale:
> Sys.getlocale()
[1] "LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C"
I begin by reading the text. The first file is in UTF-8, the second in Latin-1 (I have no way of determining this a priori for the thousands of text files that I will be parsing).
info1  <- readLines(con = "~/data/file1.txt", encoding = "UTF-8")   ## correct encoding for file1
info2  <- readLines(con = "~/data/file2.txt", encoding = "latin1")  ## correct encoding for file2
info1w <- readLines(con = "~/data/file1.txt", encoding = "latin1")  ## wrong encoding on purpose
I will look for the name Jiménez Esquivel Laura (I hope the accented e three words back shows on your machine---it does on mine---as do the special characters in the output below).
> info1[grep("nez Esquivel", info1, perl = TRUE)]
[1] "<td width=\"400\" class=\"linkVerde\" >Jiménez Esquivel Laura</a></span></td>"
> info2[grep("nez Esquivel", info2, perl = TRUE)]
[1] "<td width=\"400\" class=\"linkVerde\" >Jiménez Esquivel Laura</a></span></td>"
> info1w[grep("nez Esquivel", info1w, perl = TRUE)]
[1] "<td width=\"400\" class=\"linkVerde\" >JimÃ©nez Esquivel Laura</a></span></td>"
If I were to search for the accented e (which I removed on purpose from the grep pattern in this example), it would not hit the name in info1w, as the quick check below shows. How can I proceed with thousands of files whose character encoding I do not know a priori?
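A sketch of that check, using the objects defined above (the expected values are my reading of the two encodings involved):

any(grepl("Jiménez", info1))   ## expected TRUE: encoding declared correctly
any(grepl("Jiménez", info1w))  ## expected FALSE: the UTF-8 bytes for é were marked latin1 and read back as Ã©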
Thank you Roman Luštrik for an example that inspired a solution. I picked a name that should appear in every roster file and contains a special character (Martínez). I then read every file once per candidate encoding, searching for that name; a miss indicates the file was not read with the right encoding and passes it on to the next loop. Two passes sufficed in my case, but other encodings could be added to the code below.
I <- length(filenames)  ## filenames holds the paths of all files to parse
encod <- rep(NA, I)
for (i in 1:I) {
  info <- readLines(con = filenames[i], encoding = "UTF-8")
  ## the name should appear in every file
  encod[i] <- ifelse(any(grepl("Martínez", info, perl = TRUE)), "UTF-8", NA)
}
for (i in which(is.na(encod))) {
  info <- readLines(con = filenames[i], encoding = "latin1")
  ## the name should appear in every file
  encod[i] <- ifelse(any(grepl("Martínez", info, perl = TRUE)), "latin1", NA)
}
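If more encodings need to be tried, the same idea generalizes to a loop over a vector of candidates (a sketch; any encodings beyond UTF-8 and latin1 would be placeholders to fill in):

encodings <- c("UTF-8", "latin1")  ## extend with further candidates as needed
encod <- rep(NA, length(filenames))
for (enc in encodings) {
  for (i in which(is.na(encod))) {  ## only revisit files still unidentified
    info <- readLines(con = filenames[i], encoding = enc)
    if (any(grepl("Martínez", info, perl = TRUE))) encod[i] <- enc
  }
}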
When object encod has no NAs left, every file's encoding has been identified:
> which(is.na(encod))
integer(0)
Finally, object encod supplies the correct character encoding for re-reading each file:
for (i in 1:I) {
  info <- readLines(con = filenames[i], encoding = encod[i])
  ## ... parse info here; it is overwritten on the next iteration
}
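If the lines are needed after the loop, a small variant (the texts object is my own naming) keeps them in a list; with the encodings declared correctly, accented patterns now match in every file:

texts <- vector("list", I)
for (i in 1:I) {
  texts[[i]] <- readLines(con = filenames[i], encoding = encod[i])
}
## e.g., every line mentioning the full accented name, across all files
hits <- unlist(lapply(texts, function(x) grep("Jiménez Esquivel", x, value = TRUE)))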