jroberayalas

Reputation: 969

How to clean string columns (with capital letters and accents) in R?

I'm working with the following dataset, which contains average temperatures in each of the 32 states of Mexico.

library(data.table)

# Read data from website
col.names <- c('ENTIDAD', 'ANYO', 'ENERO', 'FEBRERO', 'MARZO', 'ABRIL', 'MAYO', 'JUNIO',
           'JULIO', 'AGOSTO', 'SEPTIEMBRE', 'OCTUBRE', 'NOVIEMBRE', 'DICIEMBRE', 'UNIDAD')
temperature <- fread('http://201.116.60.46/DatosAbiertos/Temperatura_promedio.csv',
                 col.names = col.names)

The column ENTIDAD has the 32 names of the states. However, all the names appear in capital letters, and there are some weird numbers that replace the letters which are supposed to have accents:

unique(temperature$ENTIDAD)
 [1] "AGUASCALIENTES"                  "BAJA CALIFORNIA"                
 [3] "BAJA CALIFORNIA SUR"             "CAMPECHE"                       
 [5] "COAHUILA  DE ZARAGOZA"           "COLIMA"                         
 [7] "CHIAPAS"                         "CHIHUAHUA"                      
 [9] "DISTRITO FEDERAL"                "DURANGO"                        
[11] "GUANAJUATO"                      "GUERRERO"                       
[13] "HIDALGO"                         "JALISCO"                        
[15] "M\311XICO"                       "MICHOAC\301N DE OCAMPO"         
[17] "MORELOS"                         "NAYARIT"                        
[19] "NUEVO LE\323N"                   "OAXACA"                         
[21] "PUEBLA"                          "QUER\311TARO"                   
[23] "QUINTANA ROO"                    "SAN LUIS POTOS\315"             
[25] "SINALOA"                         "SONORA"                         
[27] "TABASCO"                         "TAMAULIPAS"                     
[29] "TLAXCALA"                        "VERACRUZ DE IGNACIO DE LA LLAVE"
[31] "YUCAT\301N"                      "ZACATECAS" 

Is there a simple way to replace each of these with the following strings?

states <- c('Aguascalientes',
'Baja California',
'Baja California Sur',
'Campeche',
'Chiapas',
'Chihuahua',
'Coahuila',
'Colima',
'DF',
'Durango',
'Guanajuato',
'Guerrero',
'Hidalgo',
'Jalisco',
'Michoacan',
'Morelos',
'Mexico',
'Nayarit',
'Nuevo Leon',
'Oaxaca',
'Puebla',
'Queretaro',
'Quintana Roo',
'San Luis Potosi',
'Sinaloa',
'Sonora',
'Tabasco',
'Tamaulipas',
'Tlaxcala',
'Veracruz',
'Yucatan',
'Zacatecas')

Upvotes: 0

Views: 283

Answers (3)

Warner

Reputation: 1363

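It appears that you already have the replacement names for the values in unique(temperature$ENTIDAD). If so, you can use mapvalues from the plyr package to swap them in. Note that from and to are matched by position, so states must be in the same order as unique(temperature$ENTIDAD). The vector in the question is ordered differently (for example, Chiapas comes before Coahuila there, and Mexico after Morelos), so reorder it first:

library(plyr)
temperature$ENTIDAD <- mapvalues(temperature$ENTIDAD, from = unique(temperature$ENTIDAD), to = states)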
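Since a positional mapping is easy to get wrong, one alternative is a named lookup vector in base R, where each raw name is paired explicitly with its replacement. This is only a sketch: the raw vector and the three-entry lookup below are made-up samples for illustration, not the full dataset.

```r
# Hypothetical sample of raw values as they appear in ENTIDAD
raw <- c("COLIMA", "CHIAPAS", "COAHUILA  DE ZARAGOZA", "COLIMA")

# Named vector: names are the raw values, elements are the clean names.
# Order does not matter because lookup is done by name.
lookup <- c("CHIAPAS"               = "Chiapas",
            "COAHUILA  DE ZARAGOZA" = "Coahuila",
            "COLIMA"                = "Colima")

# Index the lookup by the raw values; unname() drops the carried-over names
clean <- unname(lookup[raw])
print(clean)
```

Any raw value missing from the lookup comes back as NA, which makes unmapped names easy to spot.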

Upvotes: 1

baptiste

Reputation: 77116

You can set the encoding (probably better done directly in fread) and use tolower to convert to lower case:

x <- temperature$ENTIDAD
Encoding(x) <- "latin1"
# might also want to convert to utf8
# x <- iconv(x,  "latin1", "UTF-8")
cbind(x, tolower(x))
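Building on the encoding fix, the following sketch goes one step further and produces accent-free, title-case names using only base R. The sample vector is invented to mimic the raw bytes shown in the question, and the chartr accent table only covers the five accented capitals that appear there. It does not shorten names (for example, it keeps "De Ocampo"), so it is a normalisation step rather than a full match to the states vector.

```r
# Sample values mimicking the raw Latin-1 bytes from the question
x <- c("M\311XICO", "MICHOAC\301N DE OCAMPO", "NUEVO LE\323N",
       "COAHUILA  DE ZARAGOZA")

# Declare the bytes as Latin-1, then convert to UTF-8
Encoding(x) <- "latin1"
x <- enc2utf8(x)

# Replace accented capitals with plain ones (old set written as \u escapes)
x <- chartr("\u00c1\u00c9\u00cd\u00d3\u00da", "AEIOU", x)

# Collapse repeated whitespace, e.g. the double space in COAHUILA  DE ZARAGOZA
x <- gsub("\\s+", " ", x)

# Lower-case everything, then capitalise the first letter of each word
x <- tolower(x)
x <- gsub("\\b([a-z])", "\\U\\1", x, perl = TRUE)
print(x)
```

The accents are stripped before the case conversion so that the final regular expression only has to deal with ASCII letters.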

Upvotes: 0

Nate

Reputation: 10671

I think this will solve your problem. Tell fread that the file is Latin-1 encoded when reading it:

temperature <- fread('http://201.116.60.46/DatosAbiertos/Temperatura_promedio.csv',
                 col.names = col.names, encoding = "Latin-1")

Upvotes: 1
