small_lebowski

Reputation: 721

Removing Unicode symbols from column names

I am trying to grab some statistics from fifa.com using the XML package. The import succeeds, but the column names contain Unicode symbols that I want to remove.

This is how I got the data:

library(XML)
url <- "http://www.fifa.com/worldcup/statistics/teams/disciplinary.html"
foulbycountry <- readHTMLTable(url)
foulbycountry1 <- do.call(rbind.data.frame, foulbycountry)

The variable names include two characters that I want to remove. I tried to assign the columns to new objects, but it does not work. For example:

country <- foulbycountry1$Teams▴▾
fouls.committed <- foulbycountry1$Fouls Committed▴▾

which gives me the following output:

> country <- foulbycountry1$Teams▴▾
Error: unexpected input in "country <- foulbycountry1$Teams�"
> fouls.committed <- foulbycountry1$Fouls Committed▴▾
Error: unexpected symbol in "fouls.committed <- foulbycountry1$Fouls Committed"

Is there a way to remove those extra Unicode characters?

Upvotes: 1

Views: 2078

Answers (2)

Matthew Plourde

Reputation: 44614

iconv is one option ...

names(foulbycountry1) <- iconv(names(foulbycountry1), to='ASCII', sub='')
names(foulbycountry1)
# [1] "Teams"                           "Teams"                           "Matches Played"                 
# [4] "Yellow Card"                     "Second yellow card and red card" "Red Cards"                      
# [7] "Fouls Committed"                 "Fouls Suffered\r\n"              "Fouls causing a penalty"    

This removes all non-ASCII characters. One of the column names also ends with a line break (\r\n). To remove that as well, you can use

gsub('\r|\n', '', iconv(names(foulbycountry1), to='ASCII', sub=''))
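For a quick check without hitting the website, here is a minimal sketch of the same two steps on a small made-up data frame. The data frame `df` and its values are hypothetical; only the column names are modelled on the ones in the question. Passing `from = "UTF-8"` is an assumption about how the names are encoded.

```r
# Hypothetical stand-in for the scraped table (values are invented)
df <- data.frame(a = c("Brazil", "Chile"), b = c(14, 9), c = c(10, 12),
                 stringsAsFactors = FALSE)
names(df) <- c("Teams\u25b4\u25be",
               "Fouls Committed\u25b4\u25be",
               "Fouls Suffered\r\n\u25b4\u25be")

# Step 1: drop every byte that cannot be represented in ASCII
clean <- iconv(names(df), from = "UTF-8", to = "ASCII", sub = "")
# Step 2: strip carriage returns and line feeds
clean <- gsub("[\r\n]", "", clean)
names(df) <- clean

# The columns can now be accessed normally
country <- df$Teams
fouls.committed <- df[["Fouls Committed"]]
```

Names that still contain spaces, such as "Fouls Committed", are easiest to reach with `[[ ]]` or with backticks after `$`.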

Upvotes: 2

MrFlick

Reputation: 206167

If you want to keep only the printable ASCII characters in the column names, you can use

names(foulbycountry1) <- gsub("[^\x20-\x7E]","",names(foulbycountry1))

Any ASCII table lists the character codes. Here we specify the bounds of the range as hex values with the \xNN syntax.
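As a rough illustration of what the character class does, the sketch below applies it to two made-up names modelled on the scraped headers (the vector `raw` is hypothetical). It uses 0x20 (space) through 0x7E (tilde) as the printable range, since 0x7F is the DEL control character.

```r
# Hypothetical names resembling the scraped headers
raw <- c("Teams\u25b4\u25be", "Fouls Suffered\r\n\u25b4\u25be")

# Remove everything outside space (0x20) through tilde (0x7E);
# \r (0x0D) and \n (0x0A) fall below 0x20, so they are removed too
clean <- gsub("[^\x20-\x7E]", "", raw)
```

Because R resolves the `\x` escapes while parsing the string, the pattern that reaches the regex engine is the same as `"[^ -~]"`.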

Upvotes: 1
