Reputation: 51
I have copied some data describing cholera cases in regions of Yemen from an online database into a text file. Each region's name is given in both English and Arabic in a single string. I would like to remove the Arabic in R and be left with just the English names.
This is what the English/Arabic strings look like when read into R:
regions <- c("Al Hudaydah الØديدة", "Hajjah Øجة")
I would like to be left with just the English:
"Al Hudaydah" "Hajjah"
I have tried
str_replace_all(regions, "[^[:alnum:]]", "")
(from stringr) and
replace_non_ascii(regions)
(from textclean), but neither gives me what I'm looking for.
Any ideas?
Thanks!
Upvotes: 2
Views: 1074
Reputation: 51
Edit: I have found the solution to my problem. The issue was in how the text file was read in. If it contains Arabic (or presumably any non-Latin script), you need to pass encoding = 'UTF-8' to readLines().
e.g.
txt <- readLines("Arabic_English_script.txt")
returns
"Al Hudaydah الØديدة" "Taizz تعز"
whereas txt <- readLines("Arabic_English_script.txt", encoding = 'UTF-8')
returns
"Al Hudaydah الحديدة" "Taizz تعز"
Once the text has been properly imported, gsub("[^[:alnum:]]", "", txt)
returns
"AlHudaydah" "Taizz"
(Note, it still removes the spaces; a possible fix is sketched below.)
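One way to keep them (a minimal sketch, assuming [:alnum:] skips the Arabic letters in your locale, as it did above) is to allow the space inside the character class and then strip the leftover trailing whitespace with trimws():
trimws(gsub("[^[:alnum:] ]", "", txt))
should return
"Al Hudaydah" "Taizz"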
Upvotes: 1
Reputation: 16090
The simplest approach may be to use gsub directly:
gsub("[^A-Za-z0-9 ]", "", regions)
Upvotes: 1