Learning_how_to_model

Reputation: 51

How to remove Arabic text from string

I have copied some data describing cholera cases in regions of Yemen from an online database into a text file. The names of each region are given in both English and Arabic in a single string. I would like to remove the Arabic in R, and be left with just the English names.

This is what the English/Arabic string looks like when read into R:

regions <- c("Al Hudaydah الحديدة", "Hajjah حجة")

I would like to be left with just the English: "Al Hudaydah" "Hajjah"

I have tried using str_replace_all(regions, "[^[:alnum:]]", "") and replace_non_ascii(regions), but neither gives me what I'm looking for.

Any ideas?

Thanks!

Upvotes: 2

Views: 1074

Answers (2)

Learning_how_to_model

Reputation: 51

Edit: I have found the solution to my problem. The issue was in how the text file was read in. If it contains Arabic (or presumably any other non-Latin script), you need to use encoding = 'UTF-8'

e.g.

txt <- readLines("Arabic_English_script.txt") returns the strings with the Arabic portion garbled (shown as escaped byte sequences rather than Arabic characters),

whereas txt <- readLines("Arabic_English_script.txt", encoding = 'UTF-8') returns

"Al Hudaydah الحديدة" "Taizz تعز"

Once the text has been imported with the correct encoding, gsub("[^[:alnum:]]", "", txt) returns

"AlHudaydah" "Taizz"

(Note that it also removes the spaces; I'm not sure how to fix that.)
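One way to keep the spaces is to remove only characters in the Arabic script by Unicode property, then tidy up the leftover whitespace. This is a sketch assuming the stringr package (not used elsewhere in this edit), whose regex engine (ICU) understands \p{Arabic}:

```r
library(stringr)

regions <- c("Al Hudaydah الحديدة", "Hajjah حجة")

# Delete Arabic-script characters, then collapse/trim the whitespace
# left behind where the Arabic text used to be
english <- str_squish(str_remove_all(regions, "\\p{Arabic}"))
english
#> [1] "Al Hudaydah" "Hajjah"
```

Because this targets the Arabic script specifically, it leaves Latin letters, digits, spaces, and punctuation untouched.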

Upvotes: 1

Hugh

Reputation: 16090

The simplest approach may be to use gsub:

gsub("[^A-Za-z0-9 ]", "", regions)
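Because the character class includes a space, this keeps the word spacing; applied to the strings from the question (wrapping in trimws(), an assumption added here, to drop the trailing space left where the Arabic was):

```r
regions <- c("Al Hudaydah الحديدة", "Hajjah حجة")

# Keep only ASCII letters, digits, and spaces, then trim the
# trailing space left where the Arabic text was removed
trimws(gsub("[^A-Za-z0-9 ]", "", regions))
#> [1] "Al Hudaydah" "Hajjah"
```

Unlike [:alnum:], the explicit A-Za-z0-9 class is locale-independent, so it never matches the Arabic letters regardless of encoding settings.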

Upvotes: 1
