Reputation: 11
I'm new to R and I need to prepare a column of names and then impute sex, but I'm having some problems preparing the strings. This is an example of what I have:
Name example:
"alberto eduardo etchegaray de la cerda ."
What I need to do is remove every "de", "del", "lo", "los", "la" and "las", along with double white spaces, white space at the end of the string, and anything else that interferes with the names.
My code so far to clean the string is (I will eliminate the spaces in a second step):
str_replace_all('alberto eduardo etchegaray de la cerda',
'\\bdel*\\b|\\blos*\\b|\\blas*\\b|.$',
replacement=" ")
and the result:
"alberto eduardo etchegaray cerd "
The problem is that some words are being cut when I need them complete.
Upvotes: 1
Views: 98
Reputation: 49640
Others have given you better regular expressions to use, but did not explain why yours changed "cerda" to "cerd ". (I would recommend using the one by R. Schifini, as it is pretty clear.)
The problem with your regular expression is the ".$" at the end. This tells the function that, if none of the other alternatives match at that position, any single character followed by the end of the string should be replaced with the space. In your first example string there is a final ".", but in the string that you pass to str_replace_all the final character is the "a" in "cerda", and that is what gets replaced. I expect that what you really want is to replace a literal "." at the end of the string, so you need "\\.$" or "[.]$" to match a literal period, because the unescaped "." is a special character that matches any single character (except a newline in some cases).
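A minimal sketch of the difference, assuming the stringr package is loaded as in the question. The variable names (x, res_any, res_literal, res_class) are only for illustration:

```r
library(stringr)

# The string from the question's code, which has no final period
x <- "alberto eduardo etchegaray de la cerda"

# Unescaped dot: matches whatever the last character is, here the "a" in "cerda"
res_any <- str_replace_all(x, ".$", " ")
# "alberto eduardo etchegaray de la cerd "

# Escaped dot: matches only a literal period, so this string is left unchanged
res_literal <- str_replace_all(x, "\\.$", " ")
# "alberto eduardo etchegaray de la cerda"

# Character class form: replaces the period only when one is actually there
res_class <- str_replace_all("de la cerda .", "[.]$", " ")
# "de la cerda  "
```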
Upvotes: 0
Reputation: 9313
Use this regular expression (name holds the original string, including the final period):
str_replace_all(name,'\\b(del?|los?|las?)\\b|\\.',replacement=" ")
Result:
"alberto eduardo etchegaray cerda "
You could also use the following regexp to avoid inserting double spaces:
str_replace_all(name,'\\s?\\b(del?|los?|las?)\\b|\\.',replacement="")
Result:
"alberto eduardo etchegaray cerda "
Upvotes: 2
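Putting the pieces of this thread together, a hedged sketch of the full clean-up for a whole column could look like the following. It reuses the pattern from the answer above and adds stringr's str_squish() to collapse repeated spaces and trim the ends, which covers the "second line" mentioned in the question. The vector names_raw and its second and third entries are made up for illustration:

```r
library(stringr)

# Hypothetical sample column; only the first name comes from the question
names_raw <- c("alberto eduardo etchegaray de la cerda .",
               "maria del carmen  lopez de los rios",
               "delia lara")

# Whole-word particles (de, del, lo, los, la, las) or a literal period
particles <- "\\b(del?|los?|las?)\\b|\\."

# Replace each match with a space, then collapse and trim the whitespace
names_clean <- str_squish(str_replace_all(names_raw, particles, " "))
names_clean
# "alberto eduardo etchegaray cerda", "maria carmen lopez rios", "delia lara"
```

Because of the \\b word boundaries, names that merely start with one of the particles, such as "delia" or "lara", are left intact. Note that "\\." removes every period in the string, not only a final one; use "\\.$" instead if periods inside the names should survive.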