Pedro López

Reputation: 11

Preparing name strings for sex imputation

I'm new to R and I need to prepare a column of names and then impute sex, but I'm having problems preparing the strings. Here is an example of what I have:

Name example:

"alberto eduardo etchegaray de la cerda ."

What I need to do is remove every "de", "del", "lo", "los", "la" and "las", as well as double spaces, whitespace at the end of the string, and anything else that interferes with the names.

My code so far to clean the string is below (I will remove the spaces in a second step):

str_replace_all('alberto eduardo etchegaray de la cerda',
                '\\bdel*\\b|\\blos*\\b|\\blas*\\b|.$',
                replacement=" ")

and the result:

"alberto eduardo etchegaray     cerd "

The problem is that some words are being cut off when I need them complete.

Upvotes: 1

Views: 98

Answers (2)

Greg Snow

Reputation: 49640

Others have given you better regular expressions to use, but they did not explain why yours changed "cerda" to "cerd ". (I would recommend the one by R. Schifini, as it is pretty clear.)

The problem with your regular expression is the .$ at the end. It tells the function that, if none of the other alternatives match at that position, any single character followed by the end of the string should be matched and replaced with a space. Your first example string ends in a ., but the string you actually pass to str_replace_all ends in the "a" of "cerda", so that "a" is what gets replaced. I expect that what you really want is to replace a literal . at the end of the string, so you need \\.$ or [.]$. The unescaped . is a special character that matches any single character (except, in some cases, a newline).
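To see the difference side by side, here is a small sketch that runs both patterns on the two strings from the question. The variable names are only for illustration, and it assumes the stringr package is installed.

```r
library(stringr)

# The two strings from the question: one with the trailing " ." and one without.
with_period    <- "alberto eduardo etchegaray de la cerda ."
without_period <- "alberto eduardo etchegaray de la cerda"

# Unescaped: `.$` matches whatever the last character happens to be.
any_last_1 <- str_replace_all(with_period, ".$", " ")
any_last_2 <- str_replace_all(without_period, ".$", " ")
any_last_2
# "alberto eduardo etchegaray de la cerd "   (the final "a" is gone)

# Escaped: `\\.$` matches only a literal period at the end of the string.
literal_1 <- str_replace_all(with_period, "\\.$", " ")
literal_2 <- str_replace_all(without_period, "\\.$", " ")
literal_2
# "alberto eduardo etchegaray de la cerda"   (unchanged, no period to replace)
```

When the string really does end in a period, both patterns give the same result; they only differ when the last character is something else.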

Upvotes: 0

R. Schifini

Reputation: 9313

Use this regular expression, where name holds the example string from the question (including the trailing " ."):

str_replace_all(name,'\\b(del?|los?|las?)\\b|\\.',replacement=" ")

Result:

"alberto eduardo etchegaray     cerda  "

You could also use the following regexp to avoid inserting double spaces:

str_replace_all(name,'\\s?\\b(del?|los?|las?)\\b|\\.',replacement="")

Result:

"alberto eduardo etchegaray cerda "
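Since the question also asks to get rid of double spaces and whitespace at the end of the string, one possible way to finish the job is to replace the unwanted pieces with a space and then let str_squish() collapse and trim the whitespace. This is only a sketch: clean_name is a made-up helper name, the particle list is the one from the question, and the second test name is invented for illustration.

```r
library(stringr)

# Hypothetical helper; the particles listed are the ones named in the question.
clean_name <- function(x) {
  # Replace the particles, matched as whole words only, with a space.
  x <- str_replace_all(x, "\\b(de|del|lo|los|la|las)\\b", " ")
  # Replace literal periods with a space.
  x <- str_replace_all(x, "\\.", " ")
  # Collapse repeated whitespace and trim both ends.
  str_squish(x)
}

cleaned <- clean_name(c("alberto eduardo etchegaray de la cerda .",
                        "maria de los angeles lopez."))
cleaned
# "alberto eduardo etchegaray cerda" "maria angeles lopez"
```

Because the stringr functions are vectorised, the same helper can be applied directly to a whole column of names.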

Upvotes: 2
