str_extract() gives a different result calling a vector from dataframe - R

Question

I have this vector:

x <- c("De 1 a 2 semanas", "De 3 a 4 semanas", "Más de 6 semanas", "Menos de 1 semana")

And I'm trying to extract each value by a unique identifier:

str_extract(x, "1 sem|1 a 2|3 a 4|5 a 6|de 6 sem")

And it works:

[1] "1 a 2"    "3 a 4"    "de 6 sem" "1 sem"

However, when I call the vector from the dataframe:

> x$PVS9
[1] "De 1 a 2 semanas"  "De 3 a 4 semanas"  "Más de 6 semanas"  "Menos de 1 semana"
> x$PVS9 <- str_extract(x$PVS9, "1 sem|1 a 2|3 a 4|5 a 6|de 6 sem")
> x$PVS9
[1] "1 a 2" NA      NA      "1 sem"

Why is it giving those two NAs?

PS: You may find this question (and its answer) useful.

Here is the minimal reproducible example:

> dput(x)
structure(list(PVS9 = c("De 1 a 2 semanas", "De 3 a 4 semanas", 
"Más de 6 semanas", "Menos de 1 semana"), n = c(1L, 1L, 1L, 3L
), Porcentaje = c(0.17, 0.17, 0.17, 0.5)), row.names = c(NA, 
-4L), class = c("tbl_df", "tbl", "data.frame"))

Current Output:

> str_extract(x$PVS9, "1 sem|1 a 2|3 a 4|5 a 6|de 6 sem")
[1] "1 a 2" NA      NA      "1 sem"

Desired output:

[1] "1 a 2"    "3 a 4"    "de 6 sem" "1 sem"

Additional information:

Session Info:

> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=Spanish_Chile.1252  LC_CTYPE=Spanish_Chile.1252    LC_MONETARY=Spanish_Chile.1252 LC_NUMERIC=C                   LC_TIME=Spanish_Chile.1252

Class:

> class(x$PVS9)
[1] "character"

Encoding:

> Encoding(x$PVS9)
[1] "unknown" "unknown" "unknown" "unknown"

> guess_encoding(x$PVS9)
# A tibble: 3 x 2
  encoding   confidence
  <chr>           <dbl>
1 ISO-8859-1       0.98
2 ISO-8859-2       0.88
3 ISO-8859-9       0.33

Also:

> x$PVS9 == y
[1]  TRUE FALSE FALSE  TRUE

I was thinking of solving this by changing the encoding of the vector. Is this possible? If not, is there another way?

EDIT: Additional information, as requested.

What R thinks it is:

> sapply(x$PVS9, charToRaw)
$`De 1 a 2 semanas`
 [1] 44 65 20 31 20 61 20 32 20 73 65 6d 61 6e 61 73

$`De 3 a 4 semanas`
 [1] 44 65 20 33 a0 61 20 34 a0 73 65 6d 61 6e 61 73

$`Más de 6 semanas`
 [1] 4d e1 73 20 64 65 20 36 a0 73 65 6d 61 6e 61 73

$`Menos de 1 semana`
 [1] 4d 65 6e 6f 73 20 64 65 20 31 20 73 65 6d 61 6e 61

divibisan · Accepted Answer

At least part of the problem is due to the presence of strange characters that look the same as normal characters to humans, but are different to the computer.

The charToRaw function converts a character string into the raw bytes, shown in hexadecimal, that represent its characters to the computer. Let's take a look at the 2nd string, which didn't match for you, and compare it to what I see on my computer (where it does match):

#  This does NOT match
$`De 3 a 4 semanas`
 [1] 44 65 20 33 a0 61 20 34 a0 73 65 6d 61 6e 61 73

# This does match
$`De 3 a 4 semanas`
 [1] 44 65 20 33 20 61 20 34 20 73 65 6d 61 6e 61 73

There is a difference: the 5th and 9th bytes are 20 on my system and a0 on yours. What does that mean? You can use intToUtf8 to see how those characters render, though first we have to convert from hexadecimal to decimal:

# 20 in hexadecimal
# is 32 in decimal
intToUtf8(32)
[1] " "


# a0 in hexadecimal
# is 160 in decimal
intToUtf8(160)
[1] " "

So they both look like spaces to us, but to the computer they're totally different characters. If you look these numbers up in a Unicode code point table, you'll see that 32 is a normal space and 160 is a no-break space:

32  SPACE
160 NO-BREAK SPACE

Non-breaking spaces (aka &nbsp;) are often found in HTML documents, where they are used to create wider spaces (since multiple consecutive normal spaces are collapsed into just one).
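
If you want to check a whole vector for these characters without reading raw bytes by eye, something like the following might help. It is only a sketch in base R: the vector strs is a hypothetical stand-in for your column, and I'm assuming the odd character really is U+00A0, as your charToRaw output suggests.

```r
# Hypothetical vector mirroring the question's data; "\u00a0" is the no-break space
strs <- c("De 1 a 2 semanas", "De 3\u00a0a 4\u00a0semanas")

# strtoi() parses a hexadecimal string into its decimal value
nbsp_code  <- strtoi("a0", base = 16L)  # 160
space_code <- strtoi("20", base = 16L)  # 32

# Flag the elements that contain at least one no-break space
has_nbsp <- grepl("\u00a0", strs, fixed = TRUE)
print(has_nbsp)
```

Any element flagged TRUE contains at least one no-break space and is a candidate for the cleanup described next.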

So, how can we fix this? First, let's reproduce your data:

bad_str2 <- paste0('De 3', intToUtf8(160), 'a', intToUtf8(160), '4 semanas')

# Looks the same
bad_str2
[1] "De 3 a 4 semanas"

# But has the non-breaking spaces
charToRaw(bad_str2)
 [1] 44 65 20 33 c2 a0 61 c2 a0 34 20 73 65 6d 61 6e 61 73

# Regex does not work:
str_extract(bad_str2, "1 sem|1 a 2|3 a 4|5 a 6|de 6 sem")
[1] NA

Now, we can use gsub to replace the non-breaking spaces with regular spaces:

# The \u prefix means: interpret the following hexadecimal code as a character
# So \ua0 means the character with hex code 'a0', which is the nbsp
fixed_str <- gsub("\ua0", " ", bad_str2, fixed = TRUE)

# Still looks the same
fixed_str
[1] "De 3 a 4 semanas"

# But regex works now!
str_extract(fixed_str, "1 sem|1 a 2|3 a 4|5 a 6|de 6 sem")
[1] "3 a 4"
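
The same replacement can be applied to the whole data frame column before extracting. The sketch below rebuilds a data frame like yours, with the no-break spaces placed where your charToRaw output shows a0. It uses base R (regexpr and regmatches) for the extraction so that it runs without any packages, which is my substitution; with stringr loaded, str_extract(clean, pattern) should be usable in its place. The names clean, pattern and extracted are just for illustration.

```r
# Rebuild the column from the question; "\u00a0" marks the no-break spaces
# at the positions where charToRaw() showed a0
x <- data.frame(
  PVS9 = c("De 1 a 2 semanas", "De 3\u00a0a 4\u00a0semanas",
           "M\u00e1s de 6\u00a0semanas", "Menos de 1 semana"),
  stringsAsFactors = FALSE
)
pattern <- "1 sem|1 a 2|3 a 4|5 a 6|de 6 sem"

# Replace every no-break space in the column with a regular space
clean <- gsub("\u00a0", " ", x$PVS9, fixed = TRUE)

# Extract the first match per element, keeping NA where nothing matches
m <- regexpr(pattern, clean)
extracted <- rep(NA_character_, length(clean))
extracted[m != -1] <- regmatches(clean, m)

x$PVS9 <- extracted
print(x$PVS9)
```

Cleaning the data once, rather than making the pattern tolerate both kinds of space, also keeps later comparisons such as == behaving as expected.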
