I have this vector:
x <- c("De 1 a 2 semanas", "De 3 a 4 semanas", "Más de 6 semanas", "Menos de 1 semana")
I'm trying to extract a unique identifier from each value:
str_extract(x, "1 sem|1 a 2|3 a 4|5 a 6|de 6 sem")
And it works:
[1] "1 a 2" "3 a 4" "de 6 sem" "1 sem"
However, when I call the vector from the dataframe:
> x$PVS9
[1] "De 1 a 2 semanas" "De 3 a 4 semanas" "Más de 6 semanas" "Menos de 1 semana"
> x$PVS9 <- str_extract(x$PVS9, "1 sem|1 a 2|3 a 4|5 a 6|de 6 sem")
> x$PVS9
[1] "1 a 2" NA NA "1 sem"
Why is it giving those two NA values?
PS: You might find this question (and its answer) useful.
Here is the minimal reproducible example:
> dput(x)
structure(list(PVS9 = c("De 1 a 2 semanas", "De 3 a 4 semanas",
"Más de 6 semanas", "Menos de 1 semana"), n = c(1L, 1L, 1L, 3L
), Porcentaje = c(0.17, 0.17, 0.17, 0.5)), row.names = c(NA,
-4L), class = c("tbl_df", "tbl", "data.frame"))
Current Output:
> str_extract(x$PVS9, "1 sem|1 a 2|3 a 4|5 a 6|de 6 sem")
[1] "1 a 2" NA NA "1 sem"
Desired output:
[1] "1 a 2" "3 a 4" "de 6 sem" "1 sem"
Additional information:
Session Info:
> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=Spanish_Chile.1252 LC_CTYPE=Spanish_Chile.1252 LC_MONETARY=Spanish_Chile.1252 LC_NUMERIC=C LC_TIME=Spanish_Chile.1252
Class:
> class(x$PVS9)
[1] "character"
Encoding:
> Encoding(x$PVS9)
[1] "unknown" "unknown" "unknown" "unknown"
> guess_encoding(x$PVS9)
# A tibble: 3 x 2
  encoding   confidence
  <chr>           <dbl>
1 ISO-8859-1       0.98
2 ISO-8859-2       0.88
3 ISO-8859-9       0.33
Also:
> x$PVS9 == y
[1] TRUE FALSE FALSE TRUE
I was thinking of solving this by changing the encoding of the vector. Is this possible? If not, is there another way?
EDIT: Additional information, as requested.
What R thinks it is:
> sapply(x$PVS9, charToRaw)
$`De 1 a 2 semanas`
[1] 44 65 20 31 20 61 20 32 20 73 65 6d 61 6e 61 73
$`De 3 a 4 semanas`
[1] 44 65 20 33 a0 61 20 34 a0 73 65 6d 61 6e 61 73
$`Más de 6 semanas`
[1] 4d e1 73 20 64 65 20 36 a0 73 65 6d 61 6e 61 73
$`Menos de 1 semana`
[1] 4d 65 6e 6f 73 20 64 65 20 31 20 73 65 6d 61 6e 61
Reputation: 12165
At least part of the problem is due to the presence of strange characters that look the same as normal characters to humans, but are different to the computer:
The charToRaw function converts a character string into the raw bytes (printed in hexadecimal) that represent the characters to the computer. Let's take a look at the 2nd string, which didn't match for you, and compare it to what I see on my computer (where it does match):
# This does NOT match
$`De 3 a 4 semanas`
[1] 44 65 20 33 a0 61 20 34 a0 73 65 6d 61 6e 61 73
# This does match
$`De 3 a 4 semanas`
[1] 44 65 20 33 20 61 20 34 20 73 65 6d 61 6e 61 73
There is a difference: the 5th and 9th bytes are 20 on my system and a0 on yours. What does that mean? You can use intToUtf8 to see how those characters render, though first we have to convert from hexadecimal to decimal:
# 20 in hexadecimal
# is 32 in decimal
intToUtf8(32)
[1] " "
# a0 in hexadecimal
# is 160 in decimal
intToUtf8(160)
[1] " "
So they both look like spaces to us, but to the computer they're totally different characters. If you look these numbers up in a Unicode character table, you'll see that 32 (U+0020) is a normal space and 160 (U+00A0) is a no-break space:
32 SPACE
160 NO-BREAK SPACE
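The hexadecimal-to-decimal step does not have to be done by hand. As a minimal sketch (the variable names dec and hex are only illustrative), base R's strtoi and as.hexmode convert in both directions:

```r
# Convert the hexadecimal byte values to decimal code points
dec <- strtoi(c("20", "a0"), base = 16L)
dec
# [1]  32 160

# as.hexmode() goes the other way, from decimal back to hexadecimal
hex <- format(as.hexmode(dec))
hex
# [1] "20" "a0"
```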
Non-breaking spaces (aka &nbsp;) are often found in HTML documents, where they are used to create wider gaps (multiple consecutive normal spaces are collapsed to just one).
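Because these characters are invisible when printed, it can help to search for them explicitly. The following is a rough sketch on a made-up two-element vector (x and has_nbsp are illustrative names, and the strings are assumed to be valid UTF-8; data in another encoding may first need converting, for example with enc2utf8 or iconv):

```r
# Hypothetical sample: the second element hides two no-break spaces (U+00A0)
x <- c("De 1 a 2 semanas", "De 3\u00a0a 4\u00a0semanas")

# TRUE for every element that contains at least one no-break space
has_nbsp <- grepl("\u00a0", x, fixed = TRUE)
has_nbsp
# [1] FALSE  TRUE

# Character positions of the no-break spaces in the suspicious element
nbsp_pos <- which(utf8ToInt(x[2]) == 160L)
nbsp_pos
# [1] 5 9
```

The positions 5 and 9 line up with the a0 bytes in the charToRaw output from the question.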
So, how can we fix this? First, let's reproduce your data:
bad_str2 <- paste0('De 3', intToUtf8(160), 'a', intToUtf8(160), '4 semanas')
# Looks the same
bad_str2
[1] "De 3 a 4 semanas"
# But has the non-breaking spaces
charToRaw(bad_str2)
[1] 44 65 20 33 c2 a0 61 c2 a0 34 20 73 65 6d 61 6e 61 73
# Regex does not work:
str_extract(bad_str2, "1 sem|1 a 2|3 a 4|5 a 6|de 6 sem")
[1] NA
Now, we can use gsub to replace the non-breaking spaces with regular spaces:
# The \u prefix means: interpret the following hexadecimal code as a character
# So \ua0 is the character with hex code 'a0', i.e. the no-break space
fixed_str <- gsub("\ua0", " ", bad_str2, fixed = TRUE)
# Still looks the same
fixed_str
[1] "De 3 a 4 semanas"
# But regex works now!
str_extract(fixed_str, "1 sem|1 a 2|3 a 4|5 a 6|de 6 sem")
[1] "3 a 4"
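The same replacement should carry over to the whole column. Below is a sketch under a few assumptions: df, clean, and result are illustrative names, the column is rebuilt by hand with \u00a0 where the question's charToRaw output shows a0, and base R's regexpr/regmatches stand in for str_extract so the example runs without stringr. On a Latin-1 Windows session the strings may additionally need enc2utf8 before the replacement.

```r
# Rebuild the PVS9 column; \u00a0 stands in for the hidden no-break spaces
df <- data.frame(
  PVS9 = c("De 1 a 2 semanas", "De 3\u00a0a 4\u00a0semanas",
           "M\u00e1s de 6\u00a0semanas", "Menos de 1 semana"),
  stringsAsFactors = FALSE
)

# Replace every no-break space with a regular space
clean <- gsub("\u00a0", " ", df$PVS9, fixed = TRUE)

# Extract the first match per element, NA where nothing matches
pattern <- "1 sem|1 a 2|3 a 4|5 a 6|de 6 sem"
m <- regexpr(pattern, clean)
result <- rep(NA_character_, length(clean))
result[m > 0] <- regmatches(clean, m)
result
# [1] "1 a 2"    "3 a 4"    "de 6 sem" "1 sem"
```

With stringr loaded, str_extract(clean, pattern) should give the same vector.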