Reputation: 143
I'm analysing Facebook comments in R for sentiment analysis. Emojis are encoded in the text between <> symbols.
Example:
"Jesus te ama!!! <U+2764> Ou não...?<U+1F628> (fé em stand by)"
<U+2764> and <U+1F628> are emojis (heavy black heart and fearful face, respectively).
So I need to split words/numbers from punctuation/symbols, except inside emoji codes. Using the gsub function, I did this:
a1 <- "([[:alpha:]])([[:punct:]])"
a2 <- "([[:punct:]])([[:alpha:]])"
b <- "\\1 \\2"
gsub(a1, b, gsub(a2, b, "Jesus te ama!!! <U+2764> Ou não...?<U+1F628> (fé em stand by)"))
...but the result, logically, also affects the emoji codes:
[1] "Jesus te ama !!! < U +2764> Ou não ...?< U +1F628> ( fé em stand by )"
The objective is to create an exception for the text between <>: add spaces around it, but don't split inside it, i.e.:
[1] "Jesus te ama !!! <U+2764> Ou não ...? <U+1F628> ( fé em stand by )"
Note that:
- sometimes there is no space between the sentence/word/punctuation and an emoji code (one needs to be created)
- a punctuation sequence must stay joined (e.g. "!!!", "...?")
How can I do it?
Upvotes: 2
Views: 971
Reputation: 3140
> str <- "Jesus te ama!!! <U+2764> Ou não...?<U+1F628> (fé em stand by)"
> strsplit(str,"[[:space:]]|(?=[.!?])",perl=TRUE)
[[1]]
[1] "Jesus" "te" "ama" "!" "!" "!"
[7] "" "<U+2764>" "" "Ou" "não" "."
[13] "." "." "?" "<U+1F628>" "(fé" "em"
[19] "stand" "by)"
Upvotes: 1
Reputation: 627082
You may use the following regex solution:
a1 <- "(?<=<)U\\+\\w+>(*SKIP)(*F)|(?<=\\S)(?=<U\\+\\w+>)|(?<=[[:alpha:]])(?=[[:punct:]])|(?<=[[:punct:]])(?=[[:alpha:]])"
gsub(a1, " ", "Jesus te ama!!! <U+2764> Ou não...?<U+1F628> (fé em stand by)", perl=TRUE)
# => [1] "Jesus te ama !!! <U+2764> Ou não ...? <U+1F628> ( fé em stand by )"
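To apply the same substitution to a whole vector of comments, the call can be wrapped in a small helper. This is only a sketch: the function name split_keep_emoji and the two sample strings are made up for illustration, and the pattern is the same one as above, split over several lines so each alternative can carry a comment.

```r
# Hypothetical helper (name is an assumption, not part of base R).
split_keep_emoji <- function(x) {
  pattern <- paste0(
    "(?<=<)U\\+\\w+>(*SKIP)(*F)",        # skip over the inside of an emoji code
    "|(?<=\\S)(?=<U\\+\\w+>)",            # gap before an emoji code glued to text
    "|(?<=[[:alpha:]])(?=[[:punct:]])",   # gap between a letter and punctuation
    "|(?<=[[:punct:]])(?=[[:alpha:]])"    # gap between punctuation and a letter
  )
  # Every match is zero-width, so gsub only inserts spaces and removes nothing.
  gsub(pattern, " ", x, perl = TRUE)
}

# Made-up sample comments, ASCII only to keep the example locale-independent.
comments <- c("Wow!!!<U+1F628>", "ok<U+2764> fine")
result <- split_keep_emoji(comments)
print(result)
# [1] "Wow !!! <U+1F628>" "ok <U+2764> fine"
```

Note that, without a (*UCP) prefix, PCRE's [[:alpha:]] only covers ASCII letters, so an accented letter directly followed by punctuation may not be split; whether that matters depends on the data.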
This PCRE regex (see the perl=TRUE argument in the call to gsub) matches:
- (?<=<)U\\+\\w+>(*SKIP)(*F) - a U+ and 1+ word chars followed by >, if preceded with <; the match value is discarded with the PCRE verbs (*SKIP)(*F), and the next match is searched for from the end of this match
- | - or
- (?<=\\S)(?=<U\\+\\w+>) - a non-whitespace char must be present immediately to the left of the current location, and a <U+, 1+ word chars and > must be present immediately to the right of the current location
- | - or
- (?<=[[:alpha:]])(?=[[:punct:]]) - a letter must be present immediately to the left of the current location, and a punctuation char must be present immediately to the right of the current location
- | - or
- (?<=[[:punct:]])(?=[[:alpha:]]) - a punctuation char must be present immediately to the left of the current location, and a letter must be present immediately to the right of the current location
Upvotes: 1