Reputation: 683
I am trying to figure out what regex to use to extract the name from text. Each name has a first initial, a period, and a last name, followed by what seems to be a code for a space (<U+00A0>), which for some reason showed up when I uploaded the CSV to R.
Here are four examples of how the text is laid out:
D. Nowitzki<U+00A0>misses 2-pt jump shot from 17 ft
J. Calderon<U+00A0>misses 2-pt jump shot from 12 ft
Turnover by<U+00A0>M. Ellis<U+00A0>(bad pass; steal by<U+00A0>T. Splitter)
Defensive rebound byS. Marion
data$Player <- sub("(.*\\..*)<", "\\1", data$Play)
Upvotes: 2
Views: 40
Reputation: 626929
Your pattern, (.*\..*)<, captures into Group 1 any 0+ chars, as many as possible, then a . char, then any 0+ chars, as many as possible, and then a < is matched. So you match quite a lot of text, and it is not clear whether <U+00A0> is literal text or an entity standing for a non-breaking space in your data. If the latter is true, your pattern simply does not match because there is no < in the string.
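To see which of the two cases applies, a quick check along these lines may help. This is only a sketch: the two sample strings are hypothetical values modelled on the question's first example, one containing <U+00A0> as plain text and one containing a real non-breaking space.

```r
# Hypothetical sample values modelled on the question's first example
literal <- "D. Nowitzki<U+00A0>misses 2-pt jump shot from 17 ft"  # "<U+00A0>" as plain text
real    <- "D. Nowitzki\u00A0misses 2-pt jump shot from 17 ft"    # an actual non-breaking space
samples <- c(literal, real)

# fixed = TRUE searches for the exact substring, with no regex interpretation
has_literal <- grepl("<U+00A0>", samples, fixed = TRUE)
has_nbsp    <- grepl("\u00A0", samples, fixed = TRUE)

# The pattern from the question needs a "<" to match at all
orig_pattern <- "(.*\\..*)<"
out_literal <- sub(orig_pattern, "\\1", literal)  # only the "<" is dropped
out_real    <- sub(orig_pattern, "\\1", real)     # no "<", so nothing changes
```

In the literal case the whole match is replaced by Group 1, so the only visible effect is that the < disappears; in the non-breaking-space case the string comes back unchanged.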
I assume you want to extract the first match that starts with an uppercase letter as a whole word, followed by a dot, then 0 or more whitespace chars, and then 1+ letters. Hence, you may use
\b\p{Lu}\.\s*\p{L}+
See the regex demo.
Details
\b - a word boundary
\p{Lu} - any uppercase Unicode letter
\. - a dot
\s* - 0+ whitespaces
\p{L}+ - any 1+ Unicode letters
In R, you may easily use the pattern with stringr::str_extract, which extracts the first match only:
res <- stringr::str_extract(data$Play, "\\b\\p{Lu}\\.\\s*\\p{L}+")
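If stringr is not at hand, roughly the same first-match extraction can be sketched in base R with regexpr and regmatches. This is an alternative to the call above, not a replacement for it. The plays vector below is an assumption: it holds the four sample lines from the question as they are printed there, with <U+00A0> treated as literal text.

```r
# Sample plays copied from the question, "<U+00A0>" taken as literal text
plays <- c(
  "D. Nowitzki<U+00A0>misses 2-pt jump shot from 17 ft",
  "J. Calderon<U+00A0>misses 2-pt jump shot from 12 ft",
  "Turnover by<U+00A0>M. Ellis<U+00A0>(bad pass; steal by<U+00A0>T. Splitter)",
  "Defensive rebound byS. Marion"
)

pattern <- "\\b\\p{Lu}\\.\\s*\\p{L}+"

# regexpr() locates the first match per string; perl = TRUE enables \p{...}
m <- regexpr(pattern, plays, perl = TRUE)
found <- as.integer(m) != -1L

# Pre-fill with NA so rows without a match stay aligned with the input
player <- rep(NA_character_, length(plays))
player[found] <- regmatches(plays, m)
player
```

Rows without a match stay NA. That includes the last sample as it is written above: S. directly follows by, so there is no word boundary before the initial and the pattern does not match.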
Upvotes: 2