Unable to extract Name from text in R

Question

I am trying to figure out which regex to use to extract the player's name from text. Each name consists of a first initial, a period, and a last name, followed by what seems to be a code for a space, which for some reason showed up when I uploaded the CSV to R.

Here are four examples of how the text is laid out:

D. Nowitzkimisses 2-pt jump shot from 17 ft
J. Calderonmisses 2-pt jump shot from 12 ft
Turnover byM. Ellis(bad pass; steal byT. Splitter)
Defensive rebound byS. Marion

    data$Player <- sub("(.*\\..*)<", "\\1", data$Play)

Wiktor Stribiżew · Accepted Answer

Your pattern, (.*\..*)<, captures into Group 1 any zero or more characters, as many as possible, then a literal . character, then again any zero or more characters, as many as possible, and finally matches a <. So it matches quite a lot of text, and it is not clear whether the space code is literal text in your data or stands for a non-breaking space. If the latter is true, your pattern simply does not match, because there is no < in the string.
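To see why the original call can look like it does nothing, here is a minimal sketch of the no-match case. It assumes the stray code is a real non-breaking space (U+00A0) in the data, which the question does not confirm; the sample string is made up from the first example.

```r
# Assumed sample: the "space code" is taken to be a real non-breaking
# space (U+00A0) rather than literal text. Illustration only.
play <- "D. Nowitzki\u00a0misses 2-pt jump shot from 17 ft"

# The pattern from the question requires a literal "<" after the group.
player <- sub("(.*\\..*)<", "\\1", play)

# There is no "<" in the string, so nothing matches and sub() returns
# its input unchanged.
identical(player, play)
```

When sub() finds no match it returns the input as is, so under this assumption the new Player column would just be a copy of Play.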

I assume you want to extract the first match that starts with a single uppercase letter preceded by a word boundary, followed by a dot, then zero or more whitespace characters, and then one or more letters. Hence, you may use

\b\p{Lu}\.\s*\p{L}+

See the regex demo.

Details

\b - a word boundary
\p{Lu} - any uppercase Unicode letter
\. - a dot
\s* - 0+ whitespace characters
\p{L}+ - any 1+ Unicode letters
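One practical point when carrying this pattern into R: inside an ordinary R string literal each backslash has to be doubled. The sketch below shows the two equivalent spellings; the raw-string form assumes R 4.0.0 or later, and the variable names are only illustrative.

```r
# The regex exactly as the engine should see it, written as a raw string
# (raw strings need R >= 4.0.0).
pattern_raw <- r"(\b\p{Lu}\.\s*\p{L}+)"

# The same regex as an ordinary string literal: every backslash is doubled.
pattern_escaped <- "\\b\\p{Lu}\\.\\s*\\p{L}+"

# Both literals produce the same character string.
identical(pattern_raw, pattern_escaped)
```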

In R, you can use the pattern with stringr::str_extract, which extracts only the first match:

    res <- stringr::str_extract(data$Play, "\\b\\p{Lu}\\.\\s*\\p{L}+")
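If stringr is not available, roughly the same thing can be done in base R with regexpr() and regmatches(). This is only a sketch: extract_first is a hypothetical helper name, perl = TRUE is used so that \p{Lu} and \p{L} are understood, and the sample strings assume that a non-breaking space (U+00A0) sits where the stray code appeared in the question.

```r
# Hypothetical helper: first match of `pattern` in each element of `x`,
# or NA where there is no match.
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)   # -1 marks "no match"
  out <- rep(NA_character_, length(x))
  hit <- !is.na(m) & m > -1
  out[hit] <- regmatches(x, m)            # regmatches() skips non-matches
  out
}

# Assumed sample data: "\u00a0" stands in for the stray space code.
plays <- c(
  "D. Nowitzki\u00a0misses 2-pt jump shot from 17 ft",
  "J. Calderon\u00a0misses 2-pt jump shot from 12 ft",
  "Turnover by\u00a0M. Ellis\u00a0(bad pass; steal by\u00a0T. Splitter)",
  "Defensive rebound by\u00a0S. Marion"
)

players <- extract_first(plays, "\\b\\p{Lu}\\.\\s*\\p{L}+")
print(players)
```

Only the first name in each play is returned, so for the turnover line the player who lost the ball is picked up and the one who made the steal is ignored.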
