Jeff Henderson

Reputation: 683

Unable to extract Name from text in R

I am trying to figure out what regex to use to extract the name from text. Each name has a first initial, a period, and a last name, followed by what seems to be a code for a non-breaking space (<U+00A0>), which for some reason showed up when I uploaded the CSV to R.

Here are four examples of how the text is laid out:

D. Nowitzki<U+00A0>misses 2-pt jump shot from 17 ft
J. Calderon<U+00A0>misses 2-pt jump shot from 12 ft
Turnover by<U+00A0>M. Ellis<U+00A0>(bad pass; steal by<U+00A0>T. Splitter)
Defensive rebound by<U+00A0>S. Marion

    data$Player <- sub("(.*\\..*)<", "\\1", data$Play)

Upvotes: 2

Views: 40

Answers (1)

Wiktor Stribiżew

Reputation: 626929

Your pattern, (.*\..*)<, captures into Group 1 any zero or more characters (as many as possible), then a . character, then again any zero or more characters (as many as possible), and then a < is matched. So you match quite a lot of text, and it is not clear whether <U+00A0> is literal text or just the way a non-breaking space in your data is displayed. If the latter is true, your pattern simply does not match because there is no < in the string.
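To see the two cases side by side, here is a minimal sketch. The two strings are assumptions built from the first example in the question: one stores <U+00A0> as literal text, the other contains a real non-breaking space written as "\u00a0".

```r
# Assumed inputs for illustration; only one of these reflects the real data.
literal <- "D. Nowitzki<U+00A0>misses 2-pt jump shot from 17 ft"  # marker is literal text
nbsp    <- "D. Nowitzki\u00a0misses 2-pt jump shot from 17 ft"    # real non-breaking space

pattern <- "(.*\\..*)<"

# Literal case: Group 1 is put back, so only the "<" itself is removed
# and the rest of the line stays in place.
sub(pattern, "\\1", literal)

# Non-breaking-space case: there is no "<", so sub() finds no match
# and returns the input unchanged.
sub(pattern, "\\1", nbsp) == nbsp
```

In neither case does sub() leave just the name, because everything after the matched < stays in the string.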

I assume you want to extract the first match that starts with an uppercase letter as a whole word, followed by a dot, then any zero or more whitespace characters, and then one or more letters. Hence, you may use

\b\p{Lu}\.\s*\p{L}+

See the regex demo.

Details

  • \b - a word boundary
  • \p{Lu} - any uppercase Unicode letter
  • \. - a dot
  • \s* - 0+ whitespaces
  • \p{L}+ - any 1+ Unicode letters

In R, you may easily use the pattern with stringr::str_extract that extracts the first match only:

res <- stringr::str_extract(data$Play, "\\b\\p{Lu}\\.\\s*\\p{L}+")
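If you would rather not depend on stringr, the same pattern should also work in base R with perl = TRUE. This is only a sketch: the plays vector is made up from the examples in the question, and it assumes the <U+00A0> marker is a real non-breaking space (written here as "\u00a0").

```r
# Sample plays from the question; "\u00a0" stands in for the
# non-breaking space (an assumption about the real data).
plays <- c(
  "D. Nowitzki\u00a0misses 2-pt jump shot from 17 ft",
  "J. Calderon\u00a0misses 2-pt jump shot from 12 ft",
  "Turnover by\u00a0M. Ellis\u00a0(bad pass; steal by\u00a0T. Splitter)",
  "Defensive rebound by\u00a0S. Marion"
)

pattern <- "\\b\\p{Lu}\\.\\s*\\p{L}+"

# regexpr() locates the first match in each string;
# perl = TRUE is needed for the \p{..} Unicode classes.
m <- regexpr(pattern, plays, perl = TRUE)

# Keep NA for strings without a match so the result
# has the same length as the input.
players <- rep(NA_character_, length(plays))
players[m != -1] <- regmatches(plays, m)
players
```

Like str_extract, regexpr only returns the first match per string, so the third play gives M. Ellis and not T. Splitter.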

Upvotes: 2
