Jacek Kotowski

Reputation: 704

R: regexpr() how to use a vector in pattern parameter

I would like to find the positions of dictionary terms within a set of short texts. The problem is in the last lines of the following code, which is roughly based on the question "From of list of strings, identify which are human names and which are not":

library(tm)

pkd.names.quotes <- c(
  "Mr. Rick Deckard",
  "Do Androids Dream of Electric Sheep",
  "Roy Batty",
  "How much is an electric ostrich?",
  "My schedule for today lists a six-hour self-accusatory depression.",
  "Upon him the contempt of three planets descended.",
  "J.F. Sebastian",
  "Harry Bryant",
  "goat class",
  "Holden, Dave",
  "Leon Kowalski",
  "Dr. Eldon Tyrell"
) 


firstnames <- c("Sebastian", "Dave", "Roy",
                "Harry", "Dave", "Leon",
                "Tyrell")

dict  <- sort(unique(tolower(firstnames)))

corp <- VCorpus(VectorSource(pkd.names.quotes))
#strange but Corpus() gives wrong segment numbers for the matches.

tdm  <-
  TermDocumentMatrix(corp, control = list(tolower = TRUE, dictionary = dict))

inspect(corp)
inspect(tdm)

View(as.matrix(tdm))

data.frame(
  Name      = rownames(tdm)[tdm$i],
  Segment = colnames(tdm)[tdm$j],
  Content = pkd.names.quotes[tdm$j],
  Postion = regexpr(
    pattern = rownames(tdm)[tdm$i],
    text = tolower(pkd.names.quotes[tdm$j])
  )
)

The output comes with a warning, and only the first row is correct.

       Name Segment          Content Postion
1       roy       3        Roy Batty       1
2 sebastian       7   J.F. Sebastian      -1
3     harry       8     Harry Bryant      -1
4      dave      10     Holden, Dave      -1
5      leon      11    Leon Kowalski      -1
6    tyrell      12 Dr. Eldon Tyrell      -1

Warning message:
In regexpr(pattern = rownames(tdm)[tdm$i], text = tolower(pkd.names.quotes[tdm$j])) :
  argument 'pattern' has length > 1 and only the first element will be used

I know the solution with pattern=paste(vector,collapse="|"), but my vector can be very long (all popular names).

Is there an easy vectorized version of this command, or a solution that accepts a new pattern parameter for each row?

Upvotes: 6

Views: 3334

Answers (1)

Wiktor Stribiżew

Reputation: 626747

You may vectorize regexpr using mapply:

mapply is a multivariate version of sapply. mapply applies FUN to the first elements of each ... argument, the second elements, the third elements, and so on.

Use

data.frame(
  Name      = rownames(tdm)[tdm$i],
  Segment = colnames(tdm)[tdm$j],
  Content = pkd.names.quotes[tdm$j],
  Postion = mapply(regexpr, rownames(tdm)[tdm$i], tolower(pkd.names.quotes[tdm$j]), fixed=TRUE)
)

Result:

               Name Segment          Content Postion
roy             roy       3        Roy Batty       1
sebastian sebastian       7   J.F. Sebastian       6
harry         harry       8     Harry Bryant       1
dave           dave      10     Holden, Dave       9
leon           leon      11    Leon Kowalski       1
tyrell       tyrell      12 Dr. Eldon Tyrell      11
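
The same pairing can be tried without tm. This is only a minimal sketch, assuming two parallel character vectors; the names `patterns` and `texts` are illustrative and do not come from the question. The wrapper strips the match attributes with as.integer(), and USE.NAMES = FALSE drops the names that otherwise end up as row names in the data frame above.

```r
# Hypothetical parallel vectors: patterns[k] is searched for in texts[k].
patterns <- c("roy", "sebastian", "harry", "dave", "leon", "tyrell")
texts <- tolower(c("Roy Batty", "J.F. Sebastian", "Harry Bryant",
                   "Holden, Dave", "Leon Kowalski", "Dr. Eldon Tyrell"))

# regexpr() only uses the first pattern, so call it once per (pattern, text) pair.
positions <- mapply(
  function(p, x) as.integer(regexpr(p, x, fixed = TRUE)),
  patterns, texts,
  USE.NAMES = FALSE
)

print(positions)
```

A pair with no match would yield -1, as regexpr does on its own.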

Alternatively, use stringr str_locate:

Vectorised over string and pattern

It returns:

For str_locate, an integer matrix. First column gives start position of match, and second column gives end position.

Use

library(stringr)
str_locate(tolower(pkd.names.quotes[tdm$j]), fixed(rownames(tdm)[tdm$i]))[,1]

Note that fixed() and fixed=TRUE make the patterns match as literal strings (i.e. non-regex patterns). If you need regex matching instead, remove fixed() and fixed=TRUE.
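
To see why the distinction matters for names containing regex metacharacters, here is a small base R sketch. The text and pattern are made up for illustration and are not from the question.

```r
# Illustrative input: the pattern contains ".", which is a regex wildcard.
text <- "jxf j.f"
pattern <- "j.f"

# As a regex, "." matches any character, so "jxf" at position 1 matches.
regex_pos <- as.integer(regexpr(pattern, text))

# As a fixed string, only the literal "j.f" at position 5 matches.
fixed_pos <- as.integer(regexpr(pattern, text, fixed = TRUE))

print(c(regex = regex_pos, fixed = fixed_pos))
```

For a dictionary of plain names the two modes should agree, but fixed matching avoids surprises if an entry ever contains characters such as "." or "(".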

Upvotes: 3
