
Reputation: 1143

Find sequences of elements in vectors

I need some pointers on this. Actually, I don't necessarily need a fully-fledged solution here - some pointers to functions and/or packages would be great.

The problem: I want to find specific sequences in a character vector. The sequences can be somewhat "underspecified". That means that some of the elements should be fixed, but for some elements it does not matter how long they are or what they are exactly.

An example: Suppose I want to find the following pattern in a character vector:

1. The sequence should begin with "Out of" or "out of".
2. The sequence should end with "reasons".
3. In between, there may be other elements. It does not matter how many elements there are (zero would also be OK) or what exactly they are.
4. Between 1. and 2., there must not be a ".", "!" or "?".
5. There should be a parameter that controls how long the sequence in 3. may be at most to still produce a result.

Return value of the function should be the intervening elements and/or their indices in the vector.

So, the function should "behave" like this:

c("Out", "of", "specific", "reasons", ".") Return "specific"
c("Out", "of", "very", "specific", "reasons", ".") Return c("very", "specific")
c("out", "of", "curiosity", ".", "He", "had", "his", "reasons") Return "" or NA or NULL, which one doesn't matter - just a signal that there is no result.

As I said: I don't need a full solution. Any pointers to packages that already implement such functionality are appreciated!

Optimally, I don't want to rely on a solution that first pastes the text and then uses regex for matching.

Thanks a lot!

Upvotes: 0

Answers (2)

David O

Reputation: 813

I would be really curious to learn of a package that serves your needs. My inclination would be to collapse the strings and use regular expressions, or find a programmer, or use Perl. But here's one extensible solution in R, with a few more cases to experiment on. Not very elegant, but see if this has some utility.

# Recreate data as a list with a few more edge cases
  txt1 <- c(
    "Out of specific reasons.",
    "Out of very specific reasons.",
    "Out of curiosity. He had his reasons.",
    "Out of reasons.",
    "Out of one's mind.",
    "For no particular reason.",
    "Reasons are out of the ordinary.",
    "Out of time and money and for many good reasons, it seems.", 
    "Out of a box, a car, and for random reasons.",
    "Floop foo bar.")
  txt2 <- strsplit(txt1, "[[:space:]]+") # split each line on whitespace
  txt3 <- lapply(txt2, strsplit, "(?=[[:punct:]])", perl = TRUE) # split punctuation off each word
  txt <- lapply(txt3, unlist) # create list of tokens from each line

# Define characters to exclude: [. ! and ?] but not [,]
  exclude <- "[.!?]"

# Assign acceptable limit to separation
  lim <- 5 # try 7 and 12 to experiment

# Create indices identifying each of the enumerated conditions
  fun1 <- function(x, pat) grep(pat, x, ignore.case = TRUE)
  index1 <- lapply(txt, fun1, "out")
  index2 <- lapply(txt, fun1, "of")
  index3 <- lapply(txt, fun1, "reasons")
  index4 <- lapply(txt, fun1, exclude)

# Create logical vectors from indices satisfying the conditions
  fun2 <- function(set, val) val[1] %in% set
  cond1 <- sapply(index1, fun2, val = 1) & sapply(index2, fun2, val = 2)
  cond2 <- sapply(index3, "[", 1) < lim + 2 + 2 # at most lim tokens between 'of' (index 2) and the first 'reasons'
  cond3 <- sapply(index3, max, -Inf) < sapply(index4, min, Inf)

# Combine logical vectors to a single logical vector
  valid <- cond1 & cond2 & cond3
  valid <- ifelse(is.na(valid), FALSE, valid)

# Examine selected original lines
  print(txt1[valid])

# Helper function to build the indices of the elements between 'of' and 'reasons'
  fun3 <- function(index2, index3, valid) {
    found <- rep(list(NULL), length(index2))
    found[valid] <- Map(seq, index2[valid], index3[valid])
    found <- lapply(found, tail, -1) # drop the index of 'of'
    found <- lapply(found, head, -1) # drop the index of 'reasons'
    found # return the list explicitly
  }

# Extract indices of the intervening elements from valid list members
  idx <- fun3(index2, index3, valid)

# Return the results or "" for no intervening text or NULL for no match
  ans <- Map(function(x, i) {
    if (is.null(i)) NULL # no match found
    else if (length(i) == 0) "" # no intervening elements
    else x[i]}, # all intervening elements <= lim
  txt, idx)

# Show found (non-NULL) values
  ans[!sapply(ans, is.null)]
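
For comparison, the collapse-and-regex route mentioned at the top of this answer might look roughly like the sketch below. The function name `extract_by_regex` and its `max_len` argument are made up for illustration, and the sketch assumes that no token contains a space and that any token containing ".", "!" or "?" should block a match.

```r
# Hypothetical helper: paste the tokens, match with one regex, split again.
extract_by_regex <- function(tokens, max_len = 5) {
  text <- paste(tokens, collapse = " ")
  # "Out of"/"out of", then up to max_len tokens free of . ! ?, then "reasons"
  pattern <- sprintf("\\b[Oo]ut of((?: [^ .!?]+){0,%d}?) reasons\\b", max_len)
  m <- regmatches(text, regexec(pattern, text, perl = TRUE))[[1]]
  if (length(m) == 0) return(NULL)      # no match at all
  between <- trimws(m[2])               # captured text between the anchors
  if (between == "") return(character(0)) # match, but nothing in between
  strsplit(between, " ", fixed = TRUE)[[1]]
}

extract_by_regex(c("Out", "of", "very", "specific", "reasons", "."))
extract_by_regex(c("out", "of", "curiosity", ".", "He", "had", "his", "reasons"))
```

Here `NULL` signals "no match" and `character(0)` signals "match with zero intervening elements", mirroring the distinction made in the code above.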

Upvotes: 1

Bogdan

Reputation: 161

So let's assume your example

x <- c("Out", "of", "very", "specific", "reasons", ".")

We first need to get the beginning of the indicator

i_Beginning <- as.numeric(grep("Out|out", x))

and the ending

i_end <-  as.numeric(grep("reasons", x))

We also need to check that "Out" is followed by "of"

Is_Of <- grepl("Of|of", x[i_Beginning +1])

And if this is true we extract the other elements

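extraction <- NULL # stays NULL when there is no match
# isTRUE() guards against a missing "Out"; length() against a missing "reasons"
if (isTRUE(Is_Of) && length(i_end) == 1 && i_end > i_Beginning + 2) {
  # use the full range between "of" and "reasons", not only its two endpoints
  extraction <- x[(i_Beginning + 2):(i_end - 1)]
}
print(extraction)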
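
Note that `grep("Out|out", x)` also matches substrings such as "Without", and the steps above check neither the punctuation condition nor a maximum length. One way to cover these is to walk the token vector directly with exact comparisons. The following is only a sketch: `find_between` and all of its argument names are hypothetical, and it assumes the start phrase may be matched case-insensitively.

```r
# Hypothetical helper: returns the indices of the intervening tokens,
# integer(0) if nothing lies in between, or NULL if there is no match.
find_between <- function(tokens, start = c("out", "of"), end = "reasons",
                         stop_tokens = c(".", "!", "?"), max_len = 5) {
  n <- length(tokens)
  k <- length(start)
  if (n < k + 1) return(NULL)
  low <- tolower(tokens)
  for (i in seq_len(n - k)) {
    # exact (case-insensitive) comparison of the start phrase
    if (!all(low[i:(i + k - 1)] == start)) next
    first <- i + k                   # first position after the start phrase
    last <- min(n, first + max_len)  # furthest position the end token may take
    for (j in first:last) {
      if (tokens[j] %in% stop_tokens) break # sentence boundary: give up here
      if (tokens[j] == end) return(seq_len(j - first) + first - 1L)
    }
  }
  NULL
}

x <- c("Out", "of", "very", "specific", "reasons", ".")
idx <- find_between(x)
x[idx]
```

Returning indices rather than the tokens themselves leaves both options open: `x[idx]` gives the intervening elements, and `max_len` limits how many of them are allowed.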

Upvotes: 0
