J. Mini

Reputation: 1610

What is the structure of the output of gregexpr?

Take item <- gregexpr("abab", "abababab"). This object seems to have some very strange properties. For example, ?gregexpr reports that it's a list of integer vectors; typeof, class, mode and storage.mode all say that it's a list; viewing the object in RStudio reports that it's a list containing only the integer vector (1, 5); and asking R to print it gives this:

> item
[[1]]
[1] 1 5
attr(,"match.length")
[1] 4 4
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

some of which are clearly not integers.

This gives me my question: What is this object? In particular, I've never seen attr before. Is this some hidden S3 trickery?

Upvotes: 0

Views: 1091

Answers (1)

Allan Cameron

Reputation: 173813

The answer is that gregexpr returns a list of integer vectors, just as the documentation says. Your own example is maybe a little misleading, because gregexpr can take a vector of strings for its second argument, not just a single string. It therefore needs to return one integer vector for each string in the supplied character vector, giving the starting positions of the matches in that string. It stores all of these vectors together in a list.

So why doesn't the result look like a normal list of integer vectors (like you would get with list(1:5, 6:10))? It's because each of the vectors in the list returned by gregexpr also has three attributes: match.length, index.type and useBytes.

Attributes are very useful in R. They are a way of storing supplemental information associated with a variable in a way that doesn't change its other behaviour.

We can set attributes by simply doing this:

x <- 1
attr(x, "Number") <- TRUE

So when we look at x, we get:

x
#> [1] 1
#> attr(,"Number")
#> [1] TRUE

We can access the attribute by calling attr:

attr(x, "Number")
#> [1] TRUE

But we can still use x as a numeric variable as if it didn't have this attribute:

x + 2
#> [1] 3
#> attr(,"Number")
#> [1] TRUE

Compare this to an alternative way you might attempt to store useful attributes - in a named list:

x <- list(1, Number = TRUE)

This contains the information we want, but we have lost functionality:

x + 2
#> Error in x + 2: non-numeric argument to binary operator
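Going back to the attribute approach, here is a small sketch of how attributes behave in practice. The variable y is only an illustrative name. It shows that attributes() lists everything attached to an object, and that custom attributes survive arithmetic but are dropped when you subset with [.

```r
# y is a hypothetical example vector, not part of the original question
y <- c(10, 20, 30)
attr(y, "Number") <- TRUE

# attributes() returns all attributes as a named list
names(attributes(y))

# Arithmetic keeps the custom attribute...
attr(y + 2, "Number")

# ...but subsetting with `[` drops it
is.null(attr(y[1], "Number"))
```

So attributes are convenient for carrying extra information, but they should not be relied on to survive every transformation.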

With regard to gregexpr, it needs to return a list because it is vectorized. Look:

gregexpr("abab",c("abababab", "zzababzz"))
#> [[1]]
#> [1] 1 5
#> attr(,"match.length")
#> [1] 4 4
#> attr(,"index.type")
#> [1] "chars"
#> attr(,"useBytes")
#> [1] TRUE
#> 
#> [[2]]
#> [1] 3
#> attr(,"match.length")
#> [1] 4
#> attr(,"index.type")
#> [1] "chars"
#> attr(,"useBytes")
#> [1] TRUE

We have passed multiple strings, so we need a list of multiple results, one for each string. Each list element reports back on the position of matches, but it also tells you the length of the matches in a named attribute. This is actually quite a neat way of storing all the information you might need about regex matches.
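To address the "S3 trickery" worry directly, the following sketch inspects the object from the question. It assumes nothing beyond base R; the names m, starts and lens are just illustrative. It checks that the result has no class set, then uses the start positions together with match.length to recover the matched text.

```r
item <- gregexpr("abab", "abababab")

# The outer object is a plain list; no S3/S4 class is set on it
class(item)
is.object(item)

# Its single element is an integer vector that carries attributes
m <- item[[1]]
typeof(m)
names(attributes(m))

# Start position plus match length recovers each matched substring
starts <- as.integer(m)            # as.integer() strips the attributes
lens   <- attr(m, "match.length")
substring("abababab", starts, starts + lens - 1)
```

Base R's regmatches() performs the same extraction for you, so regmatches("abababab", item) is the usual shortcut.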

Suppose, for example, that I had a bunch of strings:

my_strings <- c("A1234", "35467 65432", "13456765")

and I wanted to know the length of the longest run of consecutive digits in each string. Then I could do:

result <- gregexpr("\\d+", my_strings)
sapply(result, function(y) max(attr(y, "match.length")))
#> [1] 4 5 8
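One caveat worth sketching: when a string contains no match at all, gregexpr reports -1 for both the position and the match.length, so the approach above would return -1 for that string. The variant below is a minimal sketch with a made-up input vector that treats "no match" as a run length of 0; it also uses vapply so the return type is checked.

```r
# Hypothetical input: the second string contains no digits
my_strings <- c("A1234", "no digits here", "13456765")
result <- gregexpr("[0-9]+", my_strings)

# Longest match per string; a string with no match yields -1 here
lens <- vapply(result, function(y) max(attr(y, "match.length")), integer(1))

# Treat -1 (no match) as a run length of 0
lens[lens < 0] <- 0L
lens
```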

Upvotes: 3
