Reputation: 1610
Take item <- gregexpr("abab", "abababab"). This object seems to have some very strange properties. For example, ?gregexpr reports that it's a list of integer vectors; typeof, class, mode and storage.mode all say that it's a list; viewing the object in RStudio reports that it's a list containing only the integer vector (1,5); and asking R to return it gives this:
> item
[[1]]
[1] 1 5
attr(,"match.length")
[1] 4 4
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
some of which are clearly not integers.
This gives me my question: What is this object? In particular, I've never seen attr before. Is this some hidden S3 trickery?
Upvotes: 0
Views: 1091
Reputation: 173813
The answer is that gregexpr returns a list of integer vectors. Your own example is maybe a little misleading, because gregexpr can take a vector of strings for its second argument, not just a single string. It therefore needs to return one vector for each string in the supplied character vector, giving the starting positions of the matches in that string. It stores all of these vectors together in a list.
So why doesn't the result look like a normal list of integer vectors (like you would get with list(1:5, 6:10))? It's because each of the vectors in the list returned by gregexpr also has three attributes.
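To see this directly, here is a small sketch that inspects the object from the question with typeof and attributes. The variable names outer_type, inner_type and attr_names are just illustrative.

```r
item <- gregexpr("abab", "abababab")

# The container is a list, but its single element is an integer vector
outer_type <- typeof(item)
inner_type <- typeof(item[[1]])
outer_type
#> [1] "list"
inner_type
#> [1] "integer"

# The extra lines in the printout are attributes attached to that vector
attr_names <- names(attributes(item[[1]]))
attr_names
#> [1] "match.length" "index.type"   "useBytes"
```

So the values that are "clearly not integers" are not elements of the vector at all; they are metadata hanging off it.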
Attributes are very useful in R. They are a way of storing supplemental information associated with a variable in a way that doesn't change its other behaviour.
We can set attributes by simply doing this:
x <- 1
attr(x, "Number") <- TRUE
So when we look at x, we get:
x
#> [1] 1
#> attr(,"Number")
#> [1] TRUE
We can access the attribute by calling attr:
attr(x, "Number")
#> [1] TRUE
But we can still use x as a numeric variable as if it didn't have this attribute:
x + 2
#> [1] 3
#> attr(,"Number")
#> [1] TRUE
Compare this to an alternative way you might attempt to store useful attributes - in a named list:
x <- list(1, Number = TRUE)
This contains the information we want, but we have lost functionality:
x + 2
#> Error in x + 2: non-numeric argument to binary operator
With regard to gregexpr, it needs to return a list because it is vectorized. Look:
gregexpr("abab",c("abababab", "zzababzz"))
#> [[1]]
#> [1] 1 5
#> attr(,"match.length")
#> [1] 4 4
#> attr(,"index.type")
#> [1] "chars"
#> attr(,"useBytes")
#> [1] TRUE
#>
#> [[2]]
#> [1] 3
#> attr(,"match.length")
#> [1] 4
#> attr(,"index.type")
#> [1] "chars"
#> attr(,"useBytes")
#> [1] TRUE
We have passed multiple strings, so we need a list of multiple results, one for each string. Each list element reports back on the position of matches, but it also tells you the length of the matches in a named attribute. This is actually quite a neat way of storing all the information you might need about regex matches.
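As a rough illustration of why having both pieces is handy, the sketch below combines the start positions with the match.length attribute to pull out the matched text by hand, then does the same with base R's regmatches. The names strings, m, starts, lens and manual are illustrative only.

```r
strings <- c("abababab", "zzababzz")
m <- gregexpr("abab", strings)

# Manual extraction for the first string: start positions plus lengths
starts <- as.integer(m[[1]])            # as.integer() drops the attributes
lens   <- attr(m[[1]], "match.length")
manual <- substring(strings[1], starts, starts + lens - 1)
manual
#> [1] "abab" "abab"

# regmatches() uses the same information for every string at once
extracted <- regmatches(strings, m)
extracted
#> [[1]]
#> [1] "abab" "abab"
#>
#> [[2]]
#> [1] "abab"
```

In practice regmatches is usually the more convenient route; the manual version is only there to show what the attributes are for.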
Suppose, for example, that I had a bunch of strings:
my_strings <- c("A1234", "35467 65432", "13456765")
and I wanted to know the maximum number of consecutive digits in each string. Then I could do:
result <- gregexpr("\\d+", my_strings)
sapply(result, function(y) max(attr(y, "match.length")))
#> [1] 4 5 8
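One thing to watch for: when a string contains no match at all, gregexpr reports a position of -1 with a match.length of -1, so the maximum comes back as -1 rather than 0. A minimal sketch, assuming you would prefer 0 in that case (the "no digits" string and the names with_gap, res, longest and cleaned are made up for illustration):

```r
with_gap <- c("A1234", "no digits", "13456765")
res <- gregexpr("\\d+", with_gap)

# vapply() is a stricter sapply(): it insists on one integer per string
longest <- vapply(res, function(y) max(attr(y, "match.length")), integer(1))
longest
#> [1]  4 -1  8

# Clamp the "no match" marker (-1) to zero
cleaned <- pmax(longest, 0L)
cleaned
#> [1] 4 0 8
```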
Upvotes: 3