chaz

Reputation: 63

R beginning match count

I am using R and have the following string below:

s <- "\t\t\t   \t\t\thello    world   !  \t\t\thello"

I want to count the whitespace characters at the start of the string only, not anywhere else. Whitespace between the words should be ignored, and only the leading run should be counted. The result here should be 9.

I have tried the following but it only returns a count of "1" ...

sapply(regmatches(s, gregexpr('^(\\s)', s)), length)

I am not very good at regex, so any help is appreciated.

Upvotes: 6

Views: 155

Answers (3)

Rich Scriven

Reputation: 99331

For matching the first occurrence, regexpr() would be more appropriate than gregexpr(). As a result of that switch, sapply() will no longer be necessary because regexpr() returns an atomic vector whereas gregexpr() returns a list.

You could use the following regular expression, looking at the match.length attribute from the result of regexpr().

attr(regexpr("^\\s+", s), "match.length")
# [1] 9

Explanation of the regular expression:

  • ^ Anchors the match at the beginning of the string.
  • \\s Matches a whitespace character: tab, newline, vertical tab, form feed, carriage return, or space.
  • + Matches the preceding item one or more times.
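One caveat with this approach: when a string has no leading whitespace, regexpr() reports no match and match.length is -1 rather than 0. Below is a minimal sketch of a wrapper that handles that case and also works on a character vector. The name count_leading_ws is a hypothetical helper, not part of base R.

```r
# Hypothetical helper: count leading whitespace characters per string.
count_leading_ws <- function(x) {
  # match.length is -1 where "^\\s+" does not match at all
  len <- attr(regexpr("^\\s+", x), "match.length")
  # Clamp the "no match" value (-1) to a count of zero
  pmax(len, 0L)
}

s <- "\t\t\t   \t\t\thello    world   !  \t\t\thello"
count_leading_ws(c(s, "hello", "  x"))
# [1] 9 0 2
```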

Reference: http://en.wikibooks.org/wiki/R_Programming/Text_Processing

Upvotes: 3

Avinash Raj

Reputation: 174706

You could try this also.

> sapply(gregexpr("[^\\h].*(*SKIP)(*F)|\\h", s, perl = TRUE), length)
[1] 9
> sapply(gregexpr("\\S.*(*SKIP)(*F)|\\h", s, perl = TRUE), length)
[1] 9

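\\h matches a horizontal whitespace character (the backslash is doubled because the pattern is written as an R string). In the second pattern, \\S matches a non-space character, and the following .* matches everything after it up to the end of the line; [^\\h] plays the same role in the first pattern. (*SKIP)(*F) then forces that match to fail and stops the engine from retrying inside the skipped text. The alternative after the |, that is, \\h, therefore matches only the remaining horizontal whitespace characters (i.e., the ones at the start of the string).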
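Note that sapply(..., length) reports 1 for a string with no leading whitespace, because gregexpr() returns a single -1 when nothing matches. Here is a minimal sketch that counts only real matches, assuming the same (*SKIP)(*F) pattern as above. The name count_leading_h is a hypothetical helper.

```r
# Hypothetical helper: count leading horizontal whitespace per string
# using the (*SKIP)(*F) pattern from this answer.
count_leading_h <- function(x) {
  m <- gregexpr("\\S.*(*SKIP)(*F)|\\h", x, perl = TRUE)
  # Match positions are >= 1; a lone -1 means "no match", so count positives only
  vapply(m, function(pos) sum(pos > 0), integer(1))
}

s <- "\t\t\t   \t\t\thello    world   !  \t\t\thello"
count_leading_h(c(s, "hello"))
# [1] 9 0
```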

Upvotes: 0

hwnd

Reputation: 70732

Another way to solve this is to anchor the pattern with \G. \G is an anchor that matches at one of two positions: the beginning of the string, or the position where the previous match ended.

sapply(gregexpr("\\G\\s", s, perl = TRUE), length)
# [1] 9
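To see what \G contributes, this small sketch runs the same pattern with and without the anchor. The name count_matches is a hypothetical helper; it counts positive match positions so that a no-match result (-1) is reported as 0 rather than 1.

```r
# Hypothetical helper: number of matches of a PCRE pattern in one string.
count_matches <- function(pattern, x) {
  pos <- gregexpr(pattern, x, perl = TRUE)[[1]]
  sum(pos > 0)  # gregexpr() yields -1 when there is no match
}

s <- "\t\t\t   \t\t\thello    world   !  \t\t\thello"

# Without \G, every whitespace character in the string is counted
count_matches("\\s", s)

# With \G, each match must start where the previous one ended,
# so only the unbroken run at the start is counted
count_matches("\\G\\s", s)
# [1] 9
```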

Upvotes: 2
