Rafael Maia

Reputation: 73

How to subset vector based on string character?

I have a vector composed of entries such as "ZZZ1Z01Z0ZZ0", "1001ZZ0Z00Z0", and so on, and I want to subset this vector based on conditions such as:

  1. The third character is a Z
  2. The third AND seventh characters are Z
  3. The third AND seventh characters are Z, AND none of the other characters are Z

I tried playing around with strsplit and grep, but I couldn't figure out a way to restrict my conditions based on the position of the character in the string. Any suggestions?

Many thanks!

Upvotes: 7

Views: 12472

Answers (3)

Richie Cotton

Reputation: 121077

Expanding Josh's answer, you want

your_dataset <- data.frame(
  z = c("ZZZ1Z01Z0ZZ0", "1001ZZ0Z00Z0")
)
# The trailing "$" anchors the end of the string, so no Z can follow the 7th character
regexes <- c("^..Z", "^..Z...Z", "^[^Z]{2}Z[^Z]{3}Z[^Z]*$")

lapply(regexes, function(rx)
{
  subset(your_dataset, grepl(rx, z))
})

Also consider replacing grepl(rx, z) with str_detect(z, rx), using the stringr package. (There's no real difference except for slightly more readable code.)

Upvotes: 4

Dason

Reputation: 61933

You can do the first two without regular expressions by using the substr function to pull out specific characters, if you want.

# Grab the third character in each element and compare it to Z
substr(z, 3, 3) == "Z"
# Check if the 3rd and 7th characters are both Z
(substr(z, 3, 3) == "Z") & (substr(z, 7, 7) == "Z")  

However, the regular expression approach Joshua gave is more flexible, and implementing your third restriction with substr would be a pain. Regular expressions are much better suited to a problem like your third restriction, and learning how to use them is never a bad idea.
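
For comparison, the third restriction can also be checked by position without writing one long pattern. This is only a minimal sketch: it uses gregexpr (not substr) to list where every Z occurs and then compares those positions to 3 and 7. The last two strings in z and the names z_positions and only_3_and_7 are made up for illustration.

```r
# The first two strings are from the question; the last two are hypothetical test inputs
z <- c("ZZZ1Z01Z0ZZ0", "1001ZZ0Z00Z0", "10Z100Z10000", "10Z100Z1000Z")

# For each string, find the positions of every literal "Z"
z_positions <- gregexpr("Z", z, fixed = TRUE)

# Keep a string only if its Z positions are exactly 3 and 7
# (as.integer drops the match attributes so identical() compares plain vectors)
only_3_and_7 <- vapply(
  z_positions,
  function(m) identical(as.integer(m), c(3L, 7L)),
  logical(1)
)

z[only_3_and_7]
```

This works, but it takes more code than a single anchored regular expression does.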

Upvotes: 2

Joshua Ulrich

Reputation: 176648

You can do this with regular expressions (see ?regexp for details on regular expressions).

grep returns the indices of the elements that match, or a zero-length vector if none do. You may want to use grepl instead, since it returns a logical vector you can use to subset.

z <- c("ZZZ1Z01Z0ZZ0", "1001ZZ0Z00Z0")
# 3rd character is Z ("^" is start of string, "." is any character)
grep("^..Z", z)
# 3rd and 7th characters are Z
grep("^..Z...Z", z)
# 3rd and 7th characters are Z, no other characters are Z
# "[]" defines a "character class" and "^" in a character class negates the match
# "{n}" repeats the preceding match n times, "*" repeats it zero or more times
# "$" is end of string, so no Z can follow the 7th character
grep("^[^Z]{2}Z[^Z]{3}Z[^Z]*$", z)
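
To actually subset the vector, the same patterns can be passed to grepl and the logical result used as an index. A small sketch, assuming two extra made-up strings are appended to z so that each condition has something to match; the third pattern is anchored with "$" so that a Z after the 7th character is rejected. The result names are illustrative only.

```r
# The first two strings are from the question; the last two are hypothetical test inputs
z <- c("ZZZ1Z01Z0ZZ0", "1001ZZ0Z00Z0", "10Z100Z10000", "10Z100Z1000Z")

# grepl returns TRUE/FALSE per element, which can index z directly
third              <- z[grepl("^..Z", z)]
third_seventh      <- z[grepl("^..Z...Z", z)]
only_third_seventh <- z[grepl("^[^Z]{2}Z[^Z]{3}Z[^Z]*$", z)]

third
third_seventh
only_third_seventh
```

With these inputs, only "10Z100Z10000" satisfies the third condition, because "10Z100Z1000Z" has an extra Z in the last position.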

Upvotes: 12
