Reputation: 8458

Extracting email addresses (with a known domain) from a character vector in R

I have a character vector (myVector) which contains several instances of email addresses scattered through a long string of semi-cleaned HTML stored in a single entry in the vector.

I know the relevant domain name ("@domain.com") and I want to extract each email address associated with that domain name (e.g. "help@domain.com") preceded by white space.

I have tried the following code, but it doesn't deliver the right substring indices:

gregexpr("\\s .+?@domain.com", myVector)

Any thoughts on (a) how I can fix the regular expression, and (b) whether there is a more elegant solution?

Upvotes: 1

Answers (3)

Marek

Reputation: 50753

You want a space followed by non-space characters, so gregexpr("\\s\\S+@domain.com", myVector) should be fine (but each match includes the extra space at the start).

As an alternative solution, take a look at the stringr package:

library(stringr)
str_extract_all(myVector, "\\s\\S+@domain.com")

Or use str_extract_all(myVector, "\\S+@domain.com"), which also returns addresses at the start of the string (and without the extra space).

Examples:

myVector <- "one@domain.com and two@domain.com and three@domain.com. What about:four@domain.com and five@domain.com"
gregexpr("\\s\\S+@domain.com", myVector)
# [[1]]
# [1] 19 38 61 87
# attr(,"match.length")
# [1] 15 17 22 16
# attr(,"useBytes")
# [1] TRUE

str_extract_all(myVector, "\\s\\S+@domain.com")
# [1] " two@domain.com"        " three@domain.com"      " about:four@domain.com"
# [4] " five@domain.com"

str_extract_all(myVector, "\\S+@domain.com")
# [1] "one@domain.com"        "two@domain.com"        "three@domain.com"
# [4] "about:four@domain.com" "five@domain.com"

(about:four is a corner case to think about.)
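To avoid both the extra leading space and the about:four corner case in base R, one option is to restrict which characters may appear before the @ and pull the matches out with regmatches(). This is only a minimal sketch, not a full email validator: the character class, the sample string, and the names pattern and emails are my own assumptions. The dot in the domain is escaped so that it matches a literal dot rather than any character.

```r
# Illustrative input; the addresses are made up for this sketch.
myVector <- "one@domain.com and two@domain.com and three@domain.com. What about:four@domain.com and five@domain.com"

# Assumed set of allowed local-part characters: letters, digits, and . _ % + -
# (the hyphen is last in the bracket expression so it is taken literally).
pattern <- "[[:alnum:]._%+-]+@domain\\.com"

# gregexpr() finds the match positions; regmatches() turns them into substrings.
emails <- regmatches(myVector, gregexpr(pattern, myVector))[[1]]
print(emails)
```

Because ":" is not in the character class, the fourth match starts at "four" rather than "about:four". If an address must follow white space or start the string, prefixing the pattern with (?<!\\S) and passing perl = TRUE to gregexpr() is one way to express that; with this character class, about:four@domain.com would then be skipped entirely.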

Upvotes: 1

Pierre Lapointe

Reputation: 16277

Using grep and value = TRUE:

str1 <-"Long text with email addresses [email protected] and [email protected] throughout [email protected]"
str1 <-unlist(strsplit(str1, " ")) #split on spaces
grep("@domain.com", str1, value = TRUE)
#[1] "[email protected]" "[email protected]"
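One detail worth noting: in grep("@domain.com", ...) the dot is a regex metacharacter, so it would also match something like "@domainXcom". A hedged variant of the same split-then-filter idea uses fixed = TRUE to treat the domain as a literal string. The sample addresses and the name tokens below are made up for illustration.

```r
# Illustrative input with two addresses on the target domain and one elsewhere.
str1 <- "Long text with email addresses help@domain.com and info@domain.com throughout other@example.org"

# Split on single spaces, as in the approach above.
tokens <- unlist(strsplit(str1, " "))

# fixed = TRUE matches "@domain.com" literally, so "." is not a wildcard.
matches <- grep("@domain.com", tokens, value = TRUE, fixed = TRUE)
print(matches)
```

Tokens keep any attached punctuation (for example a trailing comma), so this works best on text where addresses are cleanly separated by spaces.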

Upvotes: 1

Nancy

Reputation: 4109

I tried to replicate your question with a small example by creating a single string that has a few emails included in it.

> foo = "[email protected] some filler text to use an [email protected] example for this 
[email protected] question [email protected] that OP has has asked"

> strsplit(foo, " ")
[[1]]
 [1] "[email protected]"       "some"                   "filler"                
 [4] "text"                   "to"                     "use"                   
 [7] "an"                     "[email protected]"       "example"               
[10] "for"                    "this\[email protected]" "question"              
[13] "[email protected]"       "that"                   "OP"                    
[16] "has"                    "has"                    "asked"

> strsplit(foo, " ")[[1]][grep("@gmail.com", strsplit(foo, " ")[[1]])]

[1] "[email protected]"       "[email protected]"       "this\[email protected]"
[4] "[email protected]"
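Note that the third result above still has "this\n" glued to the address, because the string contains a line break and the split is on a single space only. A small sketch of one way around that is to split on any run of white space instead. The sample string and the name tokens are my own, chosen just to reproduce the situation.

```r
# Illustrative string with a newline directly before the second address.
foo <- "first@gmail.com some filler text for this\nsecond@gmail.com question"

# "\\s+" splits on any run of white space (spaces, tabs, newlines).
tokens <- strsplit(foo, "\\s+")[[1]]

# Keep only the tokens containing the literal domain.
gmail <- tokens[grepl("@gmail.com", tokens, fixed = TRUE)]
print(gmail)
```

Storing the split result in tokens also avoids calling strsplit() twice on the same string.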

Upvotes: 1
