Cyrus Mohammadian

Reputation: 5193

Remove html tags from vector when space between tags and text varies in R

I have the following vector:

vec<-c("\n\t\t\t\n\t\t\t\n\t\t\t\t8900 E Runstack Rd \n\t\t\t\n\t\t\t\n\t\t\t\n\t\t\tScottsdale,  AZ  \n\t\t\t\t\t85251\n\t\t\t"                              , 
"\n\t\t\t\n\t\t\t\n\t\t\t\t330 Orange Boulevard\n\t\t\t\n\t\t\t\n\t\t\t\n\t\t\tBeverly Hills,  CA  \n\t\t\t\t\t90212\n\t\t\t"                              , 
"\n\t\t\t\n\t\t\t\n\t\t\t\t645 Newport Center Drive \n\t\t\t\n\t\t\t\n\t\t\t\n\t\t\tNewport Beach,  CA  \n\t\t\t\t\t92660\n\t\t\t"                              , 
"\n\t\t\t\n\t\t\t\n\t\t\t\t5000 Westlake Depot Road \n\t\t\t\n\t\t\t\n\t\t\t\n\t\t\tPalo Alto,  CA  \n\t\t\t\t\t94304\n\t\t\t"                              , 
"\n\t\t\t\n\t\t\t\n\t\t\t\t646 Lucern Road\n\t\t\t\n\t\t\t\n\t\t\t\n\t\t\tSan Diego,  CA  \n\t\t\t\t\t92108\n\t\t\t"                              
)

I would like to remove all the \n and \t. I tried the following:

str_replace_all(vec, "\n|\t", " ")
[1] "             8900 E Runstack Rd                 Scottsdale,  AZ        85251    "         
[2] "             330 Orange Boulevard                Beverly Hills,  CA        90212    "     
[3] "             645 Newport Center Drive                 Newport Beach,  CA        92660    "
[4] "             5000 Westlake Depot Road                 Palo Alto,  CA        94304    "    
[5] "             646 Lucern Road                San Diego,  CA        92108    " 

But that replaced each one with a space, leaving long runs of spaces. I tried this:

str_replace_all(vec, "\n|\t", "")
[1] "8900 E Runstack Rd Scottsdale,  AZ  85251"          "330 Orange BoulevardBeverly Hills,  CA  90212"     
[3] "645 Newport Center Drive Newport Beach,  CA  92660" "5000 Westlake Depot Road Palo Alto,  CA  94304"    
[5] "646 Lucern RoadSan Diego,  CA  92108" 

But note that in some instances there is no whitespace where there should be (such as index 2, "330 Orange BoulevardBeverly Hills, CA 90212"). The problem arises because in some elements the \n directly follows the text, while in others a space precedes it. How can I replace \n with a space only when it immediately follows a letter, and remove it in all other cases? I'm looking for the following result:

[1] "8900 E Runstack Rd Scottsdale,  AZ  85251"          "330 Orange Boulevard Beverly Hills,  CA  90212"     
[3] "645 Newport Center Drive Newport Beach,  CA  92660" "5000 Westlake Depot Road Palo Alto,  CA  94304"    
[5] "646 Lucern Road San Diego,  CA  92108" 

I can achieve the above using str_squish(vec) after having run str_replace_all(vec, "\n|\t", " "), but I would like a single-line solution.

Upvotes: 1

Views: 56

Answers (2)

Sada93

Reputation: 2835

A single line is possible, but it is more complex and less readable.

gsub("^[\n\t]+([0-9a-zA-Z ,]+)[\n\t]+([a-zA-Z ,]+)[\n\t]+([0-9]{5})[\n\t]+$", "\\1 \\2 \\3", vec)

Here we take advantage of the fact that each address follows the pattern:

  1. Street Address
  2. City, State
  3. 5-digit postal code
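
To make the three parts visible, here is a minimal sketch that extracts the capture groups for one element instead of substituting them. It assumes the same pattern with the character classes written as `[\n\t]`; the names `x`, `pattern` and `parts` are only illustrative. It also shows that a captured group keeps any trailing spaces from the input, so trimming each part before joining may be worthwhile.

```r
# One element copied from the question's vector
x <- "\n\t\t\t\n\t\t\t\n\t\t\t\t330 Orange Boulevard\n\t\t\t\n\t\t\t\n\t\t\t\n\t\t\tBeverly Hills,  CA  \n\t\t\t\t\t90212\n\t\t\t"

# Street, "City, State" and ZIP, each separated by runs of newlines/tabs
pattern <- "^[\n\t]+([0-9a-zA-Z ,]+)[\n\t]+([a-zA-Z ,]+)[\n\t]+([0-9]{5})[\n\t]+$"

# regexec() returns the full match followed by the capture groups;
# drop the first entry to keep only the three groups
parts <- regmatches(x, regexec(pattern, x))[[1]][-1]
parts
# [1] "330 Orange Boulevard" "Beverly Hills,  CA  " "90212"

# Trim each part, then join with single spaces
joined <- paste(trimws(parts), collapse = " ")
joined
# [1] "330 Orange Boulevard Beverly Hills,  CA 90212"
```

Note that the second group carries two trailing spaces, which is why the plain "\\1 \\2 \\3" replacement can leave extra spaces before the postal code.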

Upvotes: 1

NelsonGon

Reputation: 13319

Try:

stringr::str_remove_all(vec, "[\n\t]")

Result (this can be assigned back to your data):

[1] "8900 E Runstack Rd Scottsdale,  AZ  85251"         
[2] "330 Orange BoulevardBeverly Hills,  CA  90212"     
[3] "645 Newport Center Drive Newport Beach,  CA  92660"
[4] "5000 Westlake Depot Road Palo Alto,  CA  94304"    
[5] "646 Lucern RoadSan Diego,  CA  92108" 

Per @Sada93's comment, we lose a space in the second element (and in the fifth). This is admittedly not the best way to reintroduce the space, but here it is:

gsub("BoulevardBeverly", "Boulevard Beverly", vec1)  # vec1 is the result of the transformation above

Another way to reintroduce spaces, just for illustration:

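# Strip newlines and tabs
vec1 <- stringr::str_replace_all(vec, "[\n\t]", "")
# Strip every remaining space
vec2 <- stringr::str_remove_all(vec1, " ")
# Re-insert a space wherever a digit is directly followed by a letter
gsub("([0-9])([a-zA-Z])", "\\1 \\2", vec2)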
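
As an alternative that does not depend on the specific street names, here is a base R sketch, not taken from the answers above, of two ways to handle the missing space. The first collapses every whitespace run to a single space, which also shortens the double spaces around the state. The second turns a newline/tab run into a space only when non-whitespace sits directly on both sides and deletes the remaining runs, which keeps the original spacing. The names `collapsed`, `bridged` and `kept` are only illustrative.

```r
# Two elements copied from the question: one with a space after the
# street address, one without
vec <- c(
  "\n\t\t\t\n\t\t\t\n\t\t\t\t8900 E Runstack Rd \n\t\t\t\n\t\t\t\n\t\t\t\n\t\t\tScottsdale,  AZ  \n\t\t\t\t\t85251\n\t\t\t",
  "\n\t\t\t\n\t\t\t\n\t\t\t\t330 Orange Boulevard\n\t\t\t\n\t\t\t\n\t\t\t\n\t\t\tBeverly Hills,  CA  \n\t\t\t\t\t90212\n\t\t\t"
)

# Option 1: collapse each whitespace run to one space, then trim the ends
collapsed <- trimws(gsub("[[:space:]]+", " ", vec))
collapsed
# [1] "8900 E Runstack Rd Scottsdale, AZ 85251"
# [2] "330 Orange Boulevard Beverly Hills, CA 90212"

# Option 2: a run of \n/\t with non-whitespace on both sides becomes a
# space (lookarounds need perl = TRUE); all other runs are deleted
bridged <- gsub("(?<=\\S)[\n\t]+(?=\\S)", " ", vec, perl = TRUE)
kept <- gsub("[\n\t]+", "", bridged)
kept
# [1] "8900 E Runstack Rd Scottsdale,  AZ  85251"
# [2] "330 Orange Boulevard Beverly Hills,  CA  90212"
```

Option 2 nests into a single call if a one-liner is required: gsub("[\n\t]+", "", gsub("(?<=\\S)[\n\t]+(?=\\S)", " ", vec, perl = TRUE)).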

Upvotes: 0
