Henry Navarro

Reputation: 953

Removing repeated characters in strings

This question may be related to this question.

Unfortunately the solution given there doesn't work with my data.

I have the following vector example:

example <- c("ChildrenChildren",
             "Clothing and shoesClothing and shoes",
             "Education, health and beautyEducation, health and beauty",
             "Leisure activities, travelingLeisure activities, traveling",
             "LoansLoans",
             "Loans and financial servicesLoans and financial services",
             "Personal transfersPersonal transfers",
             "Savings and investmentsSavings and investments",
             "TransportationTransportation",
             "Utility servicesUtility services")

I want the same strings without the repetition, that is:

  > result
 [1]   "Children" "Clothing and shoes" "Education, health and beauty"

Is that possible?

Upvotes: 4

Views: 89

Answers (3)

NelsonGon

Reputation: 13319

We could try:

stringr::str_remove_all(example,"[a-z].*[A-Z]")

Result:

[1] "Children"                      "Clothing and shoes"            "Education, health and beauty" 
 [4] "Leisure activities, traveling" "Loans"                         "Loans and financial services" 
 [7] "Personal transfers"            "Savings and investments"       "Transportation"               
[10] "Utility services"  
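
A caveat worth checking before relying on this pattern: it removes everything from the first lowercase letter up to the last uppercase letter, so it appears to work only when each string has a single capital, at its start. The sketch below uses base R's gsub as a stand-in that should behave the same as the stringr call for this pattern. The helper name strip_middle, the use of perl = TRUE, and the "New York" input are assumptions made for illustration, not part of the answer.

```r
# Hypothetical helper: base-R stand-in for
# stringr::str_remove_all(x, "[a-z].*[A-Z]").
# perl = TRUE is an assumption made here for predictable character classes.
strip_middle <- function(x) gsub("[a-z].*[A-Z]", "", x, perl = TRUE)

# One capital per copy: the span "oansL" is removed, leaving the single copy.
strip_middle("LoansLoans")
# [1] "Loans"

# Two capitals per copy (made-up input): the removed span runs from the
# first "e" to the last "Y", so the result is mangled.
strip_middle("New YorkNew York")
# [1] "Nork"
```

If the real data can contain capitals after the first character, one of the other answers is probably a safer choice.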

Upvotes: 3

Spacedman

Reputation: 94192

If all the strings are repeated, then they are twice as long as they need to be, so take the first half of each string:

> substr(example, 1, nchar(example)/2)
 [1] "Children"                      "Clothing and shoes"           
 [3] "Education, health and beauty"  "Leisure activities, traveling"
 [5] "Loans"                         "Loans and financial services" 
 [7] "Personal transfers"            "Savings and investments"      
 [9] "Transportation"                "Utility services"             
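
If some strings might not be repeated, a guarded variant of the same idea could compare the two halves first and leave anything else alone. This is only a sketch: dedupe_half is a hypothetical helper name, and the mixed input vector is made up for illustration.

```r
# Hypothetical helper: halve a string only when it is exactly its first
# half repeated; otherwise return it unchanged.
dedupe_half <- function(x) {
  half   <- nchar(x) %/% 2
  first  <- substr(x, 1, half)
  second <- substr(x, half + 1, nchar(x))
  # Odd-length strings can never be an exact doubling, so they are kept as-is
  ifelse(nchar(x) %% 2 == 0 & first == second, first, x)
}

dedupe_half(c("LoansLoans", "Loans", "Utility servicesUtility services"))
# [1] "Loans"            "Loans"            "Utility services"
```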

Upvotes: 5

Cath

Reputation: 24074

You can use sub for that, directly capturing the bit you want in the pattern part:

sub("(.+)\\1", "\\1", example)
 #[1] "Children"                      "Clothing and shoes"            "Education, health and beauty"  "Leisure activities, traveling" "Loans"                        
 #[6] "Loans and financial services"  "Personal transfers"            "Savings and investments"       "Transportation"                "Utility services"

(.+) captures a group of one or more characters and \\1 is a backreference to whatever was just captured, so the pattern matches "anything twice", and the replacement puts back the same "anything" just once.
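
One thing to watch: without anchors, the pattern also matches a doubled substring inside a string that is not itself repeated. A possible variant, assuming every entry should either be an exact doubling or be left untouched, anchors the match to the whole string. The helper name dedupe_exact, the use of perl = TRUE, and the "Balloon" input are assumptions for illustration.

```r
# Hypothetical helper: collapse a string only when the WHOLE string is
# some text immediately followed by the same text.
dedupe_exact <- function(x) sub("^(.+)\\1$", "\\1", x, perl = TRUE)

dedupe_exact(c("LoansLoans", "Balloon"))
# [1] "Loans"   "Balloon"

# For comparison, the unanchored pattern collapses the first doubled
# substring it finds (here "ll"):
sub("(.+)\\1", "\\1", "Balloon", perl = TRUE)
# [1] "Baloon"
```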

Upvotes: 10
