Reputation: 65
This is my current dataset:
c("Jetstar","Qantas", "QantasLink","RegionalExpress","TigerairAustralia",
"VirginAustralia","VirginAustraliaRegionalAirlines","AllAirlines",
"Qantas-allQFdesignatedservices","VirginAustralia-allVAdesignatedservices")
I want to insert a space between the words in each airline name.
For this I tried this code:
airlines$airline <- gsub("([[:lower:]]) ([[:upper:]])", "\\1 \\2", airlines$airline)
But I got the text in the same format as before.
My desired output is as below:
Upvotes: 1
Views: 430
Reputation: 263331
txt <- c("Jetstar","Qantas", "QantasLink","RegionalExpress","TigerairAustralia",
"VirginAustralia","VirginAustraliaRegionalAirlines","AllAirlines",
"Qantas-allQFdesignatedservices","VirginAustralia-allVAdesignatedservices")
You need two different sorts of rules: one for the spaces at the case changes and another for the recurring words ("all", "designated", "services") and symbols ("-"). Start with a pattern that identifies a lowercase character followed by an uppercase character (each matched with a character class such as "[a-z]" or "[A-Z]"), wrap each of the two in its own capture group (created with flanking parentheses around that part of the pattern), and insert a space between the two groups in the replacement. See the Details section of ?regex for a quick description of character classes and capture groups:
gsub("([a-z])([A-Z])", "\\1 \\2", txt)
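As a quick check of this first step, the sketch below compares the pattern from the question, which has a literal space between the two groups, with the pattern above. The names `small`, `attempt` and `step1` are only for illustration and are not part of the original code.

```r
small <- c("QantasLink", "AllAirlines", "Qantas-allQFdesignatedservices")

# Pattern from the question: the literal space between the two groups would
# also have to be present in the input, so nothing matches and the text
# comes back unchanged.
attempt <- gsub("([[:lower:]]) ([[:upper:]])", "\\1 \\2", small)
identical(attempt, small)
#> [1] TRUE

# Without that space, every lowercase-then-uppercase pair is split.
step1 <- gsub("([a-z])([A-Z])", "\\1 \\2", small)
step1
#> [1] "Qantas Link"  "All Airlines"  "Qantas-all QFdesignatedservices"
```

The last element shows why a second rule is needed: "QF" followed by "designated" is an uppercase-then-lowercase boundary, which this pattern does not touch.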
You then pass that result to a second gsub call that adds a space before each of the recurring words or symbols that you also want separated:
gsub("(-|all|designated|services)", " \\1", # second pattern and sub for "specials"
gsub("([a-z])([A-Z])", "\\1 \\2", txt)) #first pattern and sub for case changes
[1] "Jetstar"
[2] "Qantas"
[3] "Qantas Link"
[4] "Regional Express"
[5] "Tigerair Australia"
[6] "Virgin Australia"
[7] "Virgin Australia Regional Airlines"
[8] "All Airlines"
[9] "Qantas - all QF designated services"
[10] "Virgin Australia - all VA designated services"
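If the two steps are needed in more than one place, they could be wrapped in a small function. This is only a sketch: `split_names` and its `specials` argument are hypothetical names, and the default tokens are an assumption taken from this particular data set.

```r
# Hypothetical helper: split at case changes, then put a space before each
# "special" token. The default tokens come from this data set only.
split_names <- function(x, specials = c("-", "all", "designated", "services")) {
  # Step 1: lowercase followed by uppercase gets a space in between.
  step1 <- gsub("([a-z])([A-Z])", "\\1 \\2", x)
  # Step 2: build "(-|all|designated|services)" and prefix each match with a space.
  pattern <- paste0("(", paste(specials, collapse = "|"), ")")
  gsub(pattern, " \\1", step1)
}

res <- split_names(c("VirginAustraliaRegionalAirlines",
                     "VirginAustralia-allVAdesignatedservices"))
res
#> [1] "Virgin Australia Regional Airlines"
#> [2] "Virgin Australia - all VA designated services"
```

The tokens are matched anywhere in the string, so a name that merely contains one of them (for example "all" inside "Wallaby") would also be split. For other data the token list would need adjusting.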
I see that someone upvoted my earlier answer to Splitting CamelCase in R which was similar, but this one had a few more wrinkles to iron out.
Upvotes: 3
Reputation: 325
I have tried to figure it out and I have come up with something:
library(stringr)
data_vec<- c("Jetstar","Qantas", "QantasLink","RegionalExpress","TigerairAustralia",
"VirginAustralia","VirginAustraliaRegionalAirlines","AllAirlines",
"Qantas-allQFdesignatedservices","VirginAustralia-allVAdesignatedservices")
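# The lookbehind (?<=...) is only supported by the PCRE engine, so perl = TRUE is required
str_trim(gsub("(?<=[A-Z]{2})([a-z]{1})", " \\1", gsub("([A-Z]{1,2})", " \\1", data_vec), perl = TRUE))
I hope this helps.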
Upvotes: 1
Reputation: 1464
This could (almost) do the trick:
gsub("([A-Z])", " \\1", airlines$airline)
Borrowed from: splitting-camelcase-in-r
Of course names like Qantas-allQFd… will still pose a problem because of the two consecutive uppercase letters ("QF") in the second part of the string.
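To make that concrete, here is a small sketch of what the single-capital pattern does to such a name, next to one possible variant that only splits where the capital follows a lowercase letter. The variant uses a lookbehind and therefore `perl = TRUE`; it is an illustration, not part of the borrowed answer, and the variable names are made up.

```r
x <- c("QantasLink", "Qantas-allQFdesignatedservices")

# A space before every capital: adds a leading space and splits "QF".
every_cap <- gsub("([A-Z])", " \\1", x)
every_cap
#> [1] " Qantas Link"  " Qantas-all Q Fdesignatedservices"

# Variant: only split when the capital is preceded by a lowercase letter.
# (?<=[a-z]) is a lookbehind, which needs the PCRE engine (perl = TRUE).
after_lower <- gsub("(?<=[a-z])([A-Z])", " \\1", x, perl = TRUE)
after_lower
#> [1] "Qantas Link"  "Qantas-all QFdesignatedservices"
```

The variant keeps "QF" together and avoids the leading space, but "designated" and "services" are still glued on, so those words would need a separate rule.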
Upvotes: 1