Reputation: 385
I have a character vector in which I'd like to match a specific string and collapse each element containing that match with only the element that follows it, repeating the process until the end of the vector. Here is one example:
'"FundSponsor:Blackrock Advisors" "Category:" "Tax-Free Income-Pennsylvania" "Ticker:" "MPA" "NAV Ticker:" "XMPAX" "Average Daily Volume (shares):" "26,000" "Average Daily Volume (USD):" "$0.335M" "Inception Date:" "10/30/1992" "Inception Share Price:" "$15.00" "Inception NAV:" "$14.18" "Tender Offer:" "No" "Term:" "No"'
Combining each element containing a : with only the element following it would be great, but I've struggled with the paste function because it collapses the entire vector, based on the :, into one element, which is not the targeted solution I'm looking for.
Here's an example of what I'd like a portion of the revised output to look like:
"Inception Share Price:$15.00"
Upvotes: 0
Views: 721
Reputation: 5281
Here is something that might help: first split the string using strsplit, then bind the elements that belong together.
# split the string
vec <- unlist(strsplit(string, '(?=\")(?=\")', perl = TRUE))
vec <- vec[! vec %in% c(' ', '\"')]
# this is what vec looks like now
head(vec)
# [1] "FundSponsor:Blackrock Advisors" "Category:" "Tax-Free Income-Pennsylvania" "Ticker:" "MPA"
# [6] "NAV Ticker:"
#
# now paste the elements
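ind <- grepl(':.+',vec)   # TRUE for elements that already hold "key:value"
tmp <- vec[!ind]          # the remaining elements alternate between key and value
# paste each key onto its value; c() avoids recycling the pairs into vec[!ind]
# (elements that were already complete end up first)
vec <- c(vec[ind], paste0(tmp[seq(1,length(tmp),2)], tmp[seq(2,length(tmp),2)]))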
head(vec)
# [1] "FundSponsor:Blackrock Advisors" "Category:Tax-Free Income-Pennsylvania" "Ticker:MPA" "NAV Ticker:XMPAX"
# [5] "Average Daily Volume (shares):26,000" "Average Daily Volume (USD):$0.335M"
with the data
string = "\"FundSponsor:Blackrock Advisors\" \"Category:\" \"Tax-Free Income-Pennsylvania\" \"Ticker:\" \"MPA\" \"NAV Ticker:\" \"XMPAX\" \"Average Daily Volume (shares):\" \"26,000\" \"Average Daily Volume (USD):\" \"$0.335M\" \"Inception Date:\" \"10/30/1992\" \"Inception Share Price:\" \"$15.00\" \"Inception NAV:\" \"$14.18\" \"Tender Offer:\" \"No\" \"Term:\" \"No\""
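Since the question starts from a character vector rather than a single string, the pairing step can also be sketched directly on the vector while keeping the original order. This is only a sketch under stated assumptions: every element ending in : is followed by exactly one value element, and no two such keys are adjacent. The input is a shortened toy version of the data, and the names is_key, is_val and merged are made up for illustration.

```r
# Toy input in the same shape as the question's vector (shortened).
vec <- c("FundSponsor:Blackrock Advisors", "Category:",
         "Tax-Free Income-Pennsylvania", "Ticker:", "MPA",
         "Inception Share Price:", "$15.00")

# A bare key ends in a colon; the element right after it is its value.
# Assumption: the last element is never a bare key.
is_key <- grepl(":$", vec)
is_val <- c(FALSE, head(is_key, -1))

# Paste each bare key onto its value, then drop the value slots.
merged <- vec
merged[is_key] <- paste0(vec[is_key], vec[is_val])
merged <- merged[!is_val]

print(merged)
```

Elements that already contain both key and value, such as the first one, are left untouched and stay in place.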
Explanation
The regex (?=\")(?=\") basically tells R to split the string at every \". The syntax (?=*something*) is a lookahead: it matches a position that is followed by *something* without consuming it. So the above simply reads: split the string at every position that precedes a \" (the second lookahead repeats the first condition).
Because the \" itself is never consumed, strsplit(...) above creates elements of the form \" and ' ' ('\"Category:\" \"...' becomes the vector '\"';'Category:';'\"';' ';'...'). So by using ! vec %in% c(...) we remove those unwanted elements.
Addendum
If elements of the form "string:" followed by a " " (a blank value) are present, remove the line vec <- vec[! vec %in% c(' ', '\"')] from the above code and add the lines
vec <- vec[seq(2L, length(vec), 4L)]
vec[vec == ' '] <- NA_character_
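To check the tokenisation on something small, here is a sketch on a shortened input whose last value is blank. The string s and the names tokens and content are made up for illustration. It applies the same lookahead pattern and then the addendum's every-fourth-element selection, assuming the input has exactly the quote, text, quote, space layout of the data above.

```r
# Short input in the same quoted format as `string`; the last value is blank.
s <- "\"Category:\" \"Tax-Free\" \"Ticker:\" \" \""

# Same pattern as in the answer: split at every position followed by a quote.
tokens <- unlist(strsplit(s, '(?=\")(?=\")', perl = TRUE))
print(tokens)

# Addendum variant: the text between each pair of quotes sits at
# positions 2, 6, 10, ...; blank values are then marked as NA.
content <- tokens[seq(2L, length(tokens), 4L)]
content[content == ' '] <- NA_character_
print(content)
```

Selecting by position rather than filtering out every ' ' is what lets a blank value survive as its own element, so keys and values stay aligned.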
Upvotes: 0
Reputation: 1035
I am not sure whether you want each element in a single key: value format, or whether you just want to clean up that long string into the form key1: value1 key2: value2 key3: value3. If it is the latter, you can achieve it with the following code:
char = '"FundSponsor:Blackrock Advisors" "Category:" "Tax-Free Income-Pennsylvania" "Ticker:" "MPA" "NAV Ticker:" "XMPAX" "Average Daily Volume (shares):" "26,000" "Average Daily Volume (USD):" "$0.335M" "Inception Date:" "10/30/1992" "Inception Share Price:" "$15.00" "Inception NAV:" "$14.18" "Tender Offer:" "No" "Term:" "No"'
char_tidy = gsub('\\" \\"', " ", char)
# output is below
> char_tidy
[1] "\"FundSponsor:Blackrock Advisors Category: Tax-Free Income-Pennsylvania Ticker: MPA NAV Ticker: XMPAX Average Daily Volume (shares): 26,000 Average Daily Volume (USD): $0.335M Inception Date: 10/30/1992 Inception Share Price: $15.00 Inception NAV: $14.18 Tender Offer: No Term: No\""
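The result above still carries the outer \" at both ends. A minimal sketch for dropping them, assuming the string starts and ends with a double quote (the input is shortened, and char_clean is a name introduced here for illustration):

```r
# Shortened version of the tidy string, still wrapped in double quotes.
char_tidy <- "\"FundSponsor:Blackrock Advisors Category: Tax-Free Income-Pennsylvania Ticker: MPA\""

# Remove a double quote at the very start or the very end of the string.
char_clean <- gsub('^"|"$', "", char_tidy)
print(char_clean)
```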
Upvotes: 0