Reputation: 723
I would like to split my text eight tokens (words or numbers) after each time that appears in it.
Example of the text:
s <- 'random random random 19:49 0-2 H 2 ABC TREE LAKE #88 TURTLE random random 03:32 43-21 V 8 XYZ DOG LOG #72 FIRE random random random'
Example of how I would like the text to be split.
'random random random 19:49 0-2 H 2 ABC TREE LAKE #88 TURTLE
random random 03:32 43-21 V 8 XYZ DOG LOG #72 FIRE
random random random'
I know I can find the time multiple ways such as
str_extract(str_extract(s, "[:digit:]*:"), "[:digit:]*")
But I am unsure how to split eight words and numbers after the time. Any help would be greatly appreciated.
Upvotes: 0
Views: 409
Reputation: 4204
s = 'random random random 19:49 0-2 H 2 ABC 19:49 LAKE #88 TURTLE random random 03:32 43-21 V 8 XYZ LOG #72 FIRE random random random'
splitted = strsplit(s, ' ')[[1]]
# [1] "random" "random" "random" "19:49" "0-2" "H" "2" "ABC" "19:49" "LAKE" "#88"
# [12] "TURTLE" "random" "random" "03:32" "43-21" "V" "8" "XYZ" "LOG" "#72" "FIRE"
# [23] "random" "random" "random"
# find two digits + colon + two digits, `^` means begin of string, `$` means end of string
where_time = which( grepl('^\\d{2}:\\d{2}$', splitted) )
# 4 9 15
where_to_break = where_time + 8
# 12 17 23
# if time2 is between time1 and the break of time1, don't break for time2
for (ii in seq_along(where_time)){
if(is.na(where_time[ii])){
next
}
between = where_time[ii] < where_time & where_time < where_to_break[ii]
where_time[between] = NA
}
where_time = where_time[!is.na(where_time)]
where_to_break = where_time + 8
# 12 23
# if a planned break is after the end of text, it's unnecessary
where_to_break = where_to_break[ where_to_break < length(splitted) ]
# 12 23
s2 = vector('character', length(where_to_break)+1)
# recombine line 1
s2[1] = paste(splitted[ 1:where_to_break[1] ], collapse = ' ')
# last line: start one token after the final break
s2[(length(s2))] = paste(splitted[ (where_to_break[length(where_to_break)]+1):length(splitted) ], collapse = ' ')
# other lines: start one token after the previous break so no token is repeated
if (length(s2) > 2) for (ii in 2:(length(s2)-1)){
s2[ii] = paste(splitted[ (where_to_break[ii-1]+1):where_to_break[ii] ], collapse = ' ')
}
# recombine lines
s3 = paste(s2, collapse = '\n')
cat(s3)
# random random random 19:49 0-2 H 2 ABC 19:49 LAKE #88 TURTLE
# random random 03:32 43-21 V 8 XYZ LOG #72 FIRE random
# random random
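The same token-based idea can be sketched more compactly as a single pass that remembers where the current window ends. This is only a sketch under stated assumptions: `split_after_time` and its `n` argument are hypothetical names that do not appear in the answer above, tokens are assumed to be separated by single spaces, and a time is assumed to look like `HH:MM`.

```r
# Hypothetical helper (not from the answer above): break after the n-th token
# that follows a time, ignoring times that fall inside an already open window.
split_after_time <- function(s, n = 8) {
  tokens  <- strsplit(s, " ", fixed = TRUE)[[1]]
  is_time <- grepl("^\\d{2}:\\d{2}$", tokens)

  breaks     <- integer(0)  # index of the last token on each line
  window_end <- 0           # end of the window opened by the latest accepted time
  for (i in which(is_time)) {
    if (i > window_end) {   # this time is not inside a previous window
      window_end <- i + n
      breaks <- c(breaks, window_end)
    }
  }
  breaks <- breaks[breaks < length(tokens)]  # no break needed at or past the end

  # a token starts a new line when the token just before it is a break
  starts  <- (seq_along(tokens) - 1) %in% breaks
  line_id <- cumsum(starts) + 1
  unname(vapply(split(tokens, line_id), paste, character(1), collapse = " "))
}

s <- 'random random random 19:49 0-2 H 2 ABC 19:49 LAKE #88 TURTLE random random 03:32 43-21 V 8 XYZ LOG #72 FIRE random random random'
lines <- split_after_time(s)
cat(lines, sep = "\n")
```

Returning a character vector rather than one string joined with newlines leaves the caller free to print it with `cat` or process the lines further.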
Upvotes: 0
Reputation: 887118
We can match the colon followed by 2 digits (\\:\\d{2}), then 8 instances of one or more spaces (\\s+) followed by one or more non-spaces (\\S+), replace the space that follows this match with a comma, and then split on that delimiter.
s <- 'random random random 19:49 0-2 H 2 ABC TREE LAKE #88 TURTLE random random 03:32 43-21 V 8 XYZ DOG LOG #72 FIRE random random random'
strsplit(gsub('((?:\\:\\d{2}(\\s+\\S+){8}))\\s', '\\1,',
s, perl=TRUE), ',')[[1]]
#[1] "random random random 19:49 0-2 H 2 ABC TREE LAKE #88 TURTLE"
#[2] "random random 03:32 43-21 V 8 XYZ DOG LOG #72 FIRE"
#[3] "random random random"
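If the text may itself contain commas, splitting on a comma would also cut at those. One possible variation, given here only as a sketch, wraps the same pattern in a function and uses a newline as the temporary delimiter instead. The function name `split_after_time_regex` and the arguments `n` and `sep` are hypothetical, and the sketch assumes that `sep` never occurs in the input and contains no backslashes.

```r
# Hypothetical wrapper around the gsub/strsplit idea above.
split_after_time_regex <- function(s, n = 8, sep = "\n") {
  # ':' + two digits, then n whitespace-separated tokens, then one whitespace
  pattern <- sprintf("(:\\d{2}(?:\\s+\\S+){%d})\\s", n)
  # keep the matched text (\\1) and swap the trailing whitespace for sep
  marked <- gsub(pattern, paste0("\\1", sep), s, perl = TRUE)
  strsplit(marked, sep, fixed = TRUE)[[1]]
}

s <- 'random random random 19:49 0-2 H 2 ABC TREE LAKE #88 TURTLE random random 03:32 43-21 V 8 XYZ DOG LOG #72 FIRE random random random'
split_after_time_regex(s)
```

Like the original pattern, this does not check for a second time inside the eight-token window; it simply counts tokens after the first match.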
Upvotes: 6
Reputation: 15784
Approach with a for loop to manage the different cases (I hope I commented enough, feel free to ask if there's something unclear):
s <- 'random random random 19:49 0-2 H 2 ABC TREE LAKE #88 TURTLE random random 03:32 43-21 V 8 XYZ LOG #72 FIRE random random random'
as <- strsplit(s," ")[[1]] # split the string on spaces to get the words
nwords <- length(as) # count them (reused below)
timepos <- c(grep('\\d+:\\d+',as),nwords) # positions of the times, plus nwords as a sentinel for the last line
start <- 1 # initialize start position
lines <- vector('list',length(timepos)) # preallocate the list to avoid growing it in the loop
for (i in seq_along(timepos)) { # loop over the lines we need
  end <- timepos[i]+8 # the line ends 8 words after the time
  if (end > nwords) end <- nwords # sanity check: never run past the last word
  lines[[i]] <- paste0(as[start:end],collapse=" ") # make the line
  start <- end+1 # update the start of the next line
  if (start > nwords) break # no words left, stop
}
result <- unlist(lines) # unlist drops any unused (NULL) slots left by an early break
Output:
[1] "random random random 19:49 0-2 H 2 ABC TREE LAKE #88 TURTLE"
[2] "random random 03:32 43-21 V 8 XYZ LOG #72 FIRE random"
[3] "random random"
Upvotes: 1