Reputation: 47
I have a variable that usually has some gibberish like:
\n\t\n\t\n\t\n\t\tSeuat eselyt\n\t\t\t\t\t\n\t\t\tti 30.07.2019 klo 12:00 - 14:30\n\t\t\t\t\t\t\tTau ski 2342342 2342342\n\t\t\t\t\t\n\t\n
I am trying to extract the date (30.07.2019) and the time range (12:00 - 14:30). I am not very good with parsing, so some help implementing this in R would be appreciated.
Upvotes: 3
Views: 92
Reputation: 13319
A somewhat lengthy, step-by-step base/stringr approach:
tst <- "\n\t\n\t\n\t\n\t\tSeuat eselyt\n\t\t\t\t\t\n\t\t\tti 30.07.2019 klo 12:00 - 14:30\n\t\t\t\t\t\t\tTau ski 2342342 2342342\n\t\t\t\t\t\n\t\n"
# Drop all newlines and tabs.
cleaner <- gsub("\\n|\\t", "", tst)
# Split at whitespace that precedes a lowercase letter.
split_txt <- strsplit(cleaner, "\\s(?=[a-z])", perl = TRUE)
# Look for a dd.mm.yyyy date in each piece.
dates <- stringr::str_extract_all(unlist(split_txt),
                                  "\\d{1,}\\.\\d{2,}\\.\\d{4}")
# Strip the letters, then keep whatever surrounds a hyphen (the time range).
times <- stringr::str_extract_all(stringr::str_remove_all(unlist(split_txt),
                                                          "[A-Za-z]"), ".*\\-.*")
dates[lengths(dates) > 0]
[[1]]
[1] "30.07.2019"
trimws(times[lengths(times) > 0])
[1] "12:00 - 14:30"
Upvotes: 1
Reputation:
This pattern matches the date:
(\d{1,2}[\.\/]){2}((\d{4})|(\d{2}))
This one matches the time range:
\d{1,2}:\d{2}\s?-\s?\d{1,2}:\d{2}
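Inside an R string literal each backslash has to be doubled. A minimal base R sketch of applying the two patterns to the sample string, assuming the input sits in a variable called `x` (that name, like `date_pattern` and `time_pattern`, is only illustrative):

```r
# Sample input from the question; `x` is an illustrative variable name.
x <- "\n\t\n\t\n\t\n\t\tSeuat eselyt\n\t\t\t\t\t\n\t\t\tti 30.07.2019 klo 12:00 - 14:30\n\t\t\t\t\t\t\tTau ski 2342342 2342342\n\t\t\t\t\t\n\t\n"

# The two patterns from this answer, with backslashes doubled for R.
date_pattern <- "(\\d{1,2}[\\.\\/]){2}((\\d{4})|(\\d{2}))"
time_pattern <- "\\d{1,2}:\\d{2}\\s?-\\s?\\d{1,2}:\\d{2}"

# regexpr() locates the first match; regmatches() returns the matched text.
date_found <- regmatches(x, regexpr(date_pattern, x, perl = TRUE))
time_found <- regmatches(x, regexpr(time_pattern, x, perl = TRUE))

print(date_found)  # "30.07.2019"
print(time_found)  # "12:00 - 14:30"
```

If a string contains no match, `regmatches()` returns a zero-length character vector, so it is worth checking `length()` before using the result.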
Upvotes: 1
Reputation: 56159
String split, then extract date and times:
x <- "\n\t\n\t\n\t\n\t\tSeuat eselyt\n\t\t\t\t\t\n\t\t\tti 30.07.2019 klo 12:00 - 14:30\n\t\t\t\t\t\t\tTau ski 2342342 2342342\n\t\t\t\t\t\n\t\n"
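lapply(strsplit(x, "[\n\t ]"), function(i){
  # Escape the dots so they match literal periods, and require a 4-digit year.
  dd <- i[ grepl("[0-9]{2}\\.[0-9]{2}\\.[0-9]{4}", i) ]
  # Tokens that look like hh:mm are the start and end times.
  tt <- i[ grepl("[0-9]{2}:[0-9]{2}", i) ]
  c(dd, paste(tt, collapse = "-"))
})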
# [[1]]
# [1] "30.07.2019" "12:00-14:30"
Upvotes: 1
Reputation: 428
If you can rely on the date and time occurring only once in each string, you can extract them with regular expressions (here using a data frame). Note that str_extract returns only the first match:
library(tidyverse)
data <-
tibble(gibberish_string = "\n\t\n\t\n\t\n\t\tSeuat eselyt\n\t\t\t\t\t\n\t\t\tti 30.07.2019 klo 12:00 - 14:30\n\t\t\t\t\t\t\tTau ski 2342342 2342342\n\t\t\t\t\t\n\t\n")
data %>% mutate(date = str_extract(gibberish_string,
                                   pattern = "\\d{1,2}\\.\\d{1,2}\\.\\d{4}"),
                # Match the whole range (start - end), not just the start time.
                time = str_extract(gibberish_string,
                                   pattern = "\\d{1,2}:\\d{2}\\s*-\\s*\\d{1,2}:\\d{2}"))
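If you would rather avoid the tidyverse dependency, the same idea can be sketched in base R with a single pattern and capture groups. This is only a sketch under the same assumption (one date followed by one time range per string); the names `x`, `pattern`, `parts` and `parsed_date` are illustrative, and the final `as.Date()` step is an optional extra that is not part of the answer above:

```r
# Sample input from the question; `x` is an illustrative variable name.
x <- "\n\t\n\t\n\t\n\t\tSeuat eselyt\n\t\t\t\t\t\n\t\t\tti 30.07.2019 klo 12:00 - 14:30\n\t\t\t\t\t\t\tTau ski 2342342 2342342\n\t\t\t\t\t\n\t\n"

# Group 1: date, group 2: start time, group 3: end time.
# \\D+ skips the non-digit text (" klo ") between the date and the start time.
pattern <- "(\\d{1,2}\\.\\d{1,2}\\.\\d{4})\\D+(\\d{1,2}:\\d{2})\\s*-\\s*(\\d{1,2}:\\d{2})"

# regexec() reports the positions of the full match and of each group;
# regmatches() turns them into text. Element 1 is the full match.
m <- regmatches(x, regexec(pattern, x, perl = TRUE))[[1]]
parts <- c(date = m[2], start = m[3], end = m[4])
print(parts)

# Optional: turn the extracted text into a Date object.
parsed_date <- as.Date(parts[["date"]], format = "%d.%m.%Y")
print(parsed_date)
```

When the pattern does not match, `m` is empty and the elements of `parts` come out as `NA`, which makes missing values easy to spot.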
Upvotes: 2