prestono

Reputation: 78

Splitting character object using vector of delimiters

I have a large number of text files. Each file is stored as an observation in a dataframe. Each observation contains multiple fields so there is some structure in each object. I'm looking to split each based on the structured information within each file.

Data is currently in the following structure (simplified):

a <- c("Name: John Doe  Age: 50  Address Please give full address 22 Main Street, New York")
b <- c("Name: Jane Bloggs  Age: 42  Address Please give full address 1 Lower Street, London")

df <- data.frame(rawtext = c(a,b))

I'd like to split each observation into individual variable columns. It should end up looking like this:

Name          Age      Address
John Doe      50       22 Main Street, New York
Jane Bloggs   42       1 Lower Street, London

Since each text object is structured, I thought this could be done fairly simply with a pre-defined vector of delimiters. I tried stringr's str_split(), but it doesn't treat a vector as a set of delimiters to split on, e.g.

delims <- c("Name:", "Age", "Address Please give full address")
str_split(df$rawtext, delims)

I'm perhaps trying to oversimplify here. The only other approach I can think of is to loop through each observation and extract all text after delims[1] and before delims[2] (and so on) for all fields.

e.g. the following bodge would get me the name field based on the delimiters:

sub(paste0(".*", delims[1]), "", df$rawtext[1]) %>% sub(paste0(delims[2], ".*"), "", .)

[1] " John Doe  "

This feels extremely inefficient. Is there a better way that I'm missing?

Upvotes: 1

Views: 118

Answers (1)

Darren Tsai

Reputation: 35554

A tidyverse solution:

library(tidyverse)
delims <- c("Name", "Age", "Address Please give full address")

df %>%
  mutate(rawtext = str_remove_all(rawtext, ":")) %>% 
  separate(rawtext, c("x", delims), sep = paste(delims, collapse = "|"), convert = TRUE) %>%
  mutate(across(where(is.character), str_squish), x = NULL)

# # A tibble: 2 x 3
#   Name          Age `Address Please give full address`
#   <chr>       <dbl> <chr>                             
# 1 John Doe       50 22 Main Street, New York          
# 2 Jane Bloggs    42 1 Lower Street, London

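Note: convert = TRUE in separate() converts Age from character to numeric, ignoring leading/trailing whitespace. Also be aware that str_remove_all(rawtext, ":") removes every colon in the text, including any that appear inside a field value.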
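
For comparison, here is a rough base R sketch of the same idea: build one regex from the delimiter vector and split on it. The helper name split_by_delims and the short column names are my own, not from the question. It assumes every string contains all the labels in the same order, that the labels contain no regex metacharacters, and that no label text (e.g. "Age") also occurs inside a field value.

```r
# Hypothetical helper, not part of the original answer.
split_by_delims <- function(x, delims, col_names = delims) {
  # One alternation pattern; ":?" also consumes an optional colon after a label.
  # Assumes the labels contain no regex metacharacters.
  pattern <- paste0("(", paste(delims, collapse = "|"), "):?")
  pieces <- strsplit(x, pattern)
  # Each result starts with the (empty) text before the first label; drop it
  # and trim the surrounding whitespace from the remaining fields.
  fields <- lapply(pieces, function(p) trimws(p[-1]))
  out <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
  names(out) <- col_names
  out
}

a <- "Name: John Doe  Age: 50  Address Please give full address 22 Main Street, New York"
b <- "Name: Jane Bloggs  Age: 42  Address Please give full address 1 Lower Street, London"
df <- data.frame(rawtext = c(a, b))

delims <- c("Name", "Age", "Address Please give full address")
res <- split_by_delims(as.character(df$rawtext), delims,
                       col_names = c("Name", "Age", "Address"))
res$Age <- as.integer(res$Age)  # fields come back as character
res
```

Because the optional colon is consumed as part of the delimiter, colons inside the address are left alone. If the labels might contain characters such as "." or "(", they would need escaping before being pasted into the pattern.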

Upvotes: 1
