mario19088

Reputation: 101

Split string keeping spaces in R

I would like to build a table from raw text using readr::read_fwf. Its col_positions argument determines the column widths, which in my case can differ from file to file. The table always has 4 columns, and their widths are given by the first 4 words of a header string like the one below (the fifth word, sth, is not needed):

> text_for_column_width = "category    variable   description      value      sth"
> nchar("category    ")
[1] 12
> nchar("variable   ")
[1] 11
> nchar("description      ")
[1] 17
> nchar("value      ")
[1] 11

I want to extract the first 4 words while keeping their trailing spaces, so that category counts as 8 letters + 4 spaces = 12 characters, and then build a vector with the number of characters for each of the four names: c(12, 11, 17, 11). I tried strsplit with a single space as the split argument and then counting the resulting empty strings, but I believe there is a faster way using a proper regular expression.

Upvotes: 1

Views: 625

Answers (3)

Alvaro Morales

Reputation: 1925

You can also use this pattern:

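library(magrittr)  # provides %>%
# Splits on runs of whitespace, so the counts exclude the trailing spaces
stringr::str_split("category    variable   description      value      sth", "\\s+") %>%
  unlist() %>%
  purrr::map_int(nchar)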
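
If the trailing spaces need to be counted as well, one possible variation is to match each word together with the whitespace that follows it instead of splitting on the whitespace. This is only a base R sketch, not part of the original answer; the names pieces and col_widths are made up for illustration.

```r
# Base R sketch: match every word plus the whitespace after it.
text_for_column_width <- "category    variable   description      value      sth"

# "\\S+\\s*" = a run of non-whitespace chars, then any trailing whitespace
pieces <- regmatches(
  text_for_column_width,
  gregexpr("\\S+\\s*", text_for_column_width, perl = TRUE)
)[[1]]

# Keep the first four pieces and count their characters (spaces included)
col_widths <- nchar(head(pieces, 4))
col_widths
## => [1] 12 11 17 11
```

Taking head(pieces, 4) drops the fifth word without hard-coding its text.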

Upvotes: 0

Wiktor Stribiżew

Reputation: 626709

You can use utils::strcapture:

text_for_column_width = "category    variable   description      value      sth"
pattern <- "^(\\S+\\s+)(\\S+\\s+)(\\S+\\s+)(\\S+\\s*)"
result <- utils::strcapture(pattern, text_for_column_width, list(f1 = character(), f2 = character(), f3 = character(), f4 = character()))
nchar(as.character(as.vector(result[1,])))
## => [1] 12 11 17 11

See the regex demo. The pattern ^(\S+\s+)(\S+\s+)(\S+\s+)(\S+\s*) matches:

  • ^ - start of string
  • (\S+\s+) - Group 1: one or more non-whitespace chars and then one or more whitespaces
  • (\S+\s+) - Group 2: one or more non-whitespace chars and then one or more whitespaces
  • (\S+\s+) - Group 3: one or more non-whitespace chars and then one or more whitespaces
  • (\S+\s*) - Group 4: one or more non-whitespace chars and then zero or more whitespaces
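
Since the first three groups are identical, the pattern could also be assembled programmatically when the number of columns is a parameter. The following is a rough sketch under that assumption: build_width_pattern is a hypothetical helper, not something from utils, and it uses regexec/regmatches in place of strcapture to pull out the groups.

```r
# Hypothetical helper: n - 1 groups of "word + spaces", then a final group
# whose trailing whitespace is optional.
build_width_pattern <- function(n) {
  paste0("^", strrep("(\\S+\\s+)", n - 1), "(\\S+\\s*)")
}

text_for_column_width <- "category    variable   description      value      sth"
pattern <- build_width_pattern(4)

# regexec() returns the full match followed by the capture groups
m <- regmatches(
  text_for_column_width,
  regexec(pattern, text_for_column_width, perl = TRUE)
)[[1]]

# Drop the full match (first element) and count characters per group
col_widths <- nchar(m[-1])
col_widths
## => [1] 12 11 17 11
```

With n = 4 the helper produces exactly the pattern shown above.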

Upvotes: 1

PaulS

Reputation: 25323

A possible solution, using stringr:

library(tidyverse)

text_for_column_width = "category    variable   description      value      sth"

strings <- text_for_column_width %>% 
  str_remove("sth$") %>% 
  str_split("(?<=\\s)(?=\\S)") %>% 
  unlist

strings

#> [1] "category    "      "variable   "       "description      "
#> [4] "value      "

strings %>% str_count

#> [1] 12 11 17 11
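
To illustrate what the width vector is for, here is a small sketch that feeds it to a fixed-width reader. It uses base utils::read.fwf as a stand-in so that it runs without extra packages; with readr, the same vector would presumably be passed as col_positions = fwf_widths(widths, col_names). The data rows are invented purely for the example.

```r
# Widths as computed above, plus the four column names
widths    <- c(12, 11, 17, 11)
col_names <- c("category", "variable", "description", "value")

# Hypothetical data rows, left-padded to the same widths as the header;
# the final field plays the role of the unused fifth column
rows <- c(
  sprintf("%-12s%-11s%-17s%-11s%s", "fruit", "apple", "red and round", "10", "x"),
  sprintf("%-12s%-11s%-17s%-11s%s", "tool", "hammer", "steel head", "25", "y")
)

con <- textConnection(rows)
tbl <- utils::read.fwf(
  con,
  widths = widths,
  col.names = col_names,
  strip.white = TRUE,        # trim the padding from each field
  stringsAsFactors = FALSE
)
close(con)

tbl
```

Characters beyond the summed widths are ignored, so the fifth field is dropped without being named.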

Upvotes: 4
