cowboy

Reputation: 661

Split a string in R into rows and columns

I have a data set that looks similar to the sample below:

rows <- c('70150 Markers, Times, Places    72588 Times, Places, Things',
          '51256 Items, Shelves, Cats    99201 Widget, Places, Locations')

I need to split the strings to create useful features. My expected output would be similar to:

Code        Item
70150       Markers, Times, Places
72588       Times, Places, Things
51256       Items, Shelves, Cats
99201       Widget, Places, Locations

I tried using

library(tidyverse)

rows <- c('70150 Markers, Times, Places    72588 Times, Places, Things',
          '51256 Items, Shelves, Cats    99201 Widget, Places, Locations')

rows %>% parse_number

to get the number, but that only gets the first numeric value in the string.

Any suggestions on how to accomplish what I am trying to do?

Upvotes: 3

Views: 2327

Answers (5)

Eyayaw

Reputation: 1081

regextract <- function(x, pattern, perl = TRUE, invert = FALSE, ...) {
  m <- gregexpr(pattern, x, perl = perl, ...) # match results
  unlist(regmatches(x, m, invert = invert))
}

txt <- unlist(strsplit(rows, "\\s{2,}"))
patterns <- c(Code = "(\\d+)", Item = "([[:alpha:],\\s]+)")
out <- lapply(patterns, regextract, x = txt)
out <- lapply(out, trimws)
out <- do.call(cbind, out)

out 

     Code    Item                       
[1,] "70150" "Markers, Times, Places"   
[2,] "72588" "Times, Places, Things"    
[3,] "51256" "Items, Shelves, Cats"     
[4,] "99201" "Widget, Places, Locations"

Upvotes: 1

A5C1D2H2I1M1N2O1R2T1

Reputation: 193527

An alternative in base R is to use strcapture. You specify the pattern to identify columns and the prototype object that the split values should be inserted into. Since you have multiple values per vector element, you need to split that first (by multiple spaces).

pattern <- "([[:digit:]]+) (.*)"
proto <- data.frame(code = integer(), item = character())
strcapture(pattern, unlist(strsplit(rows, "\\s{2,}")), proto)
#    code                      item
# 1 70150    Markers, Times, Places
# 2 72588     Times, Places, Things
# 3 51256      Items, Shelves, Cats
# 4 99201 Widget, Places, Locations
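
One consequence of the prototype that may be worth checking: strcapture coerces each captured group to the type of the matching proto column, so code should come back as an integer rather than a character string. A minimal sketch, assuming the same rows as in the question (the names pieces and res are only for illustration):

```r
rows <- c('70150 Markers, Times, Places    72588 Times, Places, Things',
          '51256 Items, Shelves, Cats    99201 Widget, Places, Locations')

# One element per "code item" pair: split on runs of two or more whitespace characters
pieces <- unlist(strsplit(rows, "\\s{2,}"))

# The prototype fixes both the column names and the column types
proto <- data.frame(code = integer(), item = character(), stringsAsFactors = FALSE)
res <- strcapture("([[:digit:]]+) (.*)", pieces, proto)

# Inspect the column types of the result
str(res)
```

If the codes can have leading zeros that matter, using code = character() in the prototype would keep them as text instead.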

Upvotes: 1

Ronak Shah

Reputation: 388982

You can split each string in rows on two or more whitespace characters, then use str_match from stringr to capture the information in two groups: the number part and the remaining part of the string.

new_rows <- unlist(strsplit(rows, '\\s{2,}'))
stringr::str_match(new_rows, "(\\d+)\\s*(.*)")[, -1]

#        [,1]    [,2]                       
#[1,] "70150" "Markers, Times, Places"   
#[2,] "72588" "Times, Places, Things"    
#[3,] "51256" "Items, Shelves, Cats"     
#[4,] "99201" "Widget, Places, Locations"

This returns a matrix. You can convert it to a data frame and assign proper column names if needed.
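
The conversion could look roughly like the sketch below. To keep it runnable without extra packages, it builds the matrix with base regmatches() and regexec() as a stand-in for str_match (both put the full match in the first column and the groups after it). The data frame steps are the same for the str_match result. The names m and df are only for illustration.

```r
rows <- c('70150 Markers, Times, Places    72588 Times, Places, Things',
          '51256 Items, Shelves, Cats    99201 Widget, Places, Locations')
new_rows <- unlist(strsplit(rows, "\\s{2,}"))

# Base-R stand-in for stringr::str_match(): drop column 1 (the full match)
m <- do.call(rbind, regmatches(new_rows, regexec("(\\d+)\\s*(.*)", new_rows)))[, -1]

# Matrix -> data frame, then give the columns meaningful names
df <- as.data.frame(m, stringsAsFactors = FALSE)
names(df) <- c("Code", "Item")

# Optional: store the codes as integers instead of strings
df$Code <- as.integer(df$Code)
df
```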

Upvotes: 1

Roman Luštrik

Reputation: 70643

If you're inclined to use base R, here's one way of doing it.

rows <- c('70150 Markers, Times, Places    72588 Times, Places, Things',
          '51256 Items, Shelves, Cats    99201 Widget, Places, Locations')

rows <- strsplit(rows, "   ")
rows <- sapply(rows, FUN = trimws, simplify = FALSE)
rows <- unlist(rows)

ptn <- "^(\\d+) (.*)$"
data.frame(Code = gsub(ptn, replacement = "\\1", x = rows),
           Item = gsub(ptn, replacement = "\\2", x = rows))

   Code                      Item
1 70150    Markers, Times, Places
2 72588     Times, Places, Things
3 51256      Items, Shelves, Cats
4 99201 Widget, Places, Locations

Upvotes: 2

akrun

Reputation: 887118

We could use separate_rows to split the column at the whitespace that comes right before a digit, then use separate to split each row into two columns at the first separator (extra = 'merge' keeps everything after it together in Item).

library(dplyr)
library(tidyr)
tibble(col1 = rows) %>%
     separate_rows(col1, sep="\\s+(?=[0-9])") %>%
     separate(col1, into = c("Code", "Item"), extra = 'merge')
# A tibble: 4 x 2
#  Code  Item                     
#  <chr> <chr>                    
#1 70150 Markers, Times, Places   
#2 72588 Times, Places, Things    
#3 51256 Items, Shelves, Cats     
#4 99201 Widget, Places, Locations
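
To see what the lookahead in "\\s+(?=[0-9])" does on its own, the same split can be tried in base R. This is only a sketch with illustrative names (pieces, out), not a replacement for the tidyr pipeline. The lookahead needs perl = TRUE in strsplit.

```r
rows <- c('70150 Markers, Times, Places    72588 Times, Places, Things',
          '51256 Items, Shelves, Cats    99201 Widget, Places, Locations')

# Split only at whitespace that is immediately followed by a digit;
# the space between a code and its first word is followed by a letter, so it is kept
pieces <- unlist(strsplit(rows, "\\s+(?=[0-9])", perl = TRUE))

# Cut each piece at its first whitespace: code before it, item after it
out <- data.frame(Code = sub("\\s.*$", "", pieces),
                  Item = sub("^\\S+\\s+", "", pieces),
                  stringsAsFactors = FALSE)
out
```

Because the lookahead does not consume the digit, the code stays attached to the piece that follows the split.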

Upvotes: 3
