Reputation: 661
I have a data set that looks similar to the sample below:
rows <- c('70150 Markers, Times, Places      72588 Times, Places, Things',
          '51256 Items, Shelves, Cats      99201 Widget, Places, Locations')
I need to split the strings to create useful features. My expected output would be similar to:
Code Item
70150 Markers, Times, Places
72588 Times, Places, Things
51256 Items, Shelves, Cats
99201 Widget, Places, Locations
I tried using
library(tidyverse)
rows <- c('70150 Markers, Times, Places      72588 Times, Places, Things',
          '51256 Items, Shelves, Cats      99201 Widget, Places, Locations')
rows %>% parse_number
to get the number, but that only gets the first numeric value in the string.
Any suggestions on how to accomplish what I am trying to do?
Upvotes: 3
Views: 2327
Reputation: 1081
regextract <- function(x, pattern, perl = TRUE, invert = FALSE, ...) {
  m <- gregexpr(pattern, x, perl = perl, ...)  # match results
  unlist(regmatches(x, m, invert = invert))
}
txt <- unlist(strsplit(rows, "\\s{2,}"))
patterns <- c(Code = "(\\d+)", Item = "([[:alpha:],\\s]+)")
out <- lapply(patterns, regextract, x = txt)
out <- lapply(out, trimws)
out <- do.call(cbind, out)
out
Code Item
[1,] "70150" "Markers, Times, Places"
[2,] "72588" "Times, Places, Things"
[3,] "51256" "Items, Shelves, Cats"
[4,] "99201" "Widget, Places, Locations"
Upvotes: 1
Reputation: 193527
An alternative in base R is to use strcapture. You specify the pattern that identifies the columns and a prototype object that defines the names and types of the columns the captured values are inserted into. Since each vector element holds multiple records, you need to split the elements first (on runs of multiple spaces).
pattern <- "([[:digit:]]+) (.*)"
proto <- data.frame(code = integer(), item = character())
strcapture(pattern, unlist(strsplit(rows, "\\s{2,}")), proto)
# code item
# 1 70150 Markers, Times, Places
# 2 72588 Times, Places, Things
# 3 51256 Items, Shelves, Cats
# 4 99201 Widget, Places, Locations
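As a small check of what the prototype does, the sketch below reruns the same call on a self-contained copy of the sample data and inspects the result. The run of spaces between records in the sample strings is an assumption inferred from the split pattern, and `records` and `parsed` are just illustrative names.

```r
# Sample input; the run of spaces between records is assumed
rows <- c("70150 Markers, Times, Places      72588 Times, Places, Things",
          "51256 Items, Shelves, Cats      99201 Widget, Places, Locations")

# One element per "code + items" record
records <- unlist(strsplit(rows, "\\s{2,}"))

# The prototype fixes both the column names and the column types
proto <- data.frame(code = integer(), item = character())
parsed <- strcapture("([[:digit:]]+) (.*)", records, proto)

str(parsed)        # shows that code is stored as integer
parsed$code + 1L   # arithmetic works because code is not a string
```

Changing `code = integer()` to `code = character()` in the prototype should keep the codes as strings, which may be preferable if they can have leading zeros.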
Upvotes: 1
Reputation: 388982
You can split each string in rows on runs of two or more spaces and then use str_match from stringr to capture the information in two groups: the number part and the remaining part of the string.
new_rows <- unlist(strsplit(rows, '\\s{2,}'))
stringr::str_match(new_rows, "(\\d+)\\s*(.*)")[, -1]
# [,1] [,2]
#[1,] "70150" "Markers, Times, Places"
#[2,] "72588" "Times, Places, Things"
#[3,] "51256" "Items, Shelves, Cats"
#[4,] "99201" "Widget, Places, Locations"
This returns a matrix; you can convert it to a data frame and assign proper column names if needed.
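The conversion could look roughly like the sketch below. To keep it runnable without stringr, the matrix is rebuilt here with base R's `regexec`/`regmatches` as a stand-in for the `str_match` result; the names `m` and `df` and the choice to store `Code` as integer are assumptions for illustration.

```r
# Already-split records, as produced by the strsplit() step
new_rows <- c("70150 Markers, Times, Places", "72588 Times, Places, Things",
              "51256 Items, Shelves, Cats", "99201 Widget, Places, Locations")

# Base R stand-in for str_match(): full match plus two groups, then drop the full match
m <- do.call(rbind, regmatches(new_rows, regexec("(\\d+)\\s*(.*)", new_rows)))[, -1]

# Name the columns, convert to a data frame, and make Code numeric
colnames(m) <- c("Code", "Item")
df <- as.data.frame(m, stringsAsFactors = FALSE)
df$Code <- as.integer(df$Code)
df
```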
Upvotes: 1
Reputation: 70643
If you're inclined to use base R, here's one way of doing it.
rows <- c('70150 Markers, Times, Places      72588 Times, Places, Things',
          '51256 Items, Shelves, Cats      99201 Widget, Places, Locations')
# Split each string into records on runs of two or more spaces
rows <- strsplit(rows, " {2,}")
rows <- sapply(rows, FUN = trimws, simplify = FALSE)
rows <- unlist(rows)
# Capture the leading code and the remaining item list
ptn <- "^(\\d+) (.*)$"
data.frame(Code = gsub(ptn, replacement = "\\1", x = rows),
           Item = gsub(ptn, replacement = "\\2", x = rows))
Code Item
1 70150 Markers, Times, Places
2 72588 Times, Places, Things
3 51256 Items, Shelves, Cats
4 99201 Widget, Places, Locations
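One thing to keep in mind with this approach: when the pattern does not match, gsub returns the input unchanged, so a malformed piece would silently end up in both columns. A minimal sketch of a guard, using made-up input (`pieces` and its second element are hypothetical, not from the question):

```r
ptn <- "^(\\d+) (.*)$"

# Hypothetical input: the second element has no leading code
pieces <- c("70150 Markers, Times, Places", "Markers without a code")

# Check which pieces actually match before extracting the groups
ok <- grepl(ptn, pieces)

# Use NA for pieces that do not match instead of copying them through
code <- ifelse(ok, gsub(ptn, "\\1", pieces), NA_character_)
item <- ifelse(ok, gsub(ptn, "\\2", pieces), NA_character_)

data.frame(Code = code, Item = item)
```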
Upvotes: 2
Reputation: 887118
We could use separate_rows to split the column at the whitespace that precedes each run of digits, then separate to split each record into two columns at the first space.
library(dplyr)
library(tidyr)
tibble(col1 = rows) %>%
  separate_rows(col1, sep = "\\s+(?=[0-9])") %>%
  separate(col1, into = c("Code", "Item"), extra = 'merge')
# A tibble: 4 x 2
# Code Item
# <chr> <chr>
#1 70150 Markers, Times, Places
#2 72588 Times, Places, Things
#3 51256 Items, Shelves, Cats
#4 99201 Widget, Places, Locations
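The key piece here is the lookahead in `\\s+(?=[0-9])`: it matches whitespace only when a digit follows, so the split does not depend on how many spaces separate the records. As a rough illustration of the same regex outside tidyr, here is a base R sketch. The sample spacing is assumed (one row uses several spaces, the other a single space), and `perl = TRUE` is needed because lookaheads are a PCRE feature.

```r
# Sample input with two different separators between records (assumed spacing)
rows <- c("70150 Markers, Times, Places      72588 Times, Places, Things",
          "51256 Items, Shelves, Cats 99201 Widget, Places, Locations")

# Split on whitespace that is immediately followed by a digit
records <- unlist(strsplit(rows, "\\s+(?=[0-9])", perl = TRUE))
records
```

Note that this relies on the item names never starting with a digit; a record such as "3 Widgets" inside the item list would be split as well.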
Upvotes: 3