Maurice O'Neill
Maurice O'Neill

Reputation: 13

How to separate a column in R with unequal character lengths and without separators

I want to separate a column which contains dates and items into two columns.

V1
23/2/2000shampoo
24/2/2000flour
21/10/2000poultry
17/4/2001laundry detergent

To this

V1           V2
23/2/2000    shampoo
24/2/2000    flour
21/10/2000   poultry
17/4/2001    laundry detergent

My problem is that there's no separation between the two. The date length isn't uniform (it's in the format of 1/1/2000 instead of 01/01/2000) so I can't separate by character length. The dataset also covers multiple years.

Upvotes: 1

Views: 198

Answers (3)

Andrew
Andrew

Reputation: 5138

You could also use capture groups with tidyr::extract(). The first group \\d{1,2}/\\d{1,2}/\\d{4} get the date in the format you posted, and the second group [[:print:]]+ grabs at least one printable character.

library(tidyverse)

df1 %>%
  extract(V1, c("V1", "V2"), "(\\d{1,2}/\\d{1,2}/\\d{4})([[:print:]]+)")
          V1                V2
1  23/2/2000           shampoo
2  24/2/2000             flour
3 21/10/2000           poultry
4  17/4/2001 laundry detergent

Data:

df1 <- readr::read_csv("V1
23/2/2000shampoo
24/2/2000flour
21/10/2000poultry
17/4/2001laundry detergent")

Upvotes: 2

Gainz
Gainz

Reputation: 1771

You can also use :

data <- data.frame(V1 = c("23-02-2000shampoo", "24-02-2001flour"))

library(stringr)
str_split_fixed(data$V1, "(?<=[0-9])(?=[a-z])", 2)

     [,1]         [,2]     
[1,] "23-02-2000" "shampoo"
[2,] "24-02-2001" "flour"  

Upvotes: 1

akrun
akrun

Reputation: 887118

One option would be separate from tidyr. We specify the sep with a regex lookaround to split between digit and a lower case letter

library(dplyr)
library(tidyr)
df1 %>%
   separate(V1, into = c("V1", "V2"), sep="(?<=[0-9])(?=[a-z])")
#        V1                V2
#1  23/2/2000           shampoo
#2  24/2/2000             flour
#3 21/10/2000           poultry
#4  17/4/2001 laundry detergent

Or with read.csv after creating a delimiter with sub

read.csv(text = sub("(\\d)([a-z])", "\\1,\\2", df1$V1), 
         header = FALSE, stringsAsFactors = FALSE)

data

df1 <- structure(list(V1 = c("23/2/2000shampoo", "24/2/2000flour", 
      "21/10/2000poultry", 
"17/4/2001laundry detergent")), class = "data.frame", row.names = c(NA, 
-4L))

Upvotes: 4

Related Questions