sevpants
sevpants

Reputation: 365

Split columns by number in a dataframe

I'm trying to separate a column in a rather untidy dataframe.

section
View 500
V458
453

And I want to create a new column from this. With the preferred output like below.

section  section numbers  
View     500
V        458
         453

I've been trying to research it but I'm having a time with it. I can separate them in the case of the first row, because I can use regex like this.

df_split <- separate(df, col = section, into = c("section", "section_number"), sep = " +[1-9]")

But I can't seem to find a way to use an "or" type statement. If anyone has any input that would be wonderful.

Upvotes: 6

Views: 489

Answers (5)

akrun
akrun

Reputation: 886948

Here is an option using base R with read.csv and sub. We capture the numbers at the end as a group ((\\d+)$) and replace with a comma and the backreference for the group (\\1) in the sub, and read it with read.csv

read.csv(text=sub("\\s*(\\d+)$", ",\\1", df1$section), fill=TRUE, header=FALSE, 
         col.names = c("section", "section number"), stringsAsFactors=FALSE)
#   section section.number
#1    View            500
#2       V            458
#3                    453

Upvotes: 1

Sandipan Dey
Sandipan Dey

Reputation: 23101

with stringr (assuming the original df has just one column named section):

library(stringr)
df_split <- as.data.frame(str_match(df$section, "([A-Za-z]*)\\s*([0-9]*)")[,2:3])
names(df_split) <- c('section', 'section numbers')
df_split

#  section section numbers
#1    View             500
#2       V             458
#3                     453

Upvotes: 0

HubertL
HubertL

Reputation: 19544

You can use tidyr for this:

tidyr::extract(df,section, c("section", "section number"), 
               regex="([[:alpha:]]*)[[:space:]]*([[:digit:]]*)")
  section section number
1    View            500
2       V            458
3                    453

Upvotes: 5

LyzandeR
LyzandeR

Reputation: 37879

Using a simple gsub would be a choice for me:

section <- c('View 500', 'V458', '453')

cbind(section = trimws(gsub('[0-9]', '', section)), 
      section_numbers = trimws(gsub('[a-zA-Z]', '', section)))

I use trimws to just remove any unwanted white spaces.

Output:

    section section_numbers
[1,] "View"  "500"          
[2,] "V"     "458"          
[3,] ""      "453" 

Upvotes: 7

akuiper
akuiper

Reputation: 214927

You can use extract which also comes from the tidyr package, with which you can specify the capture group, make them optional here and it is pretty flexible to handle different cases:

library(tidyr)
df %>% extract(section, into = c("alpha", "numeric"), regex = "([a-zA-Z]+)?\\s?(\\d+)?")

#  alpha numeric
#1  View     500
#2     V     458
#3  <NA>     453

Upvotes: 3

Related Questions