Reputation: 365
I'm trying to separate a column in a rather untidy dataframe.
section
View 500
V458
453
And I want to create a new column from this. With the preferred output like below.
section section numbers
View 500
V 458
453
I've been trying to research it but I'm having a time with it. I can separate them in the case of the first row, because I can use regex like this.
df_split <- separate(df, col = section, into = c("section", "section_number"), sep = " +[1-9]")
But I can't seem to find a way to use an "or" type statement. If anyone has any input that would be wonderful.
Upvotes: 6
Views: 489
Reputation: 886948
Here is an option using base R
with read.csv
and sub
. We capture the numbers at the end as a group ((\\d+)$
) and replace with a comma and the backreference for the group (\\1
) in the sub
, and read it with read.csv
read.csv(text=sub("\\s*(\\d+)$", ",\\1", df1$section), fill=TRUE, header=FALSE,
col.names = c("section", "section number"), stringsAsFactors=FALSE)
# section section.number
#1 View 500
#2 V 458
#3 453
Upvotes: 1
Reputation: 23101
with stringr
(assuming the original df
has just one column named section
):
library(stringr)
df_split <- as.data.frame(str_match(df$section, "([A-Za-z]*)\\s*([0-9]*)")[,2:3])
names(df_split) <- c('section', 'section numbers')
df_split
# section section numbers
#1 View 500
#2 V 458
#3 453
Upvotes: 0
Reputation: 19544
You can use tidyr
for this:
tidyr::extract(df,section, c("section", "section number"),
regex="([[:alpha:]]*)[[:space:]]*([[:digit:]]*)")
section section number
1 View 500
2 V 458
3 453
Upvotes: 5
Reputation: 37879
Using a simple gsub
would be a choice for me:
section <- c('View 500', 'V458', '453')
cbind(section = trimws(gsub('[0-9]', '', section)),
section_numbers = trimws(gsub('[a-zA-Z]', '', section)))
I use trimws
to just remove any unwanted white spaces.
Output:
section section_numbers
[1,] "View" "500"
[2,] "V" "458"
[3,] "" "453"
Upvotes: 7
Reputation: 214927
You can use extract
which also comes from the tidyr
package, with which you can specify the capture group, make them optional here and it is pretty flexible to handle different cases:
library(tidyr)
df %>% extract(section, into = c("alpha", "numeric"), regex = "([a-zA-Z]+)?\\s?(\\d+)?")
# alpha numeric
#1 View 500
#2 V 458
#3 <NA> 453
Upvotes: 3