Reputation: 13123
I have a vector of character data. Most of the elements in the vector consist of one or more letters followed by one or more numbers. I wish to split each element in the vector into the character portion and the number portion. I found a similar question on Stackoverflow.com here:
split a character from a number with multiple digits
However, the answer given above does not seem to work completely in my case or I am doing something wrong. An example vector is below:
my.data <- c("aaa", "b11", "b21", "b101", "b111", "ccc1", "ddd1", "ccc20", "ddd13")
# I can obtain the number portion using:
gsub("[^[:digit:]]", "", my.data)
# However, I cannot obtain the character portion using:
gsub("[:digit:]", "", my.data)
How can I obtain the character portion? I am using R version 2.14.1 on a Windows 7 64-bit machine.
Upvotes: 36
Views: 62809
Reputation: 2370
If the split pieces should be joined back into a single space-separated string:
var <- "foo123 bar1987"
paste(strsplit(var, "(?=[A-Za-z])(?<=[0-9])|(?=[0-9])(?<=[A-Za-z])", perl=TRUE)[[1]], collapse = ' ')
Result:
"foo 123 bar 1987"
Or for a vectorized version where you want to reassign to a data frame:
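# stringsAsFactors = FALSE keeps text as character; strsplit() rejects factors (the default before R 4.0)
df = data.frame(text=c("foo121", "131bar foo1516"), stringsAsFactors = FALSE)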
res = strsplit(df$text, "(?=[A-Za-z])(?<=[0-9])|(?=[0-9])(?<=[A-Za-z])", perl=TRUE)
df$res = sapply(res, paste, collapse=" ")
Result:
            text              res
1         foo121          foo 121
2 131bar foo1516 131 bar foo 1516
Upvotes: 0
Reputation: 1
mydata.nub <- gsub("\\D", "", my.data)
mydata.text <- gsub("\\d", "", my.data)
This works well: the first call keeps only the digits and the second keeps only the text, even when digits appear between the letters.
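To illustrate the claim about digits sitting between letters, here is a small sketch. The input vector `x` and the result names are made up for this example and are not part of the original question.

```r
# Hypothetical inputs with digits in the middle of the text
x <- c("a1b22c", "foo123bar")

# \\D matches any non-digit, so removing it leaves only the digits
digits.only <- gsub("\\D", "", x)

# \\d matches any digit, so removing it leaves only the text
text.only <- gsub("\\d", "", x)

print(digits.only)
print(text.only)
```

Note that both calls concatenate whatever remains, so the position of the digits within the string is lost.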
Upvotes: 0
Reputation: 1037
Since none of the previous answers use tidyr::separate, here it goes:
library(tidyr)
df <- data.frame(mycol = c("APPLE348744", "BANANA77845", "OATS2647892", "EGG98586456"))
df %>%
  separate(mycol,
           into = c("text", "num"),
           sep = "(?<=[A-Za-z])(?=[0-9])"
  )
Upvotes: 35
Reputation: 522171
Late answer, but another option is to use strsplit with a regex pattern which uses lookarounds to find the boundary between numbers and letters:
var <- "ABC123"
strsplit(var, "(?=[A-Za-z])(?<=[0-9])|(?=[0-9])(?<=[A-Za-z])", perl=TRUE)
[[1]]
[1] "ABC" "123"
The above pattern will match (but not consume) when either the previous character is a letter and the following character is a number, or vice versa. Note that we use strsplit in Perl mode (perl=TRUE) to access lookarounds.
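Applied to a whole vector, the same pattern yields one character vector per element, which can then be unpacked into a letter part and a number part. The following is only a sketch: it assumes each element is letters optionally followed by digits, and the names `pattern`, `parts`, `chars` and `nums` are chosen for illustration.

```r
my.data <- c("aaa", "b11", "b21", "ccc20", "ddd13")

# Same lookaround pattern as above: a zero-width boundary between letters and digits
pattern <- "(?=[A-Za-z])(?<=[0-9])|(?=[0-9])(?<=[A-Za-z])"
parts <- strsplit(my.data, pattern, perl = TRUE)

# First piece is the letter part; the second piece is the digit part,
# which is NA for elements such as "aaa" that contain no digits
chars <- vapply(parts, function(p) p[1], character(1))
nums <- as.numeric(vapply(parts, function(p) p[2], character(1)))

print(chars)
print(nums)
```

Elements that start with digits would put the number in the first piece instead, so this unpacking step would need adjusting for such data.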
Upvotes: 12
Reputation: 18691
You can also use colsplit from reshape2 to split your vector into character and digit columns in one step:
library(reshape2)
colsplit(my.data, "(?<=\\p{L})(?=[\\d+$])", c("char", "digit"))
Result:
  char digit
1  aaa    NA
2    b    11
3    b    21
4    b   101
5    b   111
6  ccc     1
7  ddd     1
8  ccc    20
9  ddd    13
Data:
my.data <- c("aaa", "b11", "b21", "b101", "b111", "ccc1", "ddd1", "ccc20", "ddd13")
Upvotes: 1
Reputation: 3298
A slightly more elegant way (without any external packages):
> x = c("aaa", "b11", "b21", "b101", "b111", "ccc1", "ddd1", "ccc20", "ddd13")
> gsub('\\D','', x) # replaces non-digits with empty strings
[1] "" "11" "21" "101" "111" "1" "1" "20" "13"
> gsub('\\d','', x) # replaces digits with empty strings
[1] "aaa" "b" "b" "b" "b" "ccc" "ddd" "ccc" "ddd"
Upvotes: 6
Reputation: 42313
With stringr, if you like (and slightly different from the answer to the other question):
# load library
library(stringr)
#
# load data
my.data <- c("aaa", "b11", "b21", "b101", "b111", "ccc1", "ddd1", "ccc20", "ddd13")
#
# extract numbers only
my.data.num <- as.numeric(str_extract(my.data, "[0-9]+"))
#
# check output
my.data.num
[1] NA 11 21 101 111 1 1 20 13
#
# extract letters only ([A-Za-z] avoids the punctuation that lies between "Z" and "a" in an A-z style range)
my.data.cha <- str_extract(my.data, "[A-Za-z]+")
#
# check output
my.data.cha
[1] "aaa" "b" "b" "b" "b" "ccc" "ddd" "ccc" "ddd"
Upvotes: 19
Reputation: 56935
For your regex you have to use:
gsub("[[:digit:]]","",my.data)
The [:digit:] character class only makes sense inside a set of [].
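As a quick check of the bracket rule, the sketch below runs both corrected patterns against the vector from the question. The names `chars` and `nums` are introduced here for illustration only.

```r
my.data <- c("aaa", "b11", "b21", "b101", "b111", "ccc1", "ddd1", "ccc20", "ddd13")

# The outer [] opens a bracket expression; the inner [:digit:] names the POSIX class
chars <- gsub("[[:digit:]]", "", my.data)

# A leading ^ inside the outer brackets negates the class, leaving only the digits
nums <- gsub("[^[:digit:]]", "", my.data)

# Without the outer brackets, "[:digit:]" is (with R's default regex engine, as far as
# I know) read as a set of the literal characters : d i g t, which is why the original
# attempt stripped letters such as "d" rather than the digits.

print(chars)
print(nums)
```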
Upvotes: 25