Reputation: 635
I have converted a PDF to a text file, and the resulting data set is structured as follows:
data=c("Paris21London3Tokyo51San Francisco38")
And I would like to obtain the following structure:
matrix(c("Paris","London","Tokyo","San Francisco",21,3,51,38),4,2)
Does anyone have a method to do this? Thanks.
Upvotes: 2
Views: 141
Reputation: 545588
Regular expressions are the right tool here but, contrary to what the other answer suggests, strsplit is not well suited for the job.
It is better to extract regular expression matches, using two separate expressions for words and numbers:
words = '[a-zA-Z ]+'
numbers = '[+-]?\\d+(\\.\\d+)?'
word_matches = gregexpr(words, data)
number_matches = gregexpr(numbers, data)
result = cbind(regmatches(data, word_matches)[[1]],
               regmatches(data, number_matches)[[1]])
This recognises any number with an optional decimal point, and an optional sign. It does not recognise numbers in scientific (exponential) notation. This can be trivially added, if necessary.
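If exponents are needed, one possible extension is sketched below. It is an illustration under assumptions, not part of the answer above: the sample string data_sci and the names sci_numbers, labels, values and result_sci are made up for the example. Because the letter-based words pattern would also match the e of an exponent, the sketch takes the labels as whatever lies between the number matches (regmatches with invert = TRUE) instead.

```r
# Hypothetical input containing exponents; not from the original question
data_sci <- "Paris2.1e1London3Tokyo5.1E+1San Francisco38"

# The number pattern from above, plus an optional exponent part
sci_numbers <- '[+-]?\\d+(\\.\\d+)?([eE][+-]?\\d+)?'
m <- gregexpr(sci_numbers, data_sci)

# The matched substrings, converted to numeric values
values <- as.numeric(regmatches(data_sci, m)[[1]])

# Everything between the number matches; drop any empty pieces
labels <- regmatches(data_sci, m, invert = TRUE)[[1]]
labels <- labels[nzchar(labels)]

result_sci <- data.frame(name = labels, value = values)
```

One caveat: a name that begins with e or E followed by a digit would be absorbed into the preceding number as an exponent.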
Upvotes: 1
Reputation: 887118
You could try strsplit with regex lookahead and lookbehind:
v1 <- strsplit(data, '(?<=[^0-9])(?=[0-9])|(?<=[0-9])(?=[^0-9])',
               perl=TRUE)[[1]]
indx <- c(TRUE, FALSE)
data.frame(Col1= v1[indx], Col2=v1[!indx])
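To see why the indexing works, it may help to look at the intermediate vector. The pattern matches the empty position between a non-digit and a digit (or the reverse), so the pieces alternate between names and numbers, and the logical vector c(TRUE, FALSE) is recycled to select every other element. A small sketch, using the data string from the question:

```r
data <- "Paris21London3Tokyo51San Francisco38"

# Split at every boundary between a digit and a non-digit
v1 <- strsplit(data, '(?<=[^0-9])(?=[0-9])|(?<=[0-9])(?=[^0-9])',
               perl = TRUE)[[1]]
v1
# "Paris" "21" "London" "3" "Tokyo" "51" "San Francisco" "38"

indx <- c(TRUE, FALSE)  # recycled along v1: TRUE, FALSE, TRUE, FALSE, ...
v1[indx]                # odd positions: the names
v1[!indx]               # even positions: the numbers, still as character strings
```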
To handle decimal numbers as well:
data1=c("Paris21.53London3Tokyo51San Francisco38.2")
v2 <- strsplit(data1, '(?<=[^0-9.])(?=[0-9])|(?<=[0-9])(?=[^0-9.])',
               perl=TRUE)[[1]]
indx <- c(TRUE, FALSE)
data.frame(Col1= v2[indx], Col2=v2[!indx])
#           Col1  Col2
#1         Paris 21.53
#2        London     3
#3         Tokyo    51
#4 San Francisco  38.2
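Note that strsplit returns character strings, so Col2 holds text rather than numbers. If numeric values are needed, one option is to convert with as.numeric, sketched here with the decimal example; the name df is only illustrative.

```r
data1 <- "Paris21.53London3Tokyo51San Francisco38.2"
v2 <- strsplit(data1, '(?<=[^0-9.])(?=[0-9])|(?<=[0-9])(?=[^0-9.])',
               perl = TRUE)[[1]]
indx <- c(TRUE, FALSE)

# Keep the names as character and convert the numbers explicitly
df <- data.frame(Col1 = v2[indx],
                 Col2 = as.numeric(v2[!indx]),
                 stringsAsFactors = FALSE)
str(df)
```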
Upvotes: 4