Splitting unseparated string and numerical variables in R

Question

I have converted a PDF to a text file, and my data set is structured as follows:

data=c("Paris21London3Tokyo51San Francisco38")

And I would like to obtain the following structure:

matrix(c("Paris","London","Tokyo","San Francisco",21,3,51,38),4,2)

Does anyone have a method to do this? Thanks.

Konrad Rudolph · Accepted Answer

Regular expressions are the right tool here but, contrary to what the other answer suggests, strsplit is not well suited to the job.

It is better to use regular expression matching, with two separate expressions for words and numbers:

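# Words: runs of letters and spaces (so "San Francisco" stays in one piece)
words = '[a-zA-Z ]+'
# Numbers: optional sign, digits, optional decimal part
# (backslashes must be doubled inside an R string literal)
numbers = '[+-]?\\d+(\\.\\d+)?'
word_matches = gregexpr(words, data)
number_matches = gregexpr(numbers, data)

# One row per word/number pair; cbind yields a character matrix
result = cbind(regmatches(data, word_matches)[[1]],
               regmatches(data, number_matches)[[1]])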
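
If the numbers are needed as numeric values rather than strings, the same matches can be collected into a data frame instead of a character matrix. This is only a sketch on the sample data from the question; the names `cities`, `values` and `df` are illustrative and not part of the answer above.

```r
data <- "Paris21London3Tokyo51San Francisco38"

words <- "[a-zA-Z ]+"
numbers <- "[+-]?\\d+(\\.\\d+)?"

# Extract all word matches and all number matches from the single string
cities <- regmatches(data, gregexpr(words, data))[[1]]
values <- regmatches(data, gregexpr(numbers, data))[[1]]

# A data frame can hold a character column next to a numeric one
df <- data.frame(city = cities,
                 value = as.numeric(values),
                 stringsAsFactors = FALSE)
print(df)
```

This assumes every word is followed by exactly one number; if the two vectors differ in length, `data.frame` stops with an error, or silently recycles the shorter vector when one length is a multiple of the other.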

This recognises any number with an optional decimal point, and an optional sign. It does not recognise numbers in scientific (exponential) notation. This can be trivially added, if necessary.
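
A rough sketch of the scientific-notation extension follows. The exponent format (`e` or `E` followed by an optionally signed integer) is an assumption, and `numbers_sci` and the sample string are made up for illustration. One caveat: the word pattern `[a-zA-Z ]+` would also match the `e` of an exponent, so this sketch takes the words as the pieces between number matches, using `regmatches(..., invert = TRUE)`, instead of using a second pattern.

```r
# Number pattern with an optional exponent part (assumed format: e/E + integer)
numbers_sci <- "[+-]?\\d+(\\.\\d+)?([eE][+-]?\\d+)?"

# Hypothetical input mixing plain, signed and exponential numbers
sample_data <- "Paris2.1e1London3Tokyo-5.5San Francisco4E2"

number_matches <- gregexpr(numbers_sci, sample_data)

# The matched numbers, converted from strings to numeric values
values <- as.numeric(regmatches(sample_data, number_matches)[[1]])

# Everything between the number matches; drop empty leading/trailing pieces
pieces <- regmatches(sample_data, number_matches, invert = TRUE)[[1]]
cities <- pieces[nzchar(pieces)]

result <- data.frame(city = cities, value = values, stringsAsFactors = FALSE)
print(result)
```

Note that input such as a word beginning with `e` directly after a number followed by more digits is inherently ambiguous in unseparated data, so the exponent form is only safe if the data is known to use it.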
