Jb_Eyd

Reputation: 635

Splitting unseparated string and numerical variables in R

I converted a PDF to a text file, and the resulting data is structured as follows:

data=c("Paris21London3Tokyo51San Francisco38")

And I would like to obtain the following structure:

matrix(c("Paris","London","Tokyo","San Francisco",21,3,51,38),4,2)

Does anyone have a method to do this? Thanks.

Upvotes: 2

Views: 141

Answers (2)

Konrad Rudolph

Reputation: 545588

Regular expressions are the right tool here but, contrary to what the other answer suggests, strsplit is not well suited to the job.

It is better to use regular expression matching, with two separate expressions for words and numbers:

words = '[a-zA-Z ]+'
numbers = '[+-]?\\d+(\\.\\d+)?'
word_matches = gregexpr(words, data)
number_matches = gregexpr(numbers, data)

result = cbind(regmatches(data, word_matches)[[1]],
               regmatches(data, number_matches)[[1]])

This recognises any number with an optional decimal point, and an optional sign. It does not recognise numbers in scientific (exponential) notation. This can be trivially added, if necessary.
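As a rough sketch of that extension (not part of the original answer): an optional exponent group can be appended to the number pattern. The sample string `data_sci` and the name `numbers_sci` are made up for illustration. One caveat worth noting is that the `words` pattern above would also match the exponent letter (`e` or `E`), so this sketch instead takes the words as the pieces left between the number matches, using `regmatches(..., invert = TRUE)`.

```r
# Hypothetical input containing numbers in scientific notation
data_sci <- "Paris2.1e1London3Tokyo5.1E+1San Francisco38"

# Same number pattern as above, plus an optional exponent part
numbers_sci <- '[+-]?\\d+(\\.\\d+)?([eE][+-]?\\d+)?'
number_matches_sci <- gregexpr(numbers_sci, data_sci)

# The matched numbers, still as character strings
nums <- regmatches(data_sci, number_matches_sci)[[1]]

# Everything between the numbers; drop empty pieces at the ends
pieces <- regmatches(data_sci, number_matches_sci, invert = TRUE)[[1]]
words_found <- pieces[nzchar(pieces)]

result_sci <- data.frame(name = words_found,
                         value = as.numeric(nums),
                         stringsAsFactors = FALSE)
```

This assumes the text strictly alternates between a name and a number, and that no name sits directly after a number while starting with `e` or `E` followed by digits.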

Upvotes: 1

akrun

Reputation: 887118

You could try strsplit with regex lookahead and lookbehind:

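# Split at every boundary between a non-digit and a digit (either direction)
v1 <- strsplit(data, '(?<=[^0-9])(?=[0-9])|(?<=[0-9])(?=[^0-9])',
               perl=TRUE)[[1]]
# indx is recycled, so it selects the odd (names) or even (numbers) elements
indx <- c(TRUE, FALSE)
data.frame(Col1= v1[indx], Col2=v1[!indx])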

Update

To include decimal numbers as well, exclude the decimal point from the split boundaries:

data1=c("Paris21.53London3Tokyo51San Francisco38.2")
v2 <- strsplit(data1, '(?<=[^0-9.])(?=[0-9])|(?<=[0-9])(?=[^0-9.])',
               perl=TRUE)[[1]]

indx <- c(TRUE, FALSE)
data.frame(Col1= v2[indx], Col2=v2[!indx])
#           Col1  Col2
#1         Paris 21.53
#2        London     3
#3         Tokyo    51
#4 San Francisco  38.2
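Note that strsplit returns character strings, so Col2 above holds text rather than numbers. If numeric values are needed, a minimal sketch (the names `parts`, `is_name`, and `cities` are just illustrative) could convert the second column with as.numeric:

```r
data1 <- "Paris21.53London3Tokyo51San Francisco38.2"

# Same split as above: boundaries between names and (decimal) numbers
parts <- strsplit(data1, '(?<=[^0-9.])(?=[0-9])|(?<=[0-9])(?=[^0-9.])',
                  perl = TRUE)[[1]]

# Recycled logical index: TRUE picks names, FALSE picks numbers
is_name <- c(TRUE, FALSE)

cities <- data.frame(city  = parts[is_name],
                     value = as.numeric(parts[!is_name]),
                     stringsAsFactors = FALSE)

# The value column can now be used in arithmetic
total <- sum(cities$value)
```

This relies on the input strictly alternating between a name and a number, starting with a name.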

Upvotes: 4
