TDog

Reputation: 175

How do I split a string with tidyr::separate in R and retain the values of the separator string?

I have a data set:

crimes<-data.frame(x=c("Smith", "Jones"), charges=c("murder, first degree-G, manslaughter-NG", "assault-NG, larceny, second degree-G"))

I'm using tidyr::separate to split the charges column on a match with "G,":

crimes<-separate(crimes, charges, into=c("v1","v2"), sep="G,")

This splits my columns, but removes the separator "G,". I want to retain the "G," in the resulting column split.

My desired output is:

 x         v1                       v2
 Smith     murder, first degree-G   manslaughter-NG
 Jones     assault-NG               larceny, second degree-G

Any suggestions welcome.

Upvotes: 12

Views: 11769

Answers (2)

Cameron

Reputation: 2965

In the calls below, replace <yourRegexPattern> with your own regular expression.

If you want the separator kept in the left column, use a lookbehind:

dataframe %>% separate(column_to_sep, into = c("newCol1", "newCol2"), sep="(?<=<yourRegexPattern>)")

If you want the separator kept in the right column, use a lookahead:

dataframe %>% separate(column_to_sep, into = c("newCol1", "newCol2"), sep="(?=<yourRegexPattern>)")

Also note that when you are separating a word from a group of digits (e.g. August1990 into August and 1990), the lookahead matches before every digit, so you need extra="merge" to keep the whole run of digits together in the last column.

Example:

dataframe %>% separate(column_to_sep, into = c("newCol1", "newCol2"), sep="(?=[[:digit:]])", extra="merge")
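As a concrete sketch of the template above: the data frame `dates` and its column `stamp` are made up for illustration and are not from the question.

```r
library(tidyr)

# Hypothetical example data: a month name fused to a year
dates <- data.frame(stamp = c("August1990", "March2004"))

# The lookahead matches before every digit; extra = "merge" limits
# the split to the first match, so the year stays in one piece
out <- separate(dates, stamp, into = c("month", "year"),
                sep = "(?=[[:digit:]])", extra = "merge")
out
```

The year column comes back as character; adding convert = TRUE should turn it into an integer if that is what you need.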

Upvotes: 12

Matias Andina

Reputation: 4220

UPDATE

This is what you asked for. Keep in mind that your data is not tidy (both V1 and V2 hold more than one variable in each column).

A<-separate(crimes,charges,into=c("V1","V2"),sep = "(?<=G,)")
A
      x                      V1                        V2
1 Smith murder, first degree-G,           manslaughter-NG
2 Jones             assault-NG,  larceny, second degree-G
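Note that the split above leaves the comma at the end of V1 and a leading space in V2. If the goal is the exact table from the question (the "G" kept, the ", " dropped), one possible variation is to put only the "G" in the lookbehind and let the ", " be consumed as the separator. This is a sketch, not part of the original answer; the name `B` is arbitrary.

```r
library(tidyr)

crimes <- data.frame(
  x = c("Smith", "Jones"),
  charges = c("murder, first degree-G, manslaughter-NG",
              "assault-NG, larceny, second degree-G")
)

# Split on ", " only where it is preceded by "G":
# the "G" is kept (lookbehind), the ", " is removed (actual match)
B <- separate(crimes, charges, into = c("v1", "v2"), sep = "(?<=G), ")
B
```

The comma inside "murder, first degree" is left alone because it follows an "r", not a "G".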

An easier way to keep the "G" or "NG" is to use sep=", ", as suggested by alistaire. This works cleanly when no charge contains a comma of its own, as in the output below.

A<-separate(crimes, charges, into=c("v1","v2"), sep = ', ')

This gives

      x         v1              v2
1 Smith   murder-G manslaughter-NG
2 Jones assault-NG       larceny-G

If you wanted to keep separating your data.frame (using the -)

separate(A, v1, into = c("v3","v4"), sep = "-")

that gives

      x      v3 v4              v2
1 Smith  murder  G manslaughter-NG
2 Jones assault NG       larceny-G

You'll need to do that again for the v2 column. I don't know whether you want to keep separating; please post your expected output so I can make my answer more specific.
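A minimal sketch of doing both splits in sequence, assuming A looks like the table shown above (it is rebuilt by hand here so the snippet runs on its own). The column names v5 and v6 and the intermediate names `step1` and `step2` are placeholders.

```r
library(tidyr)

# A, rebuilt to match the printed output above
A <- data.frame(
  x  = c("Smith", "Jones"),
  v1 = c("murder-G", "assault-NG"),
  v2 = c("manslaughter-NG", "larceny-G")
)

# Split v1 into charge and verdict, then do the same for v2
step1 <- separate(A, v1, into = c("v3", "v4"), sep = "-")
step2 <- separate(step1, v2, into = c("v5", "v6"), sep = "-")
step2
```

Each call replaces the column it splits, so the result has the columns x, v3, v4, v5 and v6.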

Upvotes: 7
