user8248672

Split strings before and after a word in R

I am using Kaggle's 2017 Data Science Survey data and am trying to look at the frequencies of majors. People have entered double majors in the format "X and Y" (for example, "engineering physics and medicine"). Here is a glimpse of the data:

> dput(head(major_free, 20))
c("biochemistry", "architecture", "economics", "engineering physics and medicine", 
"chemistry", "software engineering", "image processing research area", 
"applied mathematics", "biochemistry", "mechatronic engineering", 
"sound technology", "major-graphic design; minor- asian studies", 
"english literature and langauge", "bioinformatics", "biotechnology", 
"electronics and communication engineering", "chemistry", "electronic with image processing and ai", 
"geology", "software engineer")

> head(major_free)
[1] "biochemistry"                    
[2] "architecture"                    
[3] "economics"                       
[4] "engineering physics and medicine"
[5] "chemistry"                       
[6] "software engineering"  

I want to split the double majors into two separate majors on two separate rows (inside a data frame). I've tried:

strsplit(major_free, "and")

This gives me a long list, and I don't know how to turn it into a data frame that I can use to graph the frequencies of the majors.

2017/11/26 EDIT:

I wanted to do the same thing, but splitting before and after "&", ";", and so on:

> major_free <- unlist(strsplit(major_free, "&"))
Error in strsplit(major_free, "&") : non-character argument
> class("&")
[1] "character"

Weird that R is not reading it as a character in strsplit.

Upvotes: 1

Views: 1682

Answers (3)

jmuhlenkamp

Reputation: 2150

The code below takes your basic strsplit parsing and the example data provided and gives you a data.frame with a single column, in which each double major is broken out into two observations.

data.frame(major = unlist(strsplit(major_free, " and ")))

Be warned, though, that based on your example data alone you will need to do more parsing, as row 13 shows:

data.frame(major = unlist(strsplit(major_free, " and ")))[13,]
[1] major-graphic design; minor- asian studies

Finally, if you do not want factors, you'll want to specify stringsAsFactors=FALSE:

data.frame(major = unlist(strsplit(major_free, " and ")), stringsAsFactors=FALSE)
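For the extra parsing, and for the "&" and ";" separators mentioned in the question's edit, one option is a single regular expression that covers all the separators. This is only a minimal sketch: the sample vector, the choice of separators, and the helper name split_majors are assumptions made for illustration. The as.character() call is there because strsplit() stops with "non-character argument" when its first argument is not a character vector (a factor, for instance), which is one likely cause of the error shown in the edit. If major_free has already been turned into a data frame, pass the column rather than the whole data frame.

```r
# Hypothetical sample in the same shape as major_free
majors <- c("engineering physics and medicine",
            "major-graphic design; minor- asian studies",
            "math & statistics",
            "chemistry")

# Hypothetical helper: split on " and ", "&" or ";", then trim whitespace
split_majors <- function(x) {
  # strsplit() needs a character vector, so convert factors first
  parts <- unlist(strsplit(as.character(x), " and |&|;"))
  trimws(parts)
}

major_df <- data.frame(major = split_majors(majors), stringsAsFactors = FALSE)

# Frequency of each major, most common first
major_counts <- sort(table(major_df$major), decreasing = TRUE)
```

The pattern is deliberately narrow: a major such as "electronic with image processing and ai" would still be split on " and ", so entries like that probably need to be handled case by case.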

Upvotes: 0

Eric

Reputation: 1701

Or this (the only difference from the answer by @Christoph is the use of the base strsplit function):

li <- c("a", "a and b", "b", "b and c")
data.frame(majors = unlist(lapply(li, strsplit, " and ")))
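Since strsplit() is already vectorised over its first argument, the lapply() wrapper and a direct call should give the same result. The sketch below checks that on the li vector and then uses table() to get the frequencies the question asks about; the names direct, wrapped and freq are illustrative only.

```r
li <- c("a", "a and b", "b", "b and c")

# Direct call versus the lapply() form
direct  <- unlist(strsplit(li, " and "))
wrapped <- unlist(lapply(li, strsplit, " and "))

# Compare the two results
same <- identical(direct, wrapped)

# One row per major with its count, ready for plotting
freq <- as.data.frame(table(majors = direct), stringsAsFactors = FALSE)
```

The freq data frame has a majors column and a Freq column, which can be passed to something like barplot(freq$Freq, names.arg = freq$majors).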

Upvotes: 0

Christoph

Reputation: 7063

What about

li <- c("a", "a and b", "b", "b and c")
df <- stringr::str_split_fixed(li, " and ", 2)

Depending on the data, you could add something like df[complete.cases(df), ]. Please add a reproducible example if this does not help.

Upvotes: 1
