Reputation:
I am using Kaggle's 2017 Data Science Survey data and am trying to look at the frequencies of majors. People entered double majors in the format "X and Y" (for example, "engineering physics and medicine"). Here is a glimpse of the data:
> dput(head(major_free, 20))
c("biochemistry", "architecture", "economics", "engineering physics and medicine",
"chemistry", "software engineering", "image processing research area",
"applied mathematics", "biochemistry", "mechatronic engineering",
"sound technology", "major-graphic design; minor- asian studies",
"english literature and langauge", "bioinformatics", "biotechnology",
"electronics and communication engineering", "chemistry", "electronic with image processing and ai",
"geology", "software engineer")
> head(major_free)
[1] "biochemistry"
[2] "architecture"
[3] "economics"
[4] "engineering physics and medicine"
[5] "chemistry"
[6] "software engineering"
I want to split the double majors into two separate majors on two separate rows (inside a data frame). I've tried:
strsplit(major_free, "and")
This gives me a long list, and I don't know how to turn it into a dataframe I can use to graph the frequencies of the majors.
2017/11/26 EDIT:
I wanted to do the same thing, but split before and after "&", ";", etc.:
> major_free <- unlist(strsplit(major_free, "&"))
Error in strsplit(major_free, "&") : non-character argument
> class("&")
[1] "character"
Weird that R is not reading it as a character in strsplit.
Upvotes: 1
Views: 1682
Reputation: 2150
The code below takes your basic strsplit approach and the example data provided and gives you a data.frame with a single column, in which each double major is broken out into two observations.
data.frame(major = unlist(strsplit(major_free, " and ")))
Be warned, though, that based solely on your example data you will need to do more parsing, as row 13 shows:
data.frame(major = unlist(strsplit(major_free, " and ")))[13,]
[1] major-graphic design; minor- asian studies
Finally, if you do not want factors, specify stringsAsFactors=FALSE:
data.frame(major = unlist(strsplit(major_free, " and ")), stringsAsFactors=FALSE)
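For the extra parsing mentioned above (entries separated by "&" or ";"), here is a minimal sketch. It assumes that major_free may be stored as a factor rather than a character vector, which is one possible reason for a "non-character argument" error, since strsplit requires a character vector as its first argument. The sample values and the names majors_chr, parts and majors are made up for illustration.

```r
# Hypothetical sample input, stored as a factor to mimic a column that
# was read in with stringsAsFactors = TRUE (an assumption)
major_free <- factor(c("engineering physics and medicine",
                       "math & statistics",
                       "major-graphic design; minor- asian studies",
                       "chemistry"))

# strsplit() only accepts character vectors, so convert first
majors_chr <- as.character(major_free)

# The split pattern is a regular expression: split on " and ", "&" or ";"
parts <- unlist(strsplit(majors_chr, " and |&|;"))

# Trim the whitespace left around the pieces after splitting
majors <- data.frame(major = trimws(parts), stringsAsFactors = FALSE)
```

Splitting on " and " with surrounding spaces, rather than on "and", avoids breaking words that merely contain those letters.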
Upvotes: 0
Reputation: 1701
Or this (the only difference from the answer by @Christoph is the use of the base strsplit function):
li <- c("a", "a and b", "b", "b and c")
data.frame(majors = unlist(lapply(li, strsplit, " and " )))
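Since the goal in the question is to graph frequencies, here is a rough sketch of one way to get from the split values to counts using base R's table. The names freq and freq_df are illustrative, and strsplit is called directly on the vector here because it is already vectorised over its first argument.

```r
li <- c("a", "a and b", "b", "b and c")

# One row per major after splitting the double majors
majors <- data.frame(majors = unlist(strsplit(li, " and ")),
                     stringsAsFactors = FALSE)

# Count how often each major occurs, most frequent first
freq <- sort(table(majors$majors), decreasing = TRUE)

# Convert the table to a data frame with explicit column names
freq_df <- as.data.frame(freq, stringsAsFactors = FALSE)
names(freq_df) <- c("major", "count")

# A quick base-graphics plot of the counts could then be drawn with:
# barplot(freq)
```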
Upvotes: 0
Reputation: 7063
What about:
li <- c("a", "a and b", "b", "b and c")
df <- stringr::str_split_fixed(li, " and ", 2)
Depending on the data, you could add something like df[complete.cases(df), ]
Please add a reproducible example, if this does not help.
Upvotes: 1