Reputation: 2067
I am trying to split my data into new columns based on |
. For example I have this observation:
fdic : Federal Deposit Insurance Corp | unbco : United Bancorp Inc Ohio
Which I would like to split into two columns based on |
. However some observations has no separator and some has more that 2 separators and using separate
from tidyr
is not possible. I have the following line as.data.frame(do.call(rbind, strsplit(xx$CO, "\\|")))
- which almost does what I want but it repeates the observations when it separates.
That is;
The first observation.
evgnl : Evogene Limited | monsan : Monsanto Company
gets split correctly for column 1 and 2 but it repeates column 1.
evgnl : Evogene Limited monsan : Monsanto Company evgnl : Evogene Limited
I would like these observations to have an NA
value.
evgnl : Evogene Limited monsan : Monsanto Company NA
Data:
structure(list(grp = c("10163", "8518", "2533", "6604", "7984",
"10689", "1911", "8092", "3091", "10878", "2193", "102", "214",
"4486", "8789", "8352", "10769", "10366", "6406", "8634"), WC = c(" 2,685 words ",
" 632 words ", " 139 words ", " 359 words ", " 3,610 words ",
" 448 words ", " 185 words ", " 2,321 words ", " 192 words ",
" 830 words ", " 803 words ", " 4,697 words ", " 4,649 words ",
" 748 words ", " 1,029 words ", " 3,125 words ", " 44 words ",
" 3,212 words ", " 1,150 words ", " 774 words "), CO = c(" evgnl : Evogene Limited | monsan : Monsanto Company ",
" codvbc : Codorus Valley Bancorp Inc ", " blycon : Blyth Inc ",
" icfcns : ICF International Inc. ", " fossil : Fossil Group Inc ",
" jpmsi : JP Morgan Securities LLC | rganus : Reinsurance Group of America Inc | cnyc : JPMorgan Chase & Co. ",
" usxmar : US Steel Corp ", "NULL", " toro : The Toro Company ",
" casms : CAS Medical Systems Inc ", " fdic : Federal Deposit Insurance Corp | unbco : United Bancorp Inc Ohio ",
" crane : Crane Co ", " pplres : PPL Corp ", " unnatf : United Natural Foods Inc ",
" intgxc : IntelGenx Technologies Corp. ", " gordmi : Gordmans Stores, Inc. | scp : Sun Capital Partners Inc ",
"NULL", " crginc : Cargill, Inc. ", "NULL", " cytmxt : CytomX Therapeutics, Inc. "
)), class = "data.frame", row.names = c(NA, -20L))
Upvotes: 1
Views: 54
Reputation: 24149
Here is a one liner using dplyr and tidyr.
The key is use the separate_rows
function to separation into an indeterminate number of rows and then to pivot_wider
to convert back to the desired dataframe.
library(tidyr)
library(dplyr)
df %>% separate_rows(CO, sep="\\|") %>%
group_by(grp, WC) %>%
mutate(ColID=row_number()) %>%
pivot_wider(id_cols=c(grp, WC), names_from = ColID, values_from = CO)
Upvotes: 1
Reputation: 6246
data.table::tstrsplit
allows you to do that with argument fixed = FALSE
:
library(data.table)
setDT(df)
df[,tstrsplit(CO, "\\|", fixed = FALSE)]
V1 V2
1: evgnl : Evogene Limited monsan : Monsanto Company
2: codvbc : Codorus Valley Bancorp Inc <NA>
3: blycon : Blyth Inc <NA>
4: icfcns : ICF International Inc. <NA>
5: fossil : Fossil Group Inc <NA>
6: jpmsi : JP Morgan Securities LLC rganus : Reinsurance Group of America Inc
7: usxmar : US Steel Corp <NA>
8: NULL <NA>
9: toro : The Toro Company <NA>
10: casms : CAS Medical Systems Inc <NA>
11: fdic : Federal Deposit Insurance Corp unbco : United Bancorp Inc Ohio
12: crane : Crane Co <NA>
13: pplres : PPL Corp <NA>
14: unnatf : United Natural Foods Inc <NA>
15: intgxc : IntelGenx Technologies Corp. <NA>
16: gordmi : Gordmans Stores, Inc. scp : Sun Capital Partners Inc
17: NULL <NA>
18: crginc : Cargill, Inc. <NA>
19: NULL <NA>
20: cytmxt : CytomX Therapeutics, Inc. <NA>
V3
1: <NA>
2: <NA>
3: <NA>
4: <NA>
5: <NA>
6: cnyc : JPMorgan Chase & Co.
7: <NA>
8: <NA>
9: <NA>
10: <NA>
11: <NA>
12: <NA>
13: <NA>
14: <NA>
15: <NA>
16: <NA>
17: <NA>
18: <NA>
19: <NA>
20: <NA>
You end-up with a data.table
object (enhanced data.frame
)
You can also use stringr and endup with a matrix:
stringr::str_split(df$CO, "\\|", simplify = TRUE)
[,1] [,2]
[1,] " evgnl : Evogene Limited " " monsan : Monsanto Company "
[2,] " codvbc : Codorus Valley Bancorp Inc " ""
[3,] " blycon : Blyth Inc " ""
[4,] " icfcns : ICF International Inc. " ""
[5,] " fossil : Fossil Group Inc " ""
[6,] " jpmsi : JP Morgan Securities LLC " " rganus : Reinsurance Group of America Inc "
[7,] " usxmar : US Steel Corp " ""
[8,] "NULL" ""
[9,] " toro : The Toro Company " ""
[10,] " casms : CAS Medical Systems Inc " ""
[11,] " fdic : Federal Deposit Insurance Corp " " unbco : United Bancorp Inc Ohio "
[12,] " crane : Crane Co " ""
[13,] " pplres : PPL Corp " ""
[14,] " unnatf : United Natural Foods Inc " ""
[15,] " intgxc : IntelGenx Technologies Corp. " ""
[16,] " gordmi : Gordmans Stores, Inc. " " scp : Sun Capital Partners Inc "
[17,] "NULL" ""
[18,] " crginc : Cargill, Inc. " ""
[19,] "NULL" ""
[20,] " cytmxt : CytomX Therapeutics, Inc. " ""
[,3]
[1,] ""
[2,] ""
[3,] ""
[4,] ""
[5,] ""
[6,] " cnyc : JPMorgan Chase & Co. "
[7,] ""
[8,] ""
[9,] ""
[10,] ""
[11,] ""
[12,] ""
[13,] ""
[14,] ""
[15,] ""
[16,] ""
[17,] ""
[18,] ""
[19,] ""
[20,] ""
Upvotes: 1