Split an uneven column in a dataframe into multiple columns in R

Question

I have a dataframe A like below where column Info has some missing information like Sample2 doesn't have any color white or black which you see for others:

Reproducible example:

A <- structure(list(Sample = structure(1:7, .Label = c("Sample1", 
"Sample2", "Sample3", "Sample4", "Sample5", "Sample6", "Sample7"
), class = "factor"), Description = structure(c(7L, 3L, 4L, 2L, 
6L, 1L, 5L), .Label = c("37 years, female, white, alive, 257 days", 
"43 years, male, white, stage:iiic, alive, 598 days", "53 years, male, stage:iiib, alive, 792 days", 
"68 years, female, white, stage:iiic, dead, 740 days", "69 years, female, black or african american, stage:iia, alive, 627 days", 
"74 years, white, stage:i, alive, 1001 days", "82 years, female, white, stage:iiib, alive, 1419 days"
), class = "factor")), class = "data.frame", row.names = c(NA, 
-7L))

Dataframe A looks like below:

Sample  Info
Sample1 82 years, female, white, stage:iiib, alive, 1419 days
Sample2 53 years, male, stage:iiib, alive, 792 days
Sample3 68 years, female, white, stage:iiic, dead, 740 days
Sample4 43 years, male, white, stage:iiic, alive, 598 days
Sample5 74 years, white, stage:i, alive, 1001 days
Sample6 37 years, female, white, alive, 257 days
Sample7 69 years, female, black, stage:iia, alive, 627 days

To separate the Info column into multiple columns I used separate function like below

library(dplyr)
library(tidyr)
A2 <- separate(A, 'Info', paste("Info", 1:6, sep="_"), sep=",", extra="drop")

But the new columns looks uneven like below:

Sample  Info_1     Info_2   Info_3    Info_4    Info_5    Info_6
Sample1 82 years   female    white   stage:iiib  alive   1419 days
Sample2 53 years    male  stage:iiib    alive   792 days    NA
Sample3 68 years   female    white   stage:iiic  dead    740 days
Sample4 43 years    male     white   stage:iiic  alive   598 days
Sample5 74 years   white    stage:i    alive     1001 days  NA
Sample6 37 years   female    white     alive     257 days   NA
Sample7 69 years   female    black   stage:iia   alive   627 days

I want the output to be looked like below where the missing information need to be empty space or NA and the last column showing only numbers without any word days in it:

Sample  Info_1     Info_2   Info_3    Info_4    Info_5   Info_6
Sample1 82 years   female    white   stage:iiib  alive   1419
Sample2 53 years    male             stage:iiib  alive   792    
Sample3 68 years   female    white   stage:iiic  dead    740
Sample4 43 years    male     white   stage:iiic  alive   598
Sample5 74 years             white    stage:i    alive   1001
Sample6 37 years   female    white               alive   257
Sample7 69 years   female    black   stage:iia   alive   627

Any help is appreciated. thanq

G. Grothendieck · Accepted Answer

Using the data shown reproducibly in the Note at the end we can use read.pattern with the indicated pattern pat and then remove junk columns (every other column). The lines marked ## can be omitted if you don't require the column names to be exactly as in the question.

library(gsubfn)

pat <- 
"((\d+ years), )?((female|male), )?((white|black), )?((stage:\S+), )?((alive|dead), )?((\d+) days)?"
r <- read.pattern(text = as.character(DF$Info), pattern = pat, as.is = TRUE)
DF2 <- cbind(Sample = DF$Sample, r[c(FALSE, TRUE)], stringsAsFactors = FALSE)

nc <- ncol(DF2) ## 
names(DF2)[-1] <- paste0("Info_", 1:(nc-1)) ##

DF2

giving:

   Sample   Info_1 Info_2 Info_3     Info_4 Info_5 Info_6
1 Sample1 82 years female  white stage:iiib  alive   1419
2 Sample2 53 years   male        stage:iiib  alive    792
3 Sample3 68 years female  white stage:iiic   dead    740
4 Sample4 43 years   male  white stage:iiic  alive    598
5 Sample5 74 years         white    stage:i  alive   1001
6 Sample6 37 years female  white             alive    257
7 Sample7 69 years female  black  stage:iia  alive    627

Note

The input DF in reproducible form is as follows.

Lines <- "
Sample;Info
Sample1;82 years, female, white, stage:iiib, alive, 1419 days
Sample2;53 years, male, stage:iiib, alive, 792 days
Sample3;68 years, female, white, stage:iiic, dead, 740 days
Sample4;43 years, male, white, stage:iiic, alive, 598 days
Sample5;74 years, white, stage:i, alive, 1001 days
Sample6;37 years, female, white, alive, 257 days
Sample7;69 years, female, black, stage:iia, alive, 627 days"

DF <- read.table(text = Lines, header = TRUE, sep = ";", as.is = TRUE, strip.white = TRUE)

Split an uneven column in a dataframe into multiple columns in R

Answers (2)

Note

Related Questions