stack_learner
stack_learner

Reputation: 263

Split an uneven column in a dataframe into multiple columns in R

I have a dataframe A like below where column Info has some missing information like Sample2 doesn't have any color white or black which you see for others:

Reproducible example:

A <- structure(list(Sample = structure(1:7, .Label = c("Sample1", 
"Sample2", "Sample3", "Sample4", "Sample5", "Sample6", "Sample7"
), class = "factor"), Description = structure(c(7L, 3L, 4L, 2L, 
6L, 1L, 5L), .Label = c("37 years, female, white, alive, 257 days", 
"43 years, male, white, stage:iiic, alive, 598 days", "53 years, male, stage:iiib, alive, 792 days", 
"68 years, female, white, stage:iiic, dead, 740 days", "69 years, female, black or african american, stage:iia, alive, 627 days", 
"74 years, white, stage:i, alive, 1001 days", "82 years, female, white, stage:iiib, alive, 1419 days"
), class = "factor")), class = "data.frame", row.names = c(NA, 
-7L))

Dataframe A looks like below:

Sample  Info
Sample1 82 years, female, white, stage:iiib, alive, 1419 days
Sample2 53 years, male, stage:iiib, alive, 792 days
Sample3 68 years, female, white, stage:iiic, dead, 740 days
Sample4 43 years, male, white, stage:iiic, alive, 598 days
Sample5 74 years, white, stage:i, alive, 1001 days
Sample6 37 years, female, white, alive, 257 days
Sample7 69 years, female, black, stage:iia, alive, 627 days

To separate the Info column into multiple columns I used separate function like below

library(dplyr)
library(tidyr)
A2 <- separate(A, 'Info', paste("Info", 1:6, sep="_"), sep=",", extra="drop")

But the new columns looks uneven like below:

Sample  Info_1     Info_2   Info_3    Info_4    Info_5    Info_6
Sample1 82 years   female    white   stage:iiib  alive   1419 days
Sample2 53 years    male  stage:iiib    alive   792 days    NA
Sample3 68 years   female    white   stage:iiic  dead    740 days
Sample4 43 years    male     white   stage:iiic  alive   598 days
Sample5 74 years   white    stage:i    alive     1001 days  NA
Sample6 37 years   female    white     alive     257 days   NA
Sample7 69 years   female    black   stage:iia   alive   627 days

I want the output to be looked like below where the missing information need to be empty space or NA and the last column showing only numbers without any word days in it:

Sample  Info_1     Info_2   Info_3    Info_4    Info_5   Info_6
Sample1 82 years   female    white   stage:iiib  alive   1419
Sample2 53 years    male             stage:iiib  alive   792    
Sample3 68 years   female    white   stage:iiic  dead    740
Sample4 43 years    male     white   stage:iiic  alive   598
Sample5 74 years             white    stage:i    alive   1001
Sample6 37 years   female    white               alive   257
Sample7 69 years   female    black   stage:iia   alive   627

Any help is appreciated. thanq

Upvotes: 2

Views: 784

Answers (2)

G. Grothendieck
G. Grothendieck

Reputation: 269471

Using the data shown reproducibly in the Note at the end we can use read.pattern with the indicated pattern pat and then remove junk columns (every other column). The lines marked ## can be omitted if you don't require the column names to be exactly as in the question.

library(gsubfn)

pat <- 
"((\\d+ years), )?((female|male), )?((white|black), )?((stage:\\S+), )?((alive|dead), )?((\\d+) days)?"
r <- read.pattern(text = as.character(DF$Info), pattern = pat, as.is = TRUE)
DF2 <- cbind(Sample = DF$Sample, r[c(FALSE, TRUE)], stringsAsFactors = FALSE)

nc <- ncol(DF2) ## 
names(DF2)[-1] <- paste0("Info_", 1:(nc-1)) ##

DF2

giving:

   Sample   Info_1 Info_2 Info_3     Info_4 Info_5 Info_6
1 Sample1 82 years female  white stage:iiib  alive   1419
2 Sample2 53 years   male        stage:iiib  alive    792
3 Sample3 68 years female  white stage:iiic   dead    740
4 Sample4 43 years   male  white stage:iiic  alive    598
5 Sample5 74 years         white    stage:i  alive   1001
6 Sample6 37 years female  white             alive    257
7 Sample7 69 years female  black  stage:iia  alive    627

Note

The input DF in reproducible form is as follows.

Lines <- "
Sample;Info
Sample1;82 years, female, white, stage:iiib, alive, 1419 days
Sample2;53 years, male, stage:iiib, alive, 792 days
Sample3;68 years, female, white, stage:iiic, dead, 740 days
Sample4;43 years, male, white, stage:iiic, alive, 598 days
Sample5;74 years, white, stage:i, alive, 1001 days
Sample6;37 years, female, white, alive, 257 days
Sample7;69 years, female, black, stage:iia, alive, 627 days"

DF <- read.table(text = Lines, header = TRUE, sep = ";", as.is = TRUE, strip.white = TRUE)

Upvotes: 3

www
www

Reputation: 39154

Here is one idea. I am sure there is a better way, but here is a start.

We can use extract from tidyr to split the columns starting with the complete records. The key is to set identifiable and meaningful column names. After this, we drop the rows with NA and remove them from the original data frame. And then we can do the extract again assuming some columns is missing. Eventually, we can satisfy all the missing condition and separate them properly. The last step is to combine all the subset data frames.

If you have a lot of different missing column components, this method may be too tedius. However, if you are certain what could be missing. We can design a function and wrap all these steps in the function.

library(tidyverse)

# Complete rows
dat2 <- dat %>%
  extract(Info, into = c("Year", "Sex", "Race", "Stage", "Status", "Days"),
          regex = "([0-9]* years), (male|female), (black|white), (stage\\:i[A-Za-z]*), (dead|alive), ([0-9]*)") %>%
  drop_na(Year)

dat <- dat %>% anti_join(dat2, by = "Sample")

# Record with no race
dat3 <- dat %>%
  extract(Info, into = c("Year", "Sex", "Stage", "Status", "Days"),
          regex = "([0-9]* years), (male|female), (stage\\:i[A-Za-z]*), (dead|alive), ([0-9]*)") %>%
  drop_na(Year)

dat <- dat %>% anti_join(dat3, by = "Sample")

# Record with no sex
dat4 <- dat %>%
  extract(Info, into = c("Year", "Race", "Stage", "Status", "Days"),
          regex = "([0-9]* years), (black|white), (stage\\:i[A-Za-z]*), (dead|alive), ([0-9]*)") %>%
  drop_na(Year)

dat <- dat %>% anti_join(dat4, by = "Sample")

# Record with no stage
dat5 <- dat %>%
  extract(Info, into = c("Year", "Sex", "Race", "Status", "Days"),
          regex = "([0-9]* years), (male|female), (black|white), (dead|alive), ([0-9]*)") %>%
  drop_na(Year)

dat <- dat %>% anti_join(dat5, by = "Sample")

# Combine all subset data frame
dat_new <- bind_rows(dat2, dat3, dat4, dat5) %>%
  arrange(Sample)
dat_new
#    Sample     Year    Sex  Race      Stage Status Days
# 1 Sample1 82 years female white stage:iiib  alive 1419
# 2 Sample2 53 years   male  <NA> stage:iiib  alive  792
# 3 Sample3 68 years female white stage:iiic   dead  740
# 4 Sample4 43 years   male white stage:iiic  alive  598
# 5 Sample5 74 years   <NA> white    stage:i  alive 1001
# 6 Sample6 37 years female white       <NA>  alive  257
# 7 Sample7 69 years female black  stage:iia  alive  627

DATA

dat <- read.table(text = "Sample  Info
Sample1 '82 years, female, white, stage:iiib, alive, 1419 days'
                  Sample2 '53 years, male, stage:iiib, alive, 792 days'
                  Sample3 '68 years, female, white, stage:iiic, dead, 740 days'
                  Sample4 '43 years, male, white, stage:iiic, alive, 598 days'
                  Sample5 '74 years, white, stage:i, alive, 1001 days'
                  Sample6 '37 years, female, white, alive, 257 days'
                  Sample7 '69 years, female, black, stage:iia, alive, 627 days'",
                  header = TRUE, stringsAsFactors = FALSE)

Upvotes: 2

Related Questions