Reputation: 263
I have a dataframe A
like below where column Info
has some missing information like Sample2
doesn't have any color white or black which you see for others:
Reproducible example:
A <- structure(list(Sample = structure(1:7, .Label = c("Sample1",
"Sample2", "Sample3", "Sample4", "Sample5", "Sample6", "Sample7"
), class = "factor"), Description = structure(c(7L, 3L, 4L, 2L,
6L, 1L, 5L), .Label = c("37 years, female, white, alive, 257 days",
"43 years, male, white, stage:iiic, alive, 598 days", "53 years, male, stage:iiib, alive, 792 days",
"68 years, female, white, stage:iiic, dead, 740 days", "69 years, female, black or african american, stage:iia, alive, 627 days",
"74 years, white, stage:i, alive, 1001 days", "82 years, female, white, stage:iiib, alive, 1419 days"
), class = "factor")), class = "data.frame", row.names = c(NA,
-7L))
Dataframe A
looks like below:
Sample Info
Sample1 82 years, female, white, stage:iiib, alive, 1419 days
Sample2 53 years, male, stage:iiib, alive, 792 days
Sample3 68 years, female, white, stage:iiic, dead, 740 days
Sample4 43 years, male, white, stage:iiic, alive, 598 days
Sample5 74 years, white, stage:i, alive, 1001 days
Sample6 37 years, female, white, alive, 257 days
Sample7 69 years, female, black, stage:iia, alive, 627 days
To separate the Info
column into multiple columns I used separate
function like below
library(dplyr)
library(tidyr)
A2 <- separate(A, 'Info', paste("Info", 1:6, sep="_"), sep=",", extra="drop")
But the new columns looks uneven like below:
Sample Info_1 Info_2 Info_3 Info_4 Info_5 Info_6
Sample1 82 years female white stage:iiib alive 1419 days
Sample2 53 years male stage:iiib alive 792 days NA
Sample3 68 years female white stage:iiic dead 740 days
Sample4 43 years male white stage:iiic alive 598 days
Sample5 74 years white stage:i alive 1001 days NA
Sample6 37 years female white alive 257 days NA
Sample7 69 years female black stage:iia alive 627 days
I want the output
to be looked like below where the missing information need to be empty space or NA and the last column showing only numbers without any word days
in it:
Sample Info_1 Info_2 Info_3 Info_4 Info_5 Info_6
Sample1 82 years female white stage:iiib alive 1419
Sample2 53 years male stage:iiib alive 792
Sample3 68 years female white stage:iiic dead 740
Sample4 43 years male white stage:iiic alive 598
Sample5 74 years white stage:i alive 1001
Sample6 37 years female white alive 257
Sample7 69 years female black stage:iia alive 627
Any help is appreciated. thanq
Upvotes: 2
Views: 784
Reputation: 269471
Using the data shown reproducibly in the Note at the end we can use read.pattern
with the indicated pattern pat
and then remove junk columns (every other column). The lines marked ## can be omitted if you don't require the column names to be exactly as in the question.
library(gsubfn)
pat <-
"((\\d+ years), )?((female|male), )?((white|black), )?((stage:\\S+), )?((alive|dead), )?((\\d+) days)?"
r <- read.pattern(text = as.character(DF$Info), pattern = pat, as.is = TRUE)
DF2 <- cbind(Sample = DF$Sample, r[c(FALSE, TRUE)], stringsAsFactors = FALSE)
nc <- ncol(DF2) ##
names(DF2)[-1] <- paste0("Info_", 1:(nc-1)) ##
DF2
giving:
Sample Info_1 Info_2 Info_3 Info_4 Info_5 Info_6
1 Sample1 82 years female white stage:iiib alive 1419
2 Sample2 53 years male stage:iiib alive 792
3 Sample3 68 years female white stage:iiic dead 740
4 Sample4 43 years male white stage:iiic alive 598
5 Sample5 74 years white stage:i alive 1001
6 Sample6 37 years female white alive 257
7 Sample7 69 years female black stage:iia alive 627
The input DF
in reproducible form is as follows.
Lines <- "
Sample;Info
Sample1;82 years, female, white, stage:iiib, alive, 1419 days
Sample2;53 years, male, stage:iiib, alive, 792 days
Sample3;68 years, female, white, stage:iiic, dead, 740 days
Sample4;43 years, male, white, stage:iiic, alive, 598 days
Sample5;74 years, white, stage:i, alive, 1001 days
Sample6;37 years, female, white, alive, 257 days
Sample7;69 years, female, black, stage:iia, alive, 627 days"
DF <- read.table(text = Lines, header = TRUE, sep = ";", as.is = TRUE, strip.white = TRUE)
Upvotes: 3
Reputation: 39154
Here is one idea. I am sure there is a better way, but here is a start.
We can use extract
from tidyr
to split the columns starting with the complete records. The key is to set identifiable and meaningful column names. After this, we drop the rows with NA
and remove them from the original data frame. And then we can do the extract
again assuming some columns is missing. Eventually, we can satisfy all the missing condition and separate them properly. The last step is to combine all the subset data frames.
If you have a lot of different missing column components, this method may be too tedius. However, if you are certain what could be missing. We can design a function and wrap all these steps in the function.
library(tidyverse)
# Complete rows
dat2 <- dat %>%
extract(Info, into = c("Year", "Sex", "Race", "Stage", "Status", "Days"),
regex = "([0-9]* years), (male|female), (black|white), (stage\\:i[A-Za-z]*), (dead|alive), ([0-9]*)") %>%
drop_na(Year)
dat <- dat %>% anti_join(dat2, by = "Sample")
# Record with no race
dat3 <- dat %>%
extract(Info, into = c("Year", "Sex", "Stage", "Status", "Days"),
regex = "([0-9]* years), (male|female), (stage\\:i[A-Za-z]*), (dead|alive), ([0-9]*)") %>%
drop_na(Year)
dat <- dat %>% anti_join(dat3, by = "Sample")
# Record with no sex
dat4 <- dat %>%
extract(Info, into = c("Year", "Race", "Stage", "Status", "Days"),
regex = "([0-9]* years), (black|white), (stage\\:i[A-Za-z]*), (dead|alive), ([0-9]*)") %>%
drop_na(Year)
dat <- dat %>% anti_join(dat4, by = "Sample")
# Record with no stage
dat5 <- dat %>%
extract(Info, into = c("Year", "Sex", "Race", "Status", "Days"),
regex = "([0-9]* years), (male|female), (black|white), (dead|alive), ([0-9]*)") %>%
drop_na(Year)
dat <- dat %>% anti_join(dat5, by = "Sample")
# Combine all subset data frame
dat_new <- bind_rows(dat2, dat3, dat4, dat5) %>%
arrange(Sample)
dat_new
# Sample Year Sex Race Stage Status Days
# 1 Sample1 82 years female white stage:iiib alive 1419
# 2 Sample2 53 years male <NA> stage:iiib alive 792
# 3 Sample3 68 years female white stage:iiic dead 740
# 4 Sample4 43 years male white stage:iiic alive 598
# 5 Sample5 74 years <NA> white stage:i alive 1001
# 6 Sample6 37 years female white <NA> alive 257
# 7 Sample7 69 years female black stage:iia alive 627
DATA
dat <- read.table(text = "Sample Info
Sample1 '82 years, female, white, stage:iiib, alive, 1419 days'
Sample2 '53 years, male, stage:iiib, alive, 792 days'
Sample3 '68 years, female, white, stage:iiic, dead, 740 days'
Sample4 '43 years, male, white, stage:iiic, alive, 598 days'
Sample5 '74 years, white, stage:i, alive, 1001 days'
Sample6 '37 years, female, white, alive, 257 days'
Sample7 '69 years, female, black, stage:iia, alive, 627 days'",
header = TRUE, stringsAsFactors = FALSE)
Upvotes: 2