TWAndrews
TWAndrews

Reputation: 113

Split strings into columns in R where each string has a potentially different number of column entries

I've got a data frame that's got the following form

pages                         count
[page 1, page 2, page 3]      23
[page 2, page 4]              4
[page 1, page 3, page 4]      12

And what I need to do is split the first column at the commas and create enough new columns to cover the longest sequence. The result should be:

First Page      Second Page  Third Page     Count
page 1          page 2       page 3         23
page 2          page 4       null           4
page 1          page 3       page 4         12

I'm fine if the null is a zero-length string, and I can handle stripping off the brackets.

Upvotes: 6

Views: 8639

Answers (3)

G. Grothendieck
G. Grothendieck

Reputation: 269526

read.table with fill=TRUE can fill them in. The names(DF2)<- line could be omitted if nice column names are not important. No packages are used.

# test data

Lines <- "pages                         count
[page 1, page 2, page 3]      23
[page 2, page 4]              4
[page 1, page 3, page 4]      12"

# code - replace text=Lines with something like "myfile.dat"

DF <- read.table(text = Lines, skip = 1, sep = "]", as.is = TRUE)
DF2 <- read.table(text = DF[[1]], sep = ",", fill = TRUE, as.is = TRUE)
names(DF2) <- paste0(read.table(text = Lines, nrow = 1, as.is = TRUE)[[1]], seq_along(DF2))
DF2$count <- DF[[2]]
DF2[[1]] <- sub(".", "", DF2[[1]]) # remove [

which gives this:

> DF2
  pages1  pages2  pages3 count
1 page 1  page 2  page 3    23
2 page 2  page 4             4
3 page 1  page 3  page 4    12

Note: This gives columns headings of page1, page2, etc. If it were important to have precisely the column headings shown in the question then replace that line with this which uses those headings if there are less than 20 page columns.

 ord <- c('First', 'Second', 'Third', 'Fourth', 'Fifth', 'Sixth', 'Seventh',
 'Eighth', 'Ninth', 'Tenth', 'Eleventh', 'Twelfth', 'Thirteenth',
 'Fourteenth', 'Fiftheenth', 'Sixteenth', 'Seventeenth', 'Eighteenth', 
 'Nineteenth')
ix <- seq_along(DF2)
names(DF2) <- if (ncol(DF2) < 20) paste(ord[ix], "Page") else paste("Page", ix)

Upvotes: 2

A5C1D2H2I1M1N2O1R2T1
A5C1D2H2I1M1N2O1R2T1

Reputation: 193517

My "splitstackshape" package has a function that addresses this kind of problem. The relevant function in this case is concat.split and works as follows (using "myDat" from Ricardo's answer):

# Get rid of "[" and "]" from your "pages" variable
myDat$pages <- gsub("\\[|\\]", "", myDat$pages)
# Specify the source data.frame, the variable that needs to be split up
#   and whether to drop the original variable or not
library(splitstackshape)
concat.split(myDat, "pages", ",", drop = TRUE)
#   count pages_1 pages_2 pages_3
# 1    23  page 1  page 2  page 3
# 2     4  page 2  page 4        
# 3    12  page 1  page 3  page 4

Upvotes: 4

Ricardo Saporta
Ricardo Saporta

Reputation: 55350

sample data

myDat <- read.table(text=
  "pages|count
[page 1, page 2, page 3]|23
[page 2, page 4]|4
[page 1, page 3, page 4]|12", header=TRUE, sep="|") 

We can pull pages out of myDat to work on it.

# if factors, convert to characters
pages <- as.character(myDat$page)

# remove brackets.  Note the double-escape's in R
pages <- gsub("(\\[|\\])", "", pages)

# split on comma
pages <- strsplit(pages, ",")

# find the largest element
maxLen <- max(sapply(pages, length))

# fill in any blanks. The t() is to transpose the return from sapply
pages <- 
t(sapply(pages, function(x)
      # append to x, NA's.  Note that if (0 == (maxLen - length(x))), then no NA's are appended 
      c(x, rep(NA, maxLen - length(x)))
  ))

# add column names as necessary
colnames(pages) <- paste(c("First", "Second", "Third"), "Page")

# Put it all back together
data.frame(pages, Count=myDat$count)



Results

> data.frame(pages, Count=myDat$count)
  First.Page Second.Page Third.Page Count
1     page 1      page 2     page 3    23
2     page 2      page 4       <NA>     4
3     page 1      page 3     page 4    12

Upvotes: 3

Related Questions