Farrel
Farrel

Reputation: 10444

Split a column of strings into variable number of columns using data.table in R

I want to analyze many years of of Quicken home finance records. I exported a file to qif and used bank2csv program to render a csv. Within Quicken one can use a category (eg automobile, tax), subcategories (eg automobile:service, automobile:fuel) and tags (eg self, spouse, son). bank2csv renders the categories:subcategories/tag as a concatenated string. I want to instead put the category in a category column, subcategory in a subcategory column and put whatever tags in the tag column. I saw a similar question but alas that worked bystrsplit then unlist and then indexing each element so that it could be written to the correct place by assignment. That will not work here since sometimes there is no tag and sometimes there is no subcategory. It is quite easy to split the string into a list and save that list in a column but how on earth does one assign the first element of the list to one column and the second element (if it exists) to a second column. Surely there is an elegant easy way.

simplified sample

library(data.table)
library(stringi)
dt <- data.table(category.tag=c("toys/David", "toys/David", "toys/James", "toys", "toys", "toys/James"), transaction=1:6)

How do I create a third and fourth column: category, tag. Some of tag would be NA

I can do the following but it does not get me very far. I need a way to specify the first or the second element of the resultant list (as opposed to the whole list)

dt[, category:= strsplit(x = category.tag, split = "/") ]

Upvotes: 3

Views: 1963

Answers (4)

Arun
Arun

Reputation: 118789

Just pushed two functions transpose() and tstrsplit() in data.table v1.9.5.

With this we can do:

require(data.table)
dt[, c("category", "tag") := tstrsplit(category.tag, "/", fixed=TRUE)]
#    category.tag transaction category   tag
# 1:   toys/David           1     toys David
# 2:   toys/David           2     toys David
# 3:   toys/James           3     toys James
# 4:         toys           4     toys    NA
# 5:         toys           5     toys    NA
# 6:   toys/James           6     toys James

tstrsplit is a wrapper for transpose(strsplit(as.character(x), ...)). And you can also pass fill=. to fill missing values with any other value than NA.

transpose() can also be used on lists, data frames and data tables.

Upvotes: 6

G. Grothendieck
G. Grothendieck

Reputation: 269491

1) Try read.table

read <- function(x) read.table(text = x, sep  = "/", fill = TRUE, na.strings = "")
dt[, c("category", "tag") := read(category.tag)]

No extra packages are needed.

2) An alternative is to use separate in the tidyr package:

library(tidyr)
separate(dt, category.tag, c("category", "tag"), extra = "drop")

The above is with tidyr version 0.1.0.9000 from github. To install it ensure that the devtools R package is installed and issue the command: devtools::install_github("hadley/tidyr") .

Update: Incorporated Richard Scrivens' comment and minor improvements. Added tidyr solution.

Upvotes: 2

A5C1D2H2I1M1N2O1R2T1
A5C1D2H2I1M1N2O1R2T1

Reputation: 193517

Since you already have "stringi" loaded, you can also look at stri_split_fixed and the simplify argument:

setnames(cbind(dt, stri_split_fixed(dt$category.tag, "/", simplify = TRUE)), 
         c("V1", "V2"), c("category", "tag"))[]
#    category.tag transaction category   tag
# 1:   toys/David           1     toys David
# 2:   toys/David           2     toys David
# 3:   toys/James           3     toys James
# 4:         toys           4     toys    NA
# 5:         toys           5     toys    NA
# 6:   toys/James           6     toys James

Though I must admit I'm partial to cSplit :-)

Upvotes: 1

Rich Scriven
Rich Scriven

Reputation: 99331

You could use cSplit

library(splitstackshape)
dt[, c("category", "tag") := cSplit(dt[,.(category.tag)], "category.tag", "/")]
dt
#    category.tag transaction category   tag
# 1:   toys/David           1     toys David
# 2:   toys/David           2     toys David
# 3:   toys/James           3     toys James
# 4:         toys           4     toys    NA
# 5:         toys           5     toys    NA
# 6:   toys/James           6     toys James

Upvotes: 4

Related Questions