Reputation: 251
I have a dataset that has widths at every pixel position along a central skeleton. The width is output as a single string that is comma delimited.
cukeDatatest <- read.delim("https://gist.githubusercontent.com/bhive01/e7508f552db0415fec1749d0a390c8e5/raw/a12386d43c936c2f73d550dfdaecb8e453d19cfe/widthtest.tsv")
str(cukeDatatest) # or dplyr::glimpse(cukeDatatest)
I need to keep the File and FruitNum identifiers with the widths.
The output I want has three columns File, FruitNum, ObjectWidth, but File and FruitNum are repeated for the length of ObjectWidth for that fruit. Position is important so sorting these vectors would be really bad. Also, every fruit is a different length (if that matters for your method).
I've used str_split() before to dissect a few elements from a string, but never something this large, nor so many of them (I have 8000 of them). Processing time is a concern, but would wait for correct result.
I'm more used to dplyr than data.table, but I see that there are some efforts from Arun in this: R split text string in a data.table columns
Upvotes: 5
Views: 941
Reputation: 43334
A Hadleyverse option, with some sensible type-conversion tacked on:
library(dplyr)
library(tidyr)
cukeDatatest %>%
# split ObjectWidth into a nested column containing a vector
mutate(ObjectWidth = strsplit(as.character(.$ObjectWidth), ',')) %>%
# unnest nested column, melting data to long form
unnest() %>%
# convert data to integer
mutate(ObjectWidth = as.integer(ObjectWidth))
# Source: local data frame [39,830 x 3]
#
# File FruitNum ObjectWidth
# (fctr) (int) (int)
# 1 IMG_7888.JPGcolcorrected.jpg 1 4
# 2 IMG_7888.JPGcolcorrected.jpg 1 10
# 3 IMG_7888.JPGcolcorrected.jpg 1 14
# 4 IMG_7888.JPGcolcorrected.jpg 1 15
# 5 IMG_7888.JPGcolcorrected.jpg 1 22
# 6 IMG_7888.JPGcolcorrected.jpg 1 26
# 7 IMG_7888.JPGcolcorrected.jpg 1 26
# 8 IMG_7888.JPGcolcorrected.jpg 1 28
# 9 IMG_7888.JPGcolcorrected.jpg 1 34
# 10 IMG_7888.JPGcolcorrected.jpg 1 35
# .. ... ... ...
Here's an equivalent version with a more typical tidyr
approach. One issue with this approach here is the irregular number of terms in ObjectWidth
makes making column names somewhat difficult, as separate
annoyingly contains no defaults for its into
parameter.
A simple workaround here is to deliberately create more columns than you need (the rest of which will be filled with NA
s, which will be subsequently deleted by gather
). While less than ideally efficient, the code still runs instantaneously, so it's not enough of a performance hit to matter. If it bugs you, figure out the length of the longest row with max(sapply(strsplit(as.character(cukeDatatest$ObjectWidth), ','), length))
.
cukeDatatest %>%
# tbl_df conversion is unnecessary, but nice for printing purposes
tbl_df() %>%
# split ObjectWidth on commas into individual columns
separate(ObjectWidth, into = paste0('X', 1:2500),
sep = ',', fill = 'right', convert = TRUE) %>%
# gather into long form
gather(var, ObjectWidth, starts_with('X'), na.rm = TRUE) %>%
# remove key column identifying term number within initial ObjectWidth string
select(-var)
If you have a fixed number of terms in each ObjectWidth
string, plain old read.csv
called upon ObjectWidth
pasted together is a good way to go. read.csv
estimates the number of columns from the first five rows, which is fine if the number is constant.
If that doesn't work (like for this data, where the longest row is the seventh), you face the same names issue as above, which can be sorted by offering col.names
a set of names of the proper length. The same workaround as above works here, too, if necessary.
read.csv(text = paste(as.character(cukeDatatest$ObjectWidth), collapse = '\n'),
header = FALSE, col.names = paste0('V', 1:2179)) %>%
bind_cols(cukeDatatest[,-3]) %>%
gather(var, ObjectWidth, starts_with('V'), na.rm = TRUE) %>%
select(-var)
Both approaches return a tbl_df exactly equivalent to the original approach above.
Upvotes: 4
Reputation: 49448
I'd normally start with simple strsplit
:
dt[, strsplit(ObjectWidth, ",", fixed = T)[[1]], by = .(File, FruitNum)]
If that's too slow, I'd run strsplit
on the entire column and then rearrange the data to my liking:
l = strsplit(dt$ObjectWidth, ",", fixed = T)
dt[inverse.rle(list(lengths = lengths(l), values = seq_along(l))),
.(File, FruitNum)][, col := unlist(l)][]
Upvotes: 4
Reputation: 56054
Using splitstackshape
package
library(splitstackshape)
res <- cSplit(cukeDatatest, splitCols = "ObjectWidth", sep = ",", direction = "long")
# result
head(res)
# File FruitNum ObjectWidth
# 1: IMG_7888.JPGcolcorrected.jpg 1 4
# 2: IMG_7888.JPGcolcorrected.jpg 1 10
# 3: IMG_7888.JPGcolcorrected.jpg 1 14
# 4: IMG_7888.JPGcolcorrected.jpg 1 15
# 5: IMG_7888.JPGcolcorrected.jpg 1 22
# 6: IMG_7888.JPGcolcorrected.jpg 1 26
Upvotes: 5