bhive01

Reputation: 251

Split string and transpose result

I have a dataset with widths at every pixel position along a central skeleton. The width is output as a single comma-delimited string.

cukeDatatest <- read.delim("https://gist.githubusercontent.com/bhive01/e7508f552db0415fec1749d0a390c8e5/raw/a12386d43c936c2f73d550dfdaecb8e453d19cfe/widthtest.tsv")
str(cukeDatatest) # or dplyr::glimpse(cukeDatatest)

I need to keep the File and FruitNum identifiers with the widths.

The output I want has three columns, File, FruitNum, and ObjectWidth, with File and FruitNum repeated for the length of ObjectWidth for that fruit. Position is important, so sorting these vectors would be really bad. Also, every fruit is a different length (if that matters for your method).

I've used str_split() before to dissect a few elements from a string, but never anything this large, nor this many of them (I have 8000). Processing time is a concern, but I would wait for a correct result.

I'm more used to dplyr than data.table, but I see that there are some efforts from Arun in this: R split text string in a data.table columns
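
For concreteness, a rough base-R sketch of the shape I'm after (strsplit() plus rep(); probably clunky at 8000 rows, which is why I'm asking for better approaches):

# rough illustration of the target output shape (base R)
widths <- strsplit(as.character(cukeDatatest$ObjectWidth), ",")
data.frame(File        = rep(cukeDatatest$File, lengths(widths)),
           FruitNum    = rep(cukeDatatest$FruitNum, lengths(widths)),
           ObjectWidth = as.integer(unlist(widths)))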

Upvotes: 5

Views: 941

Answers (3)

alistaire

Reputation: 43334

A Hadleyverse option, with some sensible type-conversion tacked on:

library(dplyr)
library(tidyr)

cukeDatatest %>% 
    # split ObjectWidth into a nested column containing a vector
    mutate(ObjectWidth = strsplit(as.character(.$ObjectWidth), ',')) %>% 
    # unnest nested column, melting data to long form
    unnest() %>% 
    # convert data to integer
    mutate(ObjectWidth = as.integer(ObjectWidth))

# Source: local data frame [39,830 x 3]
# 
#                            File FruitNum ObjectWidth
#                          (fctr)    (int)       (int)
# 1  IMG_7888.JPGcolcorrected.jpg        1           4
# 2  IMG_7888.JPGcolcorrected.jpg        1          10
# 3  IMG_7888.JPGcolcorrected.jpg        1          14
# 4  IMG_7888.JPGcolcorrected.jpg        1          15
# 5  IMG_7888.JPGcolcorrected.jpg        1          22
# 6  IMG_7888.JPGcolcorrected.jpg        1          26
# 7  IMG_7888.JPGcolcorrected.jpg        1          26
# 8  IMG_7888.JPGcolcorrected.jpg        1          28
# 9  IMG_7888.JPGcolcorrected.jpg        1          34
# 10 IMG_7888.JPGcolcorrected.jpg        1          35
# ..                          ...      ...         ...
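
With more recent tidyr releases (1.0 and later), unnest() prefers the list-column to be named explicitly; a sketch of the same pipeline under that assumption:

cukeDatatest %>% 
    mutate(ObjectWidth = strsplit(as.character(ObjectWidth), ',')) %>% 
    # cols is the explicit argument expected by tidyr >= 1.0
    unnest(cols = ObjectWidth) %>% 
    mutate(ObjectWidth = as.integer(ObjectWidth))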

Edit

Here's an equivalent version with a more typical tidyr approach. One issue here is that the irregular number of terms in ObjectWidth makes naming the columns somewhat awkward, as separate annoyingly has no default for its into parameter.

A simple workaround here is to deliberately create more columns than you need (the rest of which are filled with NAs and subsequently dropped by gather). While not ideally efficient, the code still runs instantaneously, so it's not enough of a performance hit to matter. If it bugs you, figure out the length of the longest row with max(sapply(strsplit(as.character(cukeDatatest$ObjectWidth), ','), length)); a sketch using that value follows the code below.

cukeDatatest %>%
    # tbl_df conversion is unnecessary, but nice for printing purposes
    tbl_df() %>%
    # split ObjectWidth on commas into individual columns
    separate(ObjectWidth, into = paste0('X', 1:2500), 
        sep = ',', fill = 'right', convert = TRUE) %>% 
    # gather into long form
    gather(var, ObjectWidth, starts_with('X'), na.rm = TRUE) %>% 
    # remove key column identifying term number within initial ObjectWidth string
    select(-var)
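
If you'd rather not guess at 2500, a sketch of the same pipeline that computes the exact column count first (using the max-length expression above):

# exact number of comma-separated terms in the longest row
n_cols <- max(sapply(strsplit(as.character(cukeDatatest$ObjectWidth), ','), length))

cukeDatatest %>%
    tbl_df() %>%
    separate(ObjectWidth, into = paste0('X', seq_len(n_cols)), 
        sep = ',', fill = 'right', convert = TRUE) %>% 
    gather(var, ObjectWidth, starts_with('X'), na.rm = TRUE) %>% 
    select(-var)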

If you have a fixed number of terms in each ObjectWidth string, plain old read.csv called on the ObjectWidth strings pasted together is a good way to go. read.csv estimates the number of columns from the first five rows, which is fine if that number is constant.

If that doesn't work (like for this data, where the longest row is the seventh), you face the same names issue as above, which can be sorted by offering col.names a set of names of the proper length. The same workaround as above works here, too, if necessary.

read.csv(text = paste(as.character(cukeDatatest$ObjectWidth), collapse = '\n'), 
         header = FALSE, col.names = paste0('V', 1:2179)) %>% 
    bind_cols(cukeDatatest[,-3]) %>% 
    gather(var, ObjectWidth, starts_with('V'), na.rm = TRUE) %>% 
    select(-var)

Both approaches return a tbl_df exactly equivalent to the original approach above.

Upvotes: 4

eddi

Reputation: 49448

I'd normally start with simple strsplit:

dt[, strsplit(ObjectWidth, ",", fixed = T)[[1]], by = .(File, FruitNum)]

If that's too slow, I'd run strsplit on the entire column and then rearrange the data to my liking:

l = strsplit(dt$ObjectWidth, ",", fixed = T)

dt[inverse.rle(list(lengths = lengths(l), values = seq_along(l))),
   .(File, FruitNum)][, col := unlist(l)][]
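
Both snippets assume dt is a data.table copy of the imported data with ObjectWidth held as character; a minimal sketch of that setup:

library(data.table)

# assumed setup: strsplit() needs character input, not a factor
dt <- as.data.table(cukeDatatest)
dt[, ObjectWidth := as.character(ObjectWidth)]

The split values come back as character (in V1 for the first snippet, col for the second), so an as.integer() conversion may still be wanted afterwards.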

Upvotes: 4

zx8754

Reputation: 56054

Using the splitstackshape package:

library(splitstackshape)
res <- cSplit(cukeDatatest, splitCols = "ObjectWidth", sep = ",", direction = "long")

# result
head(res)
#                            File FruitNum ObjectWidth
# 1: IMG_7888.JPGcolcorrected.jpg        1           4
# 2: IMG_7888.JPGcolcorrected.jpg        1          10
# 3: IMG_7888.JPGcolcorrected.jpg        1          14
# 4: IMG_7888.JPGcolcorrected.jpg        1          15
# 5: IMG_7888.JPGcolcorrected.jpg        1          22
# 6: IMG_7888.JPGcolcorrected.jpg        1          26
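
As a quick sanity check (not part of the original answer), the long result should have one row per comma-separated width in the input:

# should be TRUE if every width ended up on its own row
nrow(res) == sum(lengths(strsplit(as.character(cukeDatatest$ObjectWidth), ",")))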

Upvotes: 5
