How can this code be compacted?

Question

Can the following code be made more "R like"?

Given data.frame inDF:

    V1         V2       V3        V4  
1   a          ha       1;2;3     A
2   c          hb       4         B
3   d          hc       5;6       C
4   f          hd       7         D

In inDF I want to:

find all rows whose "V3" column holds multiple values separated by ";",
replicate each such row as many times as there are individual values in its "V3" column,
and give each replicated row exactly one of the original values in its "V3" column.

In short, the output data.frame (outDF) should look like:

    V1         V2       V3        V4  
1   a          ha       1         A
1   a          ha       2         A
1   a          ha       3         A
2   c          hb       4         B
3   d          hc       5         C
3   d          hc       6         C
4   f          hd       7         D

To get from inDF to outDF, I currently use the following code:

# str_count() and str_split() used below come from the stringr package
library(stringr)

#load inDF from csv file
inDF <- read.csv(file='example.csv', header=FALSE, sep=",", fill=TRUE)

#search in inDF, on the V3 column, all the cells with multiple values
rowlist <- grep(";", inDF[,3])

# create empty data.frame and copy the column names from inDF
xDF <- data.frame(matrix(0, nrow=0, ncol=4))
colnames(xDF)=colnames(inDF)

#take every row from the inDF data.frame which has multiple values in col3 and break it in several rows with only one value

for(i in rowlist[])
{ 
  #count the number of individual values in one cell
  value_nr <- str_count(inDF[i,3], ";"); value_nr <- value_nr+1

  # replicate each row a number of times equal with its value number, and transform it to character
  extracted_inDF <- inDF[rep(i, times=value_nr[]),]
  extracted_inDF <- data.frame(lapply(extracted_inDF, as.character), stringsAsFactors=FALSE)

  # split the values in V3 cell in individual values, place them in a list
  value_ls <- str_split(inDF[i, 3], ";")

  #initialize f, to use it later to increment both row number and element in the list of values
  f = 1

  # replace the multiple values with individual values
  for(j in extracted_inDF[,3])

    {
    extracted_inDF[f,3] <- value_ls[[1]][as.integer(f)]
    f <- f+1
  }

  #put all the "demultiplied" rows in xDF
  xDF <- merge(extracted_inDF[], xDF[], all=TRUE)
}

# delete the rows with multiple values from the inDF
inDF <- inDF[-rowlist[],]

#create outDF
outDF <- merge(inDF, xDF, all=TRUE)

Could you please

A5C1D2H2I1M1N2O1R2T1 · Accepted Answer

I'm not sure that I'm one to speak about whether you are using R in the "right" or "wrong" way... I mostly just use it to answer questions on Stack Overflow. :-)

However, there are many ways in which your code could be improved. For starters, YES, you should try to become familiar with the predefined functions. They will often be much more efficient, and will make your code much more transparent to other users of the same language. Despite your concise description of what you wanted to achieve, and my knowing an answer virtually right away, I found your code daunting to look through.

I would break up your problem into two main pieces: (1) splitting up the data and (2) recombining it with your original dataset.

For part 1: You obviously know some of the functions you need--or at least the main one you need: strsplit. If you use strsplit, you'll see that it returns a list, but you need a simple vector. How do you get there? Look for unlist. The first part of your problem is now solved.
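As a minimal sketch of that first step, here is strsplit followed by unlist on a standalone vector. The vector `v3` is just a stand-in I typed by hand from the question's "V3" column, not the real data.

```r
# Stand-in for the V3 column (values copied from the question)
v3 <- c("1;2;3", "4", "5;6", "7")

# strsplit() returns a list with one character vector per input element
pieces <- strsplit(v3, ";", fixed = TRUE)
str(pieces)

# unlist() flattens that list into a single character vector
flat <- unlist(pieces)
flat
```

Note that the values stay character; wrap the result in as.integer() if you need numbers.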

For part 2: You first need to determine how many times you need to replicate each row of your original dataset. For this, you drill through your list (for example, with l/s/v-apply) and count each item's length. I picked sapply since I knew it would create a vector that I could use with rep.

Then, if you've played with data.frames enough, particularly with extracting data, you would have come to realize that mydf[c(1, 1, 1, 2), ] will result in a data.frame where the first row is repeated two additional times. Knowing this, we can use the length calculation we just made to "expand" our original data.frame.
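To make the indexing trick concrete, here is a small sketch on a made-up two-row data.frame. The names `toy`, `id`, `val` and `counts` are only for illustration and do not come from your data.

```r
# Made-up data.frame, purely for illustration
toy <- data.frame(id = c("x", "y"), val = c(10, 20), stringsAsFactors = FALSE)

# Repeating a row number in the index repeats that row:
# row 1 appears three times, row 2 once
toy[c(1, 1, 1, 2), ]

# The same index can be built from per-row counts with rep()
counts <- c(3, 1)
idx <- rep(seq_len(nrow(toy)), counts)
expanded <- toy[idx, ]
expanded
```

R keeps the row names unique by appending suffixes to the repeated rows, which is why the expanded result shows names such as 1.1 and 1.2.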

Finally, with that expanded data.frame, we just need to replace the relevant column with the unlisted values.

Here is the above in action. I've named your dataset "mydf":

V3 <- strsplit(as.character(mydf$V3), ";", fixed=TRUE)  ## as.character() in case V3 was read in as a factor
sapply(V3, length)    ## How many times to repeat each row?
# [1] 3 1 2 1
## ^^ Use that along with `[` to "expand" your data.frame
mydf2 <- mydf[rep(seq_along(V3), sapply(V3, length)), ]
mydf2$V3 <- unlist(V3)
mydf2
#     V1 V2 V3 V4
# 1    a ha  1  A
# 1.1  a ha  2  A
# 1.2  a ha  3  A
# 2    c hb  4  B
# 3    d hc  5  C
# 3.1  d hc  6  C
# 4    f hd  7  D
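If you need this more than once, the same steps could be wrapped in a small helper. This is only a sketch: the function name `expand_rows` and its arguments are my own invention, `mydf` is rebuilt by hand from the question, and I use lengths() (available in R 3.2.0 and later) as a shorthand for sapply(V3, length).

```r
# Hypothetical helper: one output row per separated value in column `col`
expand_rows <- function(df, col, sep = ";") {
  # as.character() guards against the column having been read in as a factor
  pieces <- strsplit(as.character(df[[col]]), sep, fixed = TRUE)
  # Repeat each row once per piece; an empty string yields zero pieces,
  # so such a row would be dropped
  out <- df[rep(seq_len(nrow(df)), lengths(pieces)), , drop = FALSE]
  out[[col]] <- unlist(pieces)
  # Replace row names like 1.1, 1.2 with a plain 1..n sequence
  rownames(out) <- NULL
  out
}

# The question's input, typed in by hand
mydf <- data.frame(V1 = c("a", "c", "d", "f"),
                   V2 = c("ha", "hb", "hc", "hd"),
                   V3 = c("1;2;3", "4", "5;6", "7"),
                   V4 = c("A", "B", "C", "D"),
                   stringsAsFactors = FALSE)

res <- expand_rows(mydf, "V3")
res
```

Resetting the row names is optional; it just avoids the 1.1, 1.2 style names shown in the output above.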

To share some more options...

The "data.table" package can actually be pretty useful for something like this.

library(data.table)
DT <- data.table(mydf)
DT2 <- DT[, list(new = unlist(strsplit(as.character(V3), ";", fixed = TRUE))), by = V1]
merge(DT, DT2, by = "V1")

Alternatively, concat.split.multiple from my "splitstackshape" package pretty much does it in one step, but if you want your exact output, you'll need to drop the NA values and reorder the rows.

library(splitstackshape)
df2 <- concat.split.multiple(mydf, split.cols="V3", seps=";", direction="long")
df2 <- df2[complete.cases(df2), ]   ## Optional, perhaps
df2[order(df2$V1), ]                ## Optional, perhaps
