CLM

Reputation: 119

How can this code be compacted?

Can the following code be made more "R like"?

Given data.frame inDF:

    V1         V2       V3        V4  
1   a          ha       1;2;3     A
2   c          hb       4         B
3   d          hc       5;6       C
4   f          hd       7         D

Inside inDF I want to:

  1. find all rows whose "V3" column holds multiple values separated by ";",
  2. replicate each such row as many times as there are individual values in its "V3" column,
  3. and give each replicated row exactly one of the initial values in its "V3" column.

In short, the output data.frame (= outDF) should look like:

    V1         V2       V3        V4  
1   a          ha       1         A
1   a          ha       2         A
1   a          ha       3         A
2   c          hb       4         B
3   d          hc       5         C
3   d          hc       6         C
4   f          hd       7         D

So, if from inDF I want to get to outDF, I would write the following code:

# str_count() and str_split() used below come from the stringr package
library(stringr)

#load inDF from csv file
inDF <- read.csv(file='example.csv', header=FALSE, sep=",", fill=TRUE)  

#search in inDF, on the V3 column, all the cells with multiple values
rowlist <- grep(";", inDF[,3])

# create empty data.frame and copy the column names from inDF
xDF <- data.frame(matrix(0, nrow=0, ncol=4))
colnames(xDF)=colnames(inDF)

#take every row from the inDF data.frame which has multiple values in col3 and break it in several rows with only one value

for(i in rowlist[])
{ 
  #count the number of individual values in one cell
  value_nr <- str_count(inDF[i,3], ";"); value_nr <- value_nr+1

  # replicate each row a number of times equal with its value number, and transform it to character
  extracted_inDF <- inDF[rep(i, times=value_nr[]),]
  extracted_inDF <- data.frame(lapply(extracted_inDF, as.character), stringsAsFactors=FALSE)

  # split the values in V3 cell in individual values, place them in a list
  value_ls <- str_split(inDF[i, 3], ";")

  #initialize f, to use it later to increment both row number and element in the list of values
  f = 1

  # replace the multiple values with individual values
  for(j in extracted_inDF[,3])

    {
    extracted_inDF[f,3] <- value_ls[[1]][as.integer(f)]
    f <- f+1
  }

  #put all the "demultiplied" rows in xDF
  xDF <- merge(extracted_inDF[], xDF[], all=TRUE)
}

# delete the rows with multiple values from the inDF
inDF <- inDF[-rowlist[],]

#create outDF
outDF <- merge(inDF, xDF, all=TRUE)

Could you please

Upvotes: 2

Views: 126

Answers (2)

A5C1D2H2I1M1N2O1R2T1

Reputation: 193667

I'm not sure that I'm one to speak about whether you are using R in the "right" or "wrong" way... I mostly just use it to answer questions on Stack Overflow. :-)

However, there are many ways in which your code could be improved. For starters, YES, you should try to become familiar with the predefined functions. They will often be much more efficient, and will make your code much more transparent to other users of the same language. Despite your concise description of what you wanted to achieve, and my knowing an answer virtually right away, I found your code daunting to look through.

I would break up your problem into two main pieces: (1) splitting up the data and (2) recombining it with your original dataset.

For part 1: You obviously know some of the functions you need--or at least the main one you need: strsplit. If you use strsplit, you'll see that it returns a list, but you need a simple vector. How do you get there? Look for unlist. The first part of your problem is now solved.
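As a small illustration of that first step, here is a sketch on a stand-alone vector. The name `v3` and its contents are made up to mirror the "V3" column in the question; they are not taken from your actual file.

```r
# Hypothetical input mirroring the V3 column from the question
v3 <- c("1;2;3", "4", "5;6", "7")

# strsplit() returns a list with one character vector per input element
parts <- strsplit(v3, ";", fixed = TRUE)
str(parts)

# unlist() flattens that list into a single character vector
flat <- unlist(parts)
flat
```

Note that the flattened values are still character strings; wrap them in as.numeric() if you need numbers.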

For part 2: You first need to determine how many times you need to replicate each row of your original dataset. For this, you drill through your list (for example, with l/s/v-apply) and count each item's length. I picked sapply since I knew it would create a vector that I could use with rep.

Then, if you've played with data.frames enough, particularly with extracting data, you would have come to realize that mydf[c(1, 1, 1, 2), ] will result in a data.frame where the first row is repeated two additional times. Knowing this, we can use the length calculation we just made to "expand" our original data.frame.
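To make the indexing trick concrete, here is a minimal sketch on a toy data.frame. The names `toy`, `counts`, `idx`, and `expanded` are only for illustration, and the counts are typed in by hand rather than computed.

```r
# Toy data.frame (hypothetical), shaped like the one in the question
toy <- data.frame(V1 = c("a", "c"), V3 = c("1;2;3", "4"),
                  stringsAsFactors = FALSE)

# Repeating a row index in `[` repeats the corresponding row
toy[c(1, 1, 1, 2), ]

# rep() can build that index vector from per-row repeat counts
counts <- c(3, 1)
idx <- rep(seq_len(nrow(toy)), times = counts)
idx

expanded <- toy[idx, ]
expanded
```

R makes the repeated row names unique by appending suffixes, which is where names such as "1.1" and "1.2" come from.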

Finally, with that expanded data.frame, we just need to replace the relevant column with the unlisted values.


Here is the above in action. I've named your dataset "mydf":

V3 <- strsplit(as.character(mydf$V3), ";", fixed=TRUE)  ## as.character() in case V3 was read in as a factor
sapply(V3, length)    ## How many times to repeat each row?
# [1] 3 1 2 1
## ^^ Use that along with `[` to "expand" your data.frame
mydf2 <- mydf[rep(seq_along(V3), sapply(V3, length)), ]
mydf2$V3 <- unlist(V3)
mydf2
#     V1 V2 V3 V4
# 1    a ha  1  A
# 1.1  a ha  2  A
# 1.2  a ha  3  A
# 2    c hb  4  B
# 3    d hc  5  C
# 3.1  d hc  6  C
# 4    f hd  7  D

To share some more options...

The "data.table" package can actually be pretty useful for something like this.

library(data.table)
DT <- data.table(mydf)
DT2 <- DT[, list(new = unlist(strsplit(as.character(V3), ";", fixed = TRUE))), by = V1]
merge(DT, DT2, by = "V1")

Alternatively, concat.split.multiple from my "splitstackshape" package pretty much does it in one step, but if you want your exact output, you'll need to drop the NA values and reorder the rows.

library(splitstackshape)
df2 <- concat.split.multiple(mydf, split.cols="V3", seps=";", direction="long")
df2 <- df2[complete.cases(df2), ]   ## Optional, perhaps
df2[order(df2$V1), ]                ## Optional, perhaps

Upvotes: 3

josliber

Reputation: 44340

In this case, you can use the split-apply-combine paradigm for reshaping the data.

You want to split inDF by its rows, since you want to operate on each row separately. I've used the split function here to split it up by row:

spl = split(inDF, 1:nrow(inDF))

spl is a list that contains a 1-row data frame for each row in inDF.
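If you want to see what the split step produces, the sketch below runs it on a two-row stand-in. `inDF_demo` and `spl_demo` are hypothetical names, and I've used seq_len() instead of 1:nrow() only because it also behaves sensibly on a zero-row data frame.

```r
# Two-row stand-in for inDF (hypothetical data)
inDF_demo <- data.frame(V1 = c("a", "c"), V2 = c("ha", "hb"),
                        V3 = c("1;2;3", "4"), V4 = c("A", "B"),
                        stringsAsFactors = FALSE)

# One list element per row, named after the row number
spl_demo <- split(inDF_demo, seq_len(nrow(inDF_demo)))
names(spl_demo)

# Each element is itself a 1-row data frame
spl_demo[["1"]]
```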

Next, you'll want to apply a function to transform the split up data into the final format you need. Here, I'll use the lapply function to transform the 1-row data frames, using strsplit to break up the variable V3 into its appropriate parts:

transformed = lapply(spl, function(x) {
  data.frame(V1=x$V1, V2=x$V2, V3=strsplit(as.character(x$V3), ";")[[1]], V4=x$V4)
})

transformed is now a list where the first element is a 3-row data frame, the third element is a 2-row data frame, and the second and fourth are 1-row data frames.

The last step is to combine this list into outDF, using do.call with the rbind function. That has the same effect as calling rbind with all of the elements of the transformed list as its arguments.
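In case do.call is unfamiliar, here is a minimal sketch of that equivalence on a made-up list. `pieces`, `combined`, and `by_hand` are illustrative names only.

```r
# A small list of data frames (hypothetical data)
pieces <- list(
  data.frame(id = 1, value = c("x", "y"), stringsAsFactors = FALSE),
  data.frame(id = 2, value = "z", stringsAsFactors = FALSE)
)

# do.call() passes the list elements to rbind() as separate arguments
combined <- do.call(rbind, pieces)

# Writing the call out by hand gives the same data frame
by_hand <- rbind(pieces[[1]], pieces[[2]])

combined
all.equal(combined, by_hand)
```

The advantage of do.call is that it works no matter how many elements the list has, so you never have to spell them out.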

outDF = do.call(rbind, transformed)

This yields the desired final data frame:

outDF
#     V1 V2 V3 V4
# 1.1  a ha  1  A
# 1.2  a ha  2  A
# 1.3  a ha  3  A
# 2    c hb  4  B
# 3.1  d hc  5  C
# 3.2  d hc  6  C
# 4    f hd  7  D

Upvotes: 2
