Reputation: 13113

merge strings among rows by id

I wish to merge strings among rows by an id variable. I know how to do that with the R code below. However, my code seems vastly overly complex.

In the present case each string has two elements that are not dots. Each pair of consecutive rows within an id have one element in common. So, only one of those elements remains after the two rows are merged.

The desired result is shown and the R code below returns the desired result. Thank you for any suggestions. Sorry my R code is so long and convoluted, but it does work and my goal is to obtain more efficient code in base R.

my.data <- read.table(text = '
     id         my.string
      2    11..................
      2    .1...2..............
      2    .....2...3..........
      5    ....................
      6    ......2.....2.......
      6    ............2...4...
      7    .1...2..............
      7    .....2....3.........
      7    ..........3..3......
      7    .............34.....
      8    ....1.....1.........
      8    ..........12........
      8    ...........2....3...
      9    ..................44
     10    .2.......2..........
     11    ...2...2............
     11    .......2.....2......
     11    .............2...2..
', header = TRUE, na.strings = 'NA', stringsAsFactors = FALSE)
my.data

desired.result <- read.table(text = '
     id         my.string
      2    11...2...3..........
      5    ....................
      6    ......2.....2...4...
      7    .1...2....3..34.....
      8    ....1.....12....3...
      9    ..................44
     10    .2.......2..........
     11    ...2...2.....2...2..
', header = TRUE, na.strings = 'NA', stringsAsFactors = FALSE)

# obtain position of first and last non-dot
# from: http://stackoverflow.com/questions/29229333/position-of-first-and-last-non-dot-in-a-string-with-regex

first.last.dot <- data.frame(my.data, do.call(rbind, gregexpr("^\\.*\\K[^.]|[^.](?=\\.*$)", my.data[,2], perl=TRUE)))

# obtain non-dot elements
first.last.dot$first.element <- as.numeric(substr(first.last.dot$my.string, first.last.dot$X1, first.last.dot$X1))
first.last.dot$last.element  <- as.numeric(substr(first.last.dot$my.string, first.last.dot$X2, first.last.dot$X2))

# obtain some book-keeping variables
first.last.dot$number.within.group <- sequence(rle(first.last.dot$id)$lengths)
most.records.per.id                <- max(first.last.dot$number.within.group)
n.ids                              <- length(unique(first.last.dot$id))

# create matrices for recording data
positions.per.id <- matrix(NA, nrow = (n.ids), ncol=(most.records.per.id+1))
values.per.id    <- matrix(NA, nrow = (n.ids), ncol=(most.records.per.id+1))

# use nested for-loops to fill matrices with data
positions.per.id[1,1] = first.last.dot$X1[1]
   values.per.id[1,1] = first.last.dot$first.element[1]

positions.per.id[1,2] = first.last.dot$X2[1]
   values.per.id[1,2] = first.last.dot$last.element[1]

j = 1

for(i in 2:nrow(first.last.dot)) {

     if(first.last.dot$id[i] != first.last.dot$id[i-1]) j = j + 1

      positions.per.id[j, (first.last.dot$number.within.group[i]+0)] = first.last.dot$X1[i]
      positions.per.id[j, (first.last.dot$number.within.group[i]+1)] = first.last.dot$X2[i]

      values.per.id[j, (first.last.dot$number.within.group[i]+0)] = first.last.dot$first.element[i]
      values.per.id[j, (first.last.dot$number.within.group[i]+1)] = first.last.dot$last.element[i]
}

# convert matrix data into new strings using nested for-loops
new.strings <- matrix(0, nrow = nrow(positions.per.id), ncol = nchar(my.data$my.string[1]))

for(i in 1:nrow(positions.per.id)) {
     for(j in 1:ncol(positions.per.id)) {

          new.strings[i,positions.per.id[i,j]] <- values.per.id[i,j]
     }
}

# format new strings
new.strings[is.na(new.strings)] <- 0
new.strings[new.strings==0]     <- '.'

new.strings2 <- data.frame(id = unique(first.last.dot$id), my.string = (do.call(paste0, as.data.frame(new.strings))), stringsAsFactors = FALSE)
new.strings2

all.equal(desired.result, new.strings2)
# [1] TRUE

Upvotes: 4

Answers (3)

bgoldst

Reputation: 35314

Dude, this was tough. Please don't make me explain what I did.

data.frame(id=unique(my.data$id), my.string=sapply(lapply(unique(my.data$id), function(id) gsub('^$','.',substr(gsub('\\.','',do.call(paste0,strsplit(my.data[my.data$id==id,'my.string'],''))),1,1)) ), function(x) paste0(x,collapse='') ), stringsAsFactors=F );

Ok, I'll explain it:

It begins with this lapply() call:

lapply(unique(my.data$id), function(id) ... )

As you can see, the above basically iterates over the unique ids in the data.frame, processing each one in turn. Here's the contents of the function:

gsub('^$','.',substr(gsub('\\.','',do.call(paste0,strsplit(my.data[my.data$id==id,'my.string'],''))),1,1))

Let's take that in pieces, starting with the innermost subexpression:

strsplit(my.data[my.data$id==id,'my.string'],'')

The above indexes all my.string cells for the current id value, and splits each string using strsplit(). This produces a list of character vectors, with each list component containing a vector of character strings, where the whole vector corresponds to the input string which was split. The use of the empty string as the delimiter causes each individual character in each input string to become an element in the output vector in the list component corresponding to said input string.

Here's an example of what the above expression generates (for id==2):

[[1]]
 [1] "1" "1" "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "."

[[2]]
 [1] "." "1" "." "." "." "2" "." "." "." "." "." "." "." "." "." "." "." "." "." "."

[[3]]
 [1] "." "." "." "." "." "2" "." "." "." "3" "." "." "." "." "." "." "." "." "." "."

The above strsplit() call is wrapped in the following (with the ... representing the previous expression):

do.call(paste0,...)

That calls paste0() once, passing the output vectors that were generated by strsplit() as arguments. This does a kind of element-wise pasting of all vectors, so you end up with a single vector like this, for each unique id:

 [1] "1.." "11." "..." "..." "..." ".22" "..." "..." "..." "..3" "..." "..." "..." "..." "..." "..." "..." "..." "..." "..."

The above paste0() call is wrapped in the following:

gsub('\\.','',...)

That strips all literal dots from all elements, resulting in something like this, for each unique id:

 [1] "1"  "11" ""   ""   ""   "22" ""   ""   ""   "3"  ""   ""   ""   ""   ""   ""   ""   ""   ""   ""

The above gsub() call is wrapped in the following:

substr(...,1,1)

That extracts just the first character of each element, which, if it exists, is the desired character in that position. Empty elements are acceptable, as that just means the id had no non-dot characters in any of its input strings at that position.

The above substr() call is wrapped in the following:

gsub('^$','.',...)

That simply replaces empty elements with a literal dot, which is obviously necessary before we put the string back together. So we have, for id==2:

 [1] "1" "1" "." "." "." "2" "." "." "." "3" "." "." "." "." "." "." "." "." "." "."

That completes the function that was given to the lapply() call. Thus, coming out of that call will be a list of character vectors representing the desired output strings. All that remains is collapsing the elements of those vectors back into a single string, which is why we then need this:

sapply(..., function(x) paste0(x,collapse='') )

Using sapply() (simplify-apply) is appropriate because it automatically combines all desired output strings into a single character vector, rather than leaving them as a list:

[1] "11...2...3.........." "...................." "......2.....2...4..." ".1...2....3..34....." "....1.....12....3..." "..................44" ".2.......2.........." "...2...2.....2...2.."

Thus, all that remains is producing the full output data.frame, similar to the input data.frame:

data.frame(id=unique(my.data$id), my.string=..., stringsAsFactors=F )

Resulting in:

  id            my.string
1  2 11...2...3..........
2  5 ....................
3  6 ......2.....2...4...
4  7 .1...2....3..34.....
5  8 ....1.....12....3...
6  9 ..................44
7 10 .2.......2..........
8 11 ...2...2.....2...2..

And we're done!

Upvotes: 2

Henrik

Reputation: 67778

Here's a possibility using functions from stringi and dplyr packages:

library(stringi)
library(dplyr)

# split my.string
m <- stri_split_boundaries(my.data$my.string, type = "character", simplify = TRUE)

df <- data.frame(id = my.data$id, m)

# function to apply to each column - select "." or unique "number"
myfun <- function(x) if(all(x == ".")) "." else unique(x[x != "."])


df %>%
  # for each id...
  group_by(id) %>%

  # ...and each column, apply function
  summarise_each(funs(myfun)) %>%

  # for each row...
  rowwise() %>%

 #...concatenate strings 
  do(data.frame(id = .[1], mystring = paste(.[-1], collapse = "")))

#   id             mystring
# 1  2 11...2...3..........
# 2  5 ....................
# 3  6 ......2.....2...4...
# 4  7 .1...2....3..34.....
# 5  8 ....1.....12....3...
# 6  9 ..................44
# 7 10 .2.......2..........
# 8 11 ...2...2.....2...2..

Upvotes: 2

eddi

Reputation: 49448

Doing this in base R is a bit masochistic, so I won't do that, but with some perseverance you can do it yourself. Here's a data.table version instead (you'll need to install the latest 1.9.5 version from github to get tstrsplit):

library(data.table)
dt = as.data.table(my.data) # or setDT to convert in place

dt[, paste0(lapply(tstrsplit(my.string, ""),
                   function(i) {
                     res = i[i != "."];
                     if (length(res) > 0)
                       res[1]
                     else
                       '.'
                   }), collapse = "")
   , by = id]
#   id                   V1
#1:  2 11...2...3..........
#2:  5 ....................
#3:  6 ......2.....2...4...
#4:  7 .1...2....3..34.....
#5:  8 ....1.....12....3...
#6:  9 ..................44
#7: 10 .2.......2..........
#8: 11 ...2...2.....2...2..

Upvotes: 2

merge strings among rows by id

Answers (3)

Related Questions