Andreas
Andreas

Reputation: 6728

Split arbitrary column into melted data frame

I have a data.frame with an ugly column with structured data. Each Column can hold from 1 to 40 values of interest. Each value is separated with a html break "<br />". The extracted value as the form of a 1.1, i.e. an integer a period and another integer.

How to separate and melt these columns into different rows?

I know lapply and tidy::separate probably are the ways to go. But I have not succeeded yet. So asking for help.

testdata is here:

testdata <- dget("http://pastebin.com/download.php?i=VS2cq2rB")

The data frame hold two coloumns: "id", and "moduler".

I'd like to have "id" and "value" instead. The end result should be something like this.

"id", "value"
1, 1.1
1, 1.2
1, 1.3
1, 2.4
2, 1.1
2, 1.3
2, 3.3

This it my latest take - pretty far from where I started with lapply.

origdf <- data.frame()
#names(newdf) <- c("id", 'pnummer', 'moduler')

for (i in 1:nrow(hs)) {
  newdf <- data.frame()
  newdf[i, 'id'] <- hs[i, 'id']
  newdf[i, 'pnummer'] <- hs[i, 'pnummer']
  tmp <- unlist(strsplit(as.character(hs[i,'moduler']), "<br />", fixed=T))
  for (m in 3:length(tmp)+3) {
    newdf[i, m] <- tmp[m]
  }
  origdf <- dplyr::bind_rows(newdf, origdf)
}

Upvotes: 1

Views: 75

Answers (3)

akrun
akrun

Reputation: 886978

Here is another option using unnest from tidyr. We extract the numeric part ([0-9.]+) using str_extract_all from library(stringr). The output is a list. We set the names of the list elements as the 'id' column of 'testdata' and unnest

 library(tidyr)
 library(stringr) 
 res <- unnest(setNames(lapply(str_extract_all(testdata$moduler, '[0-9.]+'), 
                   as.numeric), testdata$id), id)
 colnames(res)[2] <- 'value'
 head(res)
 #   id  value
 #1 2862 1.1
 #2 2862 1.2
 #3 2862 1.3
 #4 2862 1.4
 #5 2862 1.5
 #6 2862 1.6

 dim(res)
 #[1] 136   2

Or a base R approach would be to extract the numeric elements with regmatches/gregexpr in a list, get the length of the list element with lengths, replicate the 'id' column from 'testdata' based on that, unlist the 'lst' and create a new 'data.frame'.

 lst <- lapply(regmatches(testdata$moduler, gregexpr('[0-9.]+',
              testdata$moduler)), as.numeric)
 res2 <- data.frame(id = testdata$id[rep(1:nrow(testdata), lengths(lst))],
                            value= unlist(lst))

Upvotes: 0

David Arenburg
David Arenburg

Reputation: 92282

Here's a possible data.table approach. Basically I'm just splitting moduler by "<br />" or "<br />Installationsmontør" by id

library(data.table)
setDT(testdata)[, .(value = unlist(strsplit(as.character(moduler), 
                            "<br />|<br />Installationsmontør"))), by = id]
#        id value
#   1: 2862   1.1
#   2: 2862   1.2
#   3: 2862   1.3
#   4: 2862   1.4
#   5: 2862   1.5
# ---           
# 132: 2877   3.6
# 133: 2877   4.1
# 134: 2877   4.4
# 135: 2877   4.5
# 136: 2877   4.6

Or similarly with the splitstackshape package

library(splitstackshape)
cSplit(testdata, splitCols = "moduler", 
       sep = "<br />|<br />Installationsmontør", 
       direction = "long", fixed = FALSE, stripWhite = FALSE)

Upvotes: 3

T.Des
T.Des

Reputation: 90

I would try to use strsplit function with a simple loop:

newdata  <-  NULL
a <- 1
b <- 0
for (k in 1:length(testdata$moduler)) {
  M <- unlist(strsplit(as.character(testdata$moduler[k]),"<br />|<br />Installationsmontør"))
  b <- b + length(M)
  newdata$moduler[a:b] <- M
  newdata$id[a:b] <- testdata$id[k]
  a <- b + 1
}
newdata <- as.data.frame(newdata)

Upvotes: 1

Related Questions