Reputation: 6728
I have a data.frame with an ugly column with structured data. Each Column can hold from 1 to 40 values of interest. Each value is separated with a html break "<br />"
. The extracted value as the form of a 1.1
, i.e. an integer a period and another integer.
How to separate and melt these columns into different rows?
I know lapply and tidy::separate probably are the ways to go. But I have not succeeded yet. So asking for help.
testdata is here:
testdata <- dget("http://pastebin.com/download.php?i=VS2cq2rB")
The data frame hold two coloumns: "id"
, and "moduler"
.
I'd like to have "id" and "value" instead. The end result should be something like this.
"id", "value"
1, 1.1
1, 1.2
1, 1.3
1, 2.4
2, 1.1
2, 1.3
2, 3.3
This it my latest take - pretty far from where I started with lapply.
origdf <- data.frame()
#names(newdf) <- c("id", 'pnummer', 'moduler')
for (i in 1:nrow(hs)) {
newdf <- data.frame()
newdf[i, 'id'] <- hs[i, 'id']
newdf[i, 'pnummer'] <- hs[i, 'pnummer']
tmp <- unlist(strsplit(as.character(hs[i,'moduler']), "<br />", fixed=T))
for (m in 3:length(tmp)+3) {
newdf[i, m] <- tmp[m]
}
origdf <- dplyr::bind_rows(newdf, origdf)
}
Upvotes: 1
Views: 75
Reputation: 886978
Here is another option using unnest
from tidyr
. We extract the numeric part ([0-9.]+
) using str_extract_all
from library(stringr)
. The output is a list
. We set the names of the list
elements as the 'id' column of 'testdata' and unnest
library(tidyr)
library(stringr)
res <- unnest(setNames(lapply(str_extract_all(testdata$moduler, '[0-9.]+'),
as.numeric), testdata$id), id)
colnames(res)[2] <- 'value'
head(res)
# id value
#1 2862 1.1
#2 2862 1.2
#3 2862 1.3
#4 2862 1.4
#5 2862 1.5
#6 2862 1.6
dim(res)
#[1] 136 2
Or a base R
approach would be to extract the numeric elements with regmatches/gregexpr
in a list
, get the length
of the list
element with lengths
, replicate the 'id' column from 'testdata' based on that, unlist
the 'lst' and create a new 'data.frame'.
lst <- lapply(regmatches(testdata$moduler, gregexpr('[0-9.]+',
testdata$moduler)), as.numeric)
res2 <- data.frame(id = testdata$id[rep(1:nrow(testdata), lengths(lst))],
value= unlist(lst))
Upvotes: 0
Reputation: 92282
Here's a possible data.table
approach. Basically I'm just splitting moduler
by "<br />"
or "<br />Installationsmontør"
by id
library(data.table)
setDT(testdata)[, .(value = unlist(strsplit(as.character(moduler),
"<br />|<br />Installationsmontør"))), by = id]
# id value
# 1: 2862 1.1
# 2: 2862 1.2
# 3: 2862 1.3
# 4: 2862 1.4
# 5: 2862 1.5
# ---
# 132: 2877 3.6
# 133: 2877 4.1
# 134: 2877 4.4
# 135: 2877 4.5
# 136: 2877 4.6
Or similarly with the splitstackshape
package
library(splitstackshape)
cSplit(testdata, splitCols = "moduler",
sep = "<br />|<br />Installationsmontør",
direction = "long", fixed = FALSE, stripWhite = FALSE)
Upvotes: 3
Reputation: 90
I would try to use strsplit function with a simple loop:
newdata <- NULL
a <- 1
b <- 0
for (k in 1:length(testdata$moduler)) {
M <- unlist(strsplit(as.character(testdata$moduler[k]),"<br />|<br />Installationsmontør"))
b <- b + length(M)
newdata$moduler[a:b] <- M
newdata$id[a:b] <- testdata$id[k]
a <- b + 1
}
newdata <- as.data.frame(newdata)
Upvotes: 1