Reputation: 2042
This has been asked in various places for SQL, but can't find a simple way to do it in R
I have a dataset like this
Ids v1 v2 v3 v4 v5
548|14721,678|17604,716|18316,732|18505,745|18626,752|18716 9186 639 9045 316 28396
548|14721,678|17603,716|18316,732|18507,745|18626,752|18716 9041 598 8897 283 28054
548|14722,678|17603,716|18316,732|18507,745|18626,752|18716 8799 588 8669 246 27433
548|14721,678|17603,716|18316,732|18505,745|18626,752|18715 8914 614 8765 273 27347
716|18316,745|18626 4113 497 4050 270 27267
548|14722,678|17604,716|18316,732|18507,745|18626,752|18716 8829 589 8713 254 25270
I want to split based on the first column on comma and create a row for each entry, repeating the entries in the rest of the rows.
The function I have written is
split.data <- function(data, split = ",") {
y <- NULL
for( i in 1:nrow(data)){
y <- rbind(y,cbind(data.frame(unlist(strsplit(as.character(data[i,1]), split = split))),data[i,-1]))
}
names(y) <- names(data)
y <- sapply(y,as.character)
return(as.data.frame(y))
}
This works but its extremely slow. Is there a way to vectorize this and make it faster. The file I have has over 5000 such rows and it takes a while to run.
Related articles I have found for SQL: Split values over multiple rows Turning a Comma Separated string into individual rows
Upvotes: 4
Views: 5438
Reputation: 193527
These steps can be combined if you use concat.split.multiple
from my "splitstackshape" package. This function makes use of count.fields
automatically, so it shouldn't suffer from the problem mentioned in the comments:
library(splitstackshape)
out <- concat.split.multiple(mydf, "Ids", seps=",", "long")
head(out)
# v1 v2 v3 v4 v5 time Ids
# 1 9186 639 9045 316 28396 1 548|14721
# 2 9041 598 8897 283 28054 1 548|14721
# 3 8799 588 8669 246 27433 1 548|14722
# 4 8914 614 8765 273 27347 1 548|14721
# 5 4113 497 4050 270 27267 1 716|18316
# 6 8829 589 8713 254 25270 1 548|14722
tail(out)
# v1 v2 v3 v4 v5 time Ids
# 31 9186 639 9045 316 28396 6 752|18716
# 32 9041 598 8897 283 28054 6 752|18716
# 33 8799 588 8669 246 27433 6 752|18716
# 34 8914 614 8765 273 27347 6 752|18715
# 35 4113 497 4050 270 27267 6 <NA>
# 36 8829 589 8713 254 25270 6 752|18716
You need to do a combination of splitting the "Ids" strings followed by "reshaping" your data (if I understand your goal correctly).
Instead of splitting the way you did, I've gone ahead and taken advantage of read.csv
and the argument fill = TRUE
. Note that if the values in your "Ids" column are currently factors, you'll need to use text = as.character(mydf$Ids)
instead.
Here's how I'd proceed:
mydf2 <- cbind(read.csv(text = mydf$Ids, fill = TRUE, header = FALSE), mydf[-1])
mydf2
# V1 V2 V3 V4 V5 V6 v1 v2 v3 v4 v5
# 1 548|14721 678|17604 716|18316 732|18505 745|18626 752|18716 9186 639 9045 316 28396
# 2 548|14721 678|17603 716|18316 732|18507 745|18626 752|18716 9041 598 8897 283 28054
# 3 548|14722 678|17603 716|18316 732|18507 745|18626 752|18716 8799 588 8669 246 27433
# 4 548|14721 678|17603 716|18316 732|18505 745|18626 752|18715 8914 614 8765 273 27347
# 5 716|18316 745|18626 4113 497 4050 270 27267
# 6 548|14722 678|17604 716|18316 732|18507 745|18626 752|18716 8829 589 8713 254 25270
That data is currently in a "wide" format. Let's make it into a "long" format. The reshape
function needs a little bit of information on how to proceed. In particular, it needs to know:
idvar
s.read.csv
, and by manually checking their indexes, we can see that they are in positions 1 through 6. Obviously, you would need to specify the column numbers from your actual dataset.For the example dataset you provided, we would use reshape
as follows:
mydf3 <- reshape(mydf2, direction = "long", idvar=7:ncol(mydf2),
varying=1:6, sep = "")
rownames(mydf3) <- NULL
Here's the head
and tail
of the resulting data.frame
:
> head(mydf3)
v1 v2 v3 v4 v5 time V
1 9186 639 9045 316 28396 1 548|14721
2 9041 598 8897 283 28054 1 548|14721
3 8799 588 8669 246 27433 1 548|14722
4 8914 614 8765 273 27347 1 548|14721
5 4113 497 4050 270 27267 1 716|18316
6 8829 589 8713 254 25270 1 548|14722
> tail(mydf3)
v1 v2 v3 v4 v5 time V
31 9186 639 9045 316 28396 6 752|18716
32 9041 598 8897 283 28054 6 752|18716
33 8799 588 8669 246 27433 6 752|18716
34 8914 614 8765 273 27347 6 752|18715
35 4113 497 4050 270 27267 6
36 8829 589 8713 254 25270 6 752|18716
This is all assuming we are starting with an object named "mydf" that looks like this:
mydf <- structure(list(Ids = c("548|14721,678|17604,716|18316,732|18505,745|18626,752|18716",
"548|14721,678|17603,716|18316,732|18507,745|18626,752|18716",
"548|14722,678|17603,716|18316,732|18507,745|18626,752|18716",
"548|14721,678|17603,716|18316,732|18505,745|18626,752|18715",
"716|18316,745|18626", "548|14722,678|17604,716|18316,732|18507,745|18626,752|18716"
), v1 = c(9186L, 9041L, 8799L, 8914L, 4113L, 8829L), v2 = c(639L,
598L, 588L, 614L, 497L, 589L), v3 = c(9045L, 8897L, 8669L, 8765L,
4050L, 8713L), v4 = c(316L, 283L, 246L, 273L, 270L, 254L), v5 = c(28396L,
28054L, 27433L, 27347L, 27267L, 25270L)), .Names = c("Ids", "v1",
"v2", "v3", "v4", "v5"), class = "data.frame", row.names = c(NA,
-6L))
Upvotes: 12