Reputation: 645
Sorry for the noob question, but after a few days I haven't been able to figure out how to do this. I've been trying to use R. Simply put, I have two columns as follows
A:B:C:D:F | 1.1:2.1:3.1:4.1:6.1
A:B:D:F | 1.2:2.2:4.2:6.2
A:B:C:F | 1.3:2.3:3.3:6.3
B:C:D:F | 2.4:3.4:4.4:6.4
Note the delimiter is ':'. At the end I want to have this:
A | B | C | D | E | F
1.1 | 2.1 | 3.1 | 4.1 | NA | 6.1
1.2 | 2.2 | NA | 4.2 | NA | 6.2
1.3 | 2.3 | 3.3 | NA | NA | 6.3
NA | 2.4 | 3.4 | 4.4 | NA | 6.4
Why I can't solve it:
The values in the second column are different for each row, so I need to loop check whether there's a string in column 1, row x and insert the number in the corresponding column 2, row x if it exists. **I have just chosen 1.1, 1.2 etc, for rows to make it easier to conceptualize.
0's or NAs aren't included in column 1, so I need to skip columns when there's missing values, for example in the toy example row 2 is missing C and E proposed columns. Each row does not have a required number of strings and corresponding values (row 1 has 5 strings, row 2 through 4 have 4
I would imagine something similar to the following, replacing "1" for the corresponding column 2, row x value, but I have no idea how to do this. Another approach I experimented with but was stuck at the inserting column 2 values was to create columns with 1 and 0 only based on if a string was present, which was how I came across the code snippet.
df$A <- ifelse(grepl("A", df$PASS, ignore.case = T), "1", "0")
Sorry for the long writeup, but I'm super stuck. I feel this is beyond my beginner level R. Major kudos to anyone that can solve this!
Upvotes: 0
Views: 106
Reputation: 193657
I'd recommend looking at a combination of cSplit
from my "splitstackshape" package along with dcast
:
library(splitstackshape)
dcast(cSplit(as.data.table(df1)[, id := 1:nrow(df1)], names(df1), ":", "long"),
id ~ col1, value.var = "col2")
id A B C D F
1: 1 1.1 2.1 3.1 4.1 6.1
2: 2 1.2 2.2 NA 4.2 6.2
3: 3 1.3 2.3 3.3 NA 6.3
4: 4 NA 2.4 3.4 4.4 6.4
Upvotes: 1
Reputation: 79318
Using base R and writing in a general format:
using the data
df1 <- structure(list(col1 = c("A:B:C:D:F", "A:B:D:F", "A:B:C:F", "B:C:D:F"), col2 = c
("1.1:2.1:3.1:4.1:6.1", "1.2:2.2:4.2:6.2", "1.3:2.3:3.3:6.3", "2.4:3.4:4.4:6.4")),
.Names = c("col1", "col2"), class = "data.frame", row.names = c(NA, -4L))
Then the code will be
a=apply(df1,1,strsplit,":",fixed=TRUE)
b=lapply(a,function(x){y=`names<-`(as.numeric(x[[2]]),x[[1]]);data.frame(t(y))})
d=Reduce(function(x,y) merge(x,y,by =intersect(names(x),names(y)),all.x = TRUE,all.y = TRUE),b)
b1=max(match(names(unlist(b)),LETTERS))
e=LETTERS[1:b1][!LETTERS[1:b1]%in%names(d)]
f=`names<-`(data.frame(d,rep(list(NA),length(e))),c(names(d),e))
g=f[,order(names(f))]
Reduce(rbind,c(paste0(names(g),collapse = " | "),apply(g,1,paste0,collapse="|")))
If the data is
x <- "A:B:C:D:F | 1.1:2.1:3.1:4.1:6.1
A:B:D:F | 1.2:2.2:4.2:6.2
A:B:C:F | 1.3:2.3:3.3:6.3
B:C:D:F | 2.4:3.4:4.4:6.4"
Then you just need to transform this with the one line code below and run the above code.
df1=matrix(unlist(strsplit(unlist(strsplit(x,"\n"))," | ",fixed = T)),ncol=2,byrow=T)
Upvotes: 1
Reputation: 965
Here's a solution in just base R without the magic of tidyverse. It assumes you can read in all your data as a big string but it's not too hard to alter it for an input stream.
x <- "A:B:C:D:F | 1.1:2.1:3.1:4.1:6.1
A:B:D:F | 1.2:2.2:4.2:6.2
A:B:C:F | 1.3:2.3:3.3:6.3
B:C:D:F | 2.4:3.4:4.4:6.4"
data <- unlist(str_split(x, "\n"))
result <- matrix(as.numeric(NA), nrow = length(data), ncol = 6)
colnames(result) <- c("A", "B", "C", "D", "E", "F")
for (i in 1:length(data)) {
split_data <- unlist(str_split(data[i], " [|] "))
print(split_data)
indices <- unlist(str_split(split_data[1], ":"))
values <- unlist(str_split(split_data[2], ":"))
for (j in 1:length(indices)) {
result[i, indices[j]] <- as.numeric(values[j])
}
}
result
Upvotes: 3
Reputation: 887671
We can try with tidyverse
library(tidyverse)
res <- df1 %>%
map(~strsplit(., ":")) %>%
transpose %>%
map(~set_names(as.data.frame.list(as.numeric(.[[2]])), .[[1]])) %>%
bind_rows %>%
right_join(as.data.frame(setNames(rep(list(NA), 6), LETTERS[1:6])), .)
res
# A B C D E F
#1 1.1 2.1 3.1 4.1 NA 6.1
#2 1.2 2.2 NA 4.2 NA 6.2
#3 1.3 2.3 3.3 NA NA 6.3
#4 NA 2.4 3.4 4.4 NA 6.4
df1 <- structure(list(col1 = c("A:B:C:D:F", "A:B:D:F", "A:B:C:F", "B:C:D:F"
), col2 = c("1.1:2.1:3.1:4.1:6.1", "1.2:2.2:4.2:6.2", "1.3:2.3:3.3:6.3",
"2.4:3.4:4.4:6.4")), .Names = c("col1", "col2"), class = "data.frame",
row.names = c(NA, -4L))
Upvotes: 2