Reputation: 15
My dataframe (df):
ID1 | ID2 | V1 | V2 | V3
A | B | var1 | foo | 1
C | D | var2 | bar | 2
E | F | var3 | foo | 3
G | F | var3 | foo | 3
H | I | var4 | zap | 2
...
ID1 and ID2 contain overlapping values, as it's a long format version of an upper matrix triangle with identical comparsions (eg A, A) removed and some additional metadata (V1,V2,V3) added.
The above must be grouped by V1, V2 & V3, and the final output is to be a list of IDs (ID1 and ID2 contain overlapping variables) that make up each group (with each list being a seperate file).
So far, I've grouped the variables but stuck on how to go on about iterating through each of dplyr's groups and obtaining the values for each.
A pseudocode of what I have in mind is below:
# Group
cluster <- df %>% group_by(V1,V2,V3)
[?] # loop through each group in cluster
[?] # get group values as x, y and z
# Get IDs into lists and merge
ID1 <- df %>% filter(V1 == x, V2 ==y, V3 == z) %>%
pull(ID1)
ID2 <- df %>% filter(V1 == x, V2 ==y, V3 == z) %>%
pull(ID2)
merged <- c(ID1,ID2)
merged_unique <- unique(unlist(merged))
# Print out to file
fileConn <- file(paste(X ,Y, Z,"txt", sep="."))
writeLines(merged_unique, fileConn)
close(fileConn)
I would like my final output to be:
A
B
C
D
E
F
G
H
I
Any help is appreciated.
Upvotes: 1
Views: 1168
Reputation: 102700
I am not sure about the expected output. Hope the code below could help a bit
lapply(
split(
df[c("ID1", "ID2")],
with(df, do.call(paste, list(V1, V2, V3)))
),
function(v) unique(unlist(v))
)
which gives
$`var1 foo 1`
[1] "A" "B"
$`var2 bar 2`
[1] "C" "D"
$`var3 foo 3`
[1] "E" "G" "F"
$`var4 zap 2`
[1] "H" "I"
If you want to save all the groups into different *.txt
files, you can try the code below
lst <- lapply(
split(
df[c("ID1", "ID2")],
with(df, do.call(paste, list(V1, V2, V3,sep = "_")))
),
function(v) unique(unlist(v))
)
sapply(seq_along(lst),function(k) writeLines(lst[[k]],paste0(names(lst[k]),".txt")))
Upvotes: 2
Reputation: 183
Generating your "data":
df <- data.frame("ID1" = c("A","B","C","E","G","H"), "ID2" = c("B","B","D","Fe","Fe","I"), "V1" = c("var1","var1","var2","var3","var3","var4"),"V2" = c("foo","foo","bar","foo","foo","zed"), "V3" = c(1,1,2,3,3,2))
Clustering the data and getting the unique clusters:
library(dplyr)
df_clust <- df %>% group_by(V1,V2,V3)
df_tally <- df_clust %>% tally()
Looping through, assuming only two ID columns and 3 feature columns, and printing each result to a new file:
for (i in c(1:nrow(df_tally))){
pull1 <- df %>% filter(V1 == unlist(df_tally[i,1]), V2 == unlist(df_tally[i,2]), V3 == unlist(df_tally[i,3])) %>% pull(ID1)
pull2 <- df %>% filter(V1 == unlist(df_tally[i,1]), V2 == unlist(df_tally[i,2]), V3 == unlist(df_tally[i,3])) %>% pull(ID2)
mergeID <- c(type.convert(pull1,as.is = TRUE),type.convert(pull2,as.is = TRUE))
mergeID <- unique(mergeID)
filename <- paste("TEST_",i, ".txt", sep="")
fileConn<-file(filename)
writeLines(mergeID, fileConn)
close(fileConn)
}
Upvotes: 1