Reputation: 11
I would like to kindly ask for the help of the community in reshaping a text file. The text file looks like this:
TRINITY_GG_17866_c6_g1_i1
TRINITY_GG_17866_c3_g1_i1
TRINITY_GG_17866_c1_g1_i7
GO:0000226
GO:0006139
GO:0006259
TRINITY_GG_17866_c5_g1_i1
GO:0003674
GO:0005488
What I would like to get in the end is this:
TRINITY_GG_17866_c6_g1_i1
TRINITY_GG_17866_c3_g1_i1
TRINITY_GG_17866_c1_g1_i7 GO:0000226,GO:0006139,GO:0006259
TRINITY_GG_17866_c5_g1_i1 GO:0003674,GO:0005488
I could not come up with any solutions so far on how to do this. I would really appreciate any advice on this issue.
Best wishes, Ferenc
Upvotes: 0
Views: 29
Reputation: 79188
You could do:
dat <- readLines("yourfile.txt")
cat(tapply(dat, cumsum(grepl("^TRINITY",dat)), toString), sep="\n", file = "newfile.txt")
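Note that toString() joins every element of a group with ", ", so the ID and its GO terms are separated by a comma and a space as well. If you need the exact layout from the question (ID, one space, then the GO terms separated by commas only), a minimal sketch along the same lines might look like this. The helper name format_group is my own, and the input is typed in directly here instead of being read with readLines():

```r
# Example input copied from the question (normally: dat <- readLines("yourfile.txt"))
dat <- c("TRINITY_GG_17866_c6_g1_i1", "TRINITY_GG_17866_c3_g1_i1",
         "TRINITY_GG_17866_c1_g1_i7", "GO:0000226", "GO:0006139", "GO:0006259",
         "TRINITY_GG_17866_c5_g1_i1", "GO:0003674", "GO:0005488")

# Group index that increases by one at every line starting with "TRINITY"
grp <- cumsum(grepl("^TRINITY", dat))

# Hypothetical helper: format one group as "ID GO1,GO2,..." (or just "ID")
format_group <- function(x) {
  if (length(x) == 1) return(x)
  paste(x[1], paste(x[-1], collapse = ","))
}

# One output line per group, in the original order
out <- unname(vapply(split(dat, grp), format_group, character(1)))

# Print to the console; pass a file name as the second argument to write a file
writeLines(out)
```

IDs without any GO terms come out unchanged, without a trailing space.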
Upvotes: 2
Reputation: 328
I like the tidyverse solution but I would honestly opt for some quick and dirty R in this case:
df <- read.table("stack_overflow.csv", stringsAsFactors = FALSE) # Read it in
result <- list() # Initialize the result
for (row in as.vector(df$V1)) {
  if (startsWith(row, "TRINITY")) { # If it starts with "TRINITY" then start a new row in result
    result[[length(result) + 1]] <- c(row)
  } else { # Otherwise, append whatever is there to the current row
    if (grepl(" ", result[[length(result)]])) { # If it already has a space in it (already has a GO appended), add a comma
      result[[length(result)]] <- paste0(result[[length(result)]], ",", row)
    } else { # Otherwise just add a space
      result[[length(result)]] <- paste0(result[[length(result)]], " ", row)
    }
  }
}
result <- sapply(result, function(x) x) # Convert to vector
print(result) # Print it so you can check it out
write.table(result, file = "formatted_file", row.names = FALSE, quote = FALSE, col.names = FALSE) # Write the table
Upvotes: 0
Reputation: 886938
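We can create a grouping column based on the occurrence of the substring 'TRINITY' at the start of each value, then take the first element of each group as the ID and collapse the remaining elements into a comma-separated string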
library(dplyr)
library(stringr) # provides str_c()
df1 %>%
  group_by(grp = cumsum(startsWith(v1, 'TRINITY'))) %>%
  summarise(value1 = v1[1],
            value2 = case_when(n() > 1 ~ str_c(v1[-1], collapse = ","),
                               TRUE ~ '')) %>%
  select(-grp)
# A tibble: 4 x 2
# value1 value2
# <chr> <chr>
#1 TRINITY_GG_17866_c6_g1_i1 ""
#2 TRINITY_GG_17866_c3_g1_i1 ""
#3 TRINITY_GG_17866_c1_g1_i7 "GO:0000226,GO:0006139,GO:0006259"
#4 TRINITY_GG_17866_c5_g1_i1 "GO:0003674,GO:0005488"
The example data used above:
df1 <- structure(list(v1 = c("TRINITY_GG_17866_c6_g1_i1", "TRINITY_GG_17866_c3_g1_i1",
"TRINITY_GG_17866_c1_g1_i7", "GO:0000226", "GO:0006139", "GO:0006259",
"TRINITY_GG_17866_c5_g1_i1", "GO:0003674", "GO:0005488")), class = "data.frame",
row.names = c(NA, -9L))
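If the goal is the text file from the question and not a two-column tibble, one possible sketch is to paste both parts together inside summarise() and write the result with writeLines(). This assumes dplyr >= 1.0 (for the .groups argument); the column name line is only an illustrative choice:

```r
library(dplyr)

# Example data from the question, one value per row
df1 <- data.frame(
  v1 = c("TRINITY_GG_17866_c6_g1_i1", "TRINITY_GG_17866_c3_g1_i1",
         "TRINITY_GG_17866_c1_g1_i7", "GO:0000226", "GO:0006139", "GO:0006259",
         "TRINITY_GG_17866_c5_g1_i1", "GO:0003674", "GO:0005488"),
  stringsAsFactors = FALSE
)

res <- df1 %>%
  # A new group starts at every value beginning with "TRINITY"
  group_by(grp = cumsum(startsWith(v1, "TRINITY"))) %>%
  # ID, a space, then the comma-separated GO terms; trimws() drops the
  # trailing space for IDs that have no GO terms
  summarise(line = trimws(paste(v1[1], paste(v1[-1], collapse = ","))),
            .groups = "drop")

# Print to the console; pass a file name as the second argument to write a file
writeLines(res$line)
```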
Upvotes: 0