Ferenc Kagan

Reputation: 11

Is there a way to reformat the structure of the following text file?

I would like to kindly ask for the help of the community in reshaping a text file. The text file looks like this:

TRINITY_GG_17866_c6_g1_i1
TRINITY_GG_17866_c3_g1_i1
TRINITY_GG_17866_c1_g1_i7
GO:0000226
GO:0006139
GO:0006259
TRINITY_GG_17866_c5_g1_i1
GO:0003674
GO:0005488

What I would like to get in the end looks like this:

TRINITY_GG_17866_c6_g1_i1
TRINITY_GG_17866_c3_g1_i1
TRINITY_GG_17866_c1_g1_i7 GO:0000226,GO:0006139,GO:0006259
TRINITY_GG_17866_c5_g1_i1 GO:0003674,GO:0005488

I could not come up with any solutions so far on how to do this. I would really appreciate any advice on this issue.

Best wishes, Ferenc

Upvotes: 0

Views: 29

Answers (3)

Onyambu

Reputation: 79188

You could do:

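dat <- readLines("yourfile.txt")
grp <- cumsum(grepl("^TRINITY", dat))  # every TRINITY line starts a new group
out <- tapply(dat, grp, function(x) trimws(paste(x[1], paste(x[-1], collapse = ","))))  # ID, space, comma-separated GO terms
cat(out, sep = "\n", file = "newfile.txt")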
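To see why the cumsum() grouping works, here is a small sketch on the sample lines from the question, typed in by hand so that no input file is needed. The collapse step uses paste() so the output has the ID, then a space, then comma-separated GO terms, as requested. The names grp, groups and out are only for illustration.

```r
# Sample lines copied from the question
dat <- c("TRINITY_GG_17866_c6_g1_i1", "TRINITY_GG_17866_c3_g1_i1",
         "TRINITY_GG_17866_c1_g1_i7", "GO:0000226", "GO:0006139", "GO:0006259",
         "TRINITY_GG_17866_c5_g1_i1", "GO:0003674", "GO:0005488")

# TRUE on every TRINITY line; the running sum gives each ID and the
# GO lines that follow it the same group number
grp <- cumsum(grepl("^TRINITY", dat))
print(grp)  # 1 2 3 3 3 3 4 4 4

# One character vector per group: the ID first, then its GO terms (if any)
groups <- split(dat, grp)

# Keep a lone ID as is; otherwise append the GO terms, comma-separated
out <- unname(vapply(groups, function(x) {
  if (length(x) == 1) x else paste(x[1], paste(x[-1], collapse = ","))
}, character(1)))

print(out)
```

The first two IDs have no GO lines after them, so their groups have length 1 and they come out unchanged.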

Upvotes: 2

Derek Fulton

Reputation: 328

I like the tidyverse solution but I would honestly opt for some quick and dirty R in this case:

df <- read.table("stack_overflow.csv", stringsAsFactors = FALSE)  # Read it in
result <- list()  # Initialize the result
for (row in as.vector(df$V1)) {
  if (startsWith(row, "TRINITY")) {  # If it starts with "TRINITY" then start a new row in result
    result[[length(result)+1]] <- c(row)
  }
  else {  # Otherwise, append whatever is there to the current row
    if (grepl(" ", result[[length(result)]])) {  # If it already has a space in it (already has a GO appended), add a comma
      result[[length(result)]] <- paste0(result[[length(result)]], ",", row) 
    }
    else {  # Otherwise just add a space
      result[[length(result)]] <- paste0(result[[length(result)]], " ", row)
    }
  }
}
result <- unlist(result)  # Convert the list to a character vector
print(result)  # Print it so you can check it out
write.table(result, file="formatted_file", row.names = FALSE, quote = FALSE, col.names = FALSE)  # Write the table

Upvotes: 0

akrun

Reputation: 886938

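We can create a grouping column based on the occurrence of the 'TRINITY' prefix in the column, then take the first element of each group as the ID and collapse the remaining elements into a comma-separated string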

library(dplyr)
library(stringr)  # provides str_c()
df1 %>%
  group_by(grp = cumsum(startsWith(v1, 'TRINITY'))) %>%
  summarise(value1 = v1[1],
            value2 = case_when(n() > 1 ~ str_c(v1[-1], collapse = ","),
                               TRUE ~ '')) %>%
  select(-grp)
# A tibble: 4 x 2
#  value1                    value2                            
#  <chr>                     <chr>                             
#1 TRINITY_GG_17866_c6_g1_i1 ""                                
#2 TRINITY_GG_17866_c3_g1_i1 ""                                
#3 TRINITY_GG_17866_c1_g1_i7 "GO:0000226,GO:0006139,GO:0006259"
#4 TRINITY_GG_17866_c5_g1_i1 "GO:0003674,GO:0005488"         

data

df1 <- structure(list(v1 = c("TRINITY_GG_17866_c6_g1_i1", "TRINITY_GG_17866_c3_g1_i1", 
"TRINITY_GG_17866_c1_g1_i7", "GO:0000226", "GO:0006139", "GO:0006259", 
"TRINITY_GG_17866_c5_g1_i1", "GO:0003674", "GO:0005488")), class = "data.frame", row.names = c(NA, 
-9L))
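
The pipeline above returns a two-column tibble rather than a text file. A minimal sketch of writing it out in the layout from the question, assuming the result has been stored in an object called res (a hypothetical name; a plain data frame with the same values stands in for it here) and using a temporary file as a placeholder output path:

```r
# Stand-in for the tibble printed above; `res` is a hypothetical name
res <- data.frame(
  value1 = c("TRINITY_GG_17866_c6_g1_i1", "TRINITY_GG_17866_c3_g1_i1",
             "TRINITY_GG_17866_c1_g1_i7", "TRINITY_GG_17866_c5_g1_i1"),
  value2 = c("", "", "GO:0000226,GO:0006139,GO:0006259", "GO:0003674,GO:0005488"),
  stringsAsFactors = FALSE
)

# Join ID and GO list with a space; trimws() drops the trailing
# space left on IDs that have no GO terms
lines_out <- trimws(paste(res$value1, res$value2))

out_file <- tempfile(fileext = ".txt")  # placeholder output path
writeLines(lines_out, out_file)
```

writeLines() puts one element per line, so each ID ends up on its own line with its GO terms after a single space.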

Upvotes: 0
