Paulo Sergio Schlogl

Reputation: 504

How to make a csv row for each 2 lines in a txt file

I have a text file like this:

Viruses/GCF_000820355.1_ViralMultiSegProj14361_genomic.fna.gz
Sclerophthora macrospora virus A
Viruses/GCF_000820495.2_ViralMultiSegProj14656_genomic.fna.gz
Influenza B virus RNA
Viruses/GCF_000837105.1_ViralMultiSegProj14079_genomic.fna.gz
Tomato mottle virus

And I need to get a csv file like this:

Viruses/GCF_000820355.1_ViralMultiSegProj14361_genomic.fna.gz,Sclerophthora macrospora virus A
Viruses/GCF_000820495.2_ViralMultiSegProj14656_genomic.fna.gz,Influenza B virus RNA
Viruses/GCF_000837105.1_ViralMultiSegProj14079_genomic.fna.gz,Tomato mottle virus

Because later I want to use this like a tuple to find the compressed file, read it and get a final file with names like:

Viruses/GCF_000837105.1/Tomato mottle virus.fna

I just need to learn how to do the first part of the problem. It could be with:

Any help would be much appreciated. This is hard for me to accomplish because the original filenames are very messy.

I have tried this:

sed -z 's/\n/,/g;s/,$/\n/' multi_headers

However, it replaces every \n with a comma.

Upvotes: 4

Views: 254

Answers (9)

Michail Alexakis

Reputation: 1585

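To add yet another solution into the mix, you can also use GNU xargs to group input lines in pairs, then replace the first space with ',' in each output line. This relies on the first line of each pair containing no spaces, which holds for the paths shown in the question.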

xargs -n2 -d'\n' -a input.txt | sed 's/ /,/'

Upvotes: 1

Freddy Mcloughlan

Reputation: 4496

A simple writerows():

import csv

with open("text.txt", "r") as f:
    with open("data.csv", "w", newline="") as w:
        writer = csv.writer(w)
        # May want to add headers
        writer.writerow(["Heading1", "Heading2"])
        # splitlines() drops the trailing newlines that readlines() would keep
        x = f.read().splitlines()
        writer.writerows([x[i:i+2] for i in range(0, len(x), 2)])

Which yields:

Heading1,Heading2
Viruses/GCF_000820355.1_ViralMultiSegProj14361_genomic.fna.gz,Sclerophthora macrospora virus A
Viruses/GCF_000820495.2_ViralMultiSegProj14656_genomic.fna.gz,Influenza B virus RNA
Viruses/GCF_000837105.1_ViralMultiSegProj14079_genomic.fna.gz,Tomato mottle virus
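
Since the question mentions using each row like a tuple later on, reading the CSV back could be sketched roughly as follows. The function name read_pairs is hypothetical, and the sketch assumes the file starts with the heading row written above; an in-memory string stands in for data.csv so it runs without any file.

```python
import csv
import io


def read_pairs(handle, has_header=True):
    """Return each CSV row as a (path, name) tuple."""
    reader = csv.reader(handle)
    if has_header:
        next(reader, None)  # skip the heading row, if the file has one
    # Ignore blank or malformed rows rather than failing on them
    return [tuple(row) for row in reader if len(row) == 2]


# In-memory stand-in for data.csv (assumption: same layout as the output above)
sample = io.StringIO(
    "Heading1,Heading2\n"
    "Viruses/GCF_000837105.1_ViralMultiSegProj14079_genomic.fna.gz,Tomato mottle virus\n"
)
pairs = read_pairs(sample)
print(pairs)
```

With a real file, the handle would come from open("data.csv", newline="") instead of io.StringIO.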

Upvotes: 2

Daweo

Reputation: 36430

Simple Python 3 solution. Let the content of file.txt be

Viruses/GCF_000820355.1_ViralMultiSegProj14361_genomic.fna.gz
Sclerophthora macrospora virus A
Viruses/GCF_000820495.2_ViralMultiSegProj14656_genomic.fna.gz
Influenza B virus RNA
Viruses/GCF_000837105.1_ViralMultiSegProj14079_genomic.fna.gz
Tomato mottle virus

and script.py

with open("file.txt","r") as f:
    for inx, line in enumerate(f):
        print(line.rstrip(), end='\n' if inx%2 else ',')

then

python script.py

output

Viruses/GCF_000820355.1_ViralMultiSegProj14361_genomic.fna.gz,Sclerophthora macrospora virus A
Viruses/GCF_000820495.2_ViralMultiSegProj14656_genomic.fna.gz,Influenza B virus RNA
Viruses/GCF_000837105.1_ViralMultiSegProj14079_genomic.fna.gz,Tomato mottle virus

Explanation: I use .rstrip to drop the trailing newline, then, depending on whether the line index is odd or even, I use \n or , respectively as the line end. Note that enumerate starts at 0 by default, as opposed to NR in GNU AWK, which starts at 1. Note also that iterating with for ... in filehandle avoids loading the whole file at once, so this solution can also be used for files bigger than available RAM.
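
The same streaming idea can also be written without tracking an index, by pulling two lines per step from one iterator. This is only a sketch: pair_lines is a hypothetical name, and the sample input is made up so that it runs without a file.

```python
import io


def pair_lines(handle):
    """Yield 'first,second' for each consecutive pair of lines."""
    # zip(handle, handle) draws two lines per step from the same iterator,
    # so only one pair is held in memory at a time.
    for first, second in zip(handle, handle):
        yield f"{first.rstrip()},{second.rstrip()}"


# Made-up sample input standing in for file.txt
sample = io.StringIO("a.fna.gz\nVirus A\nb.fna.gz\nVirus B\n")
rows = list(pair_lines(sample))
print(rows)
```

One design consequence: if the input has an odd number of lines, the unpaired last line is silently dropped rather than printed with a trailing comma.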

Upvotes: 1

benson23

Reputation: 19097

Bash

You can do a paste (thanks @glenn jackman for pointing out my previous useless use of cat).

# or cat mytext.txt | paste -d "," - -
paste -d "," - - < mytext.txt 

Viruses/GCF_000820355.1_ViralMultiSegProj14361_genomic.fna.gz,Sclerophthora macrospora virus A
Viruses/GCF_000820495.2_ViralMultiSegProj14656_genomic.fna.gz,Influenza B virus RNA
Viruses/GCF_000837105.1_ViralMultiSegProj14079_genomic.fna.gz,Tomato mottle virus

R

The R function is also paste, together with sapply:

mytext <- scan("mytext.txt", character(), sep = "\n")

sapply(seq(1, length(mytext), 2), function(x) paste(mytext[x], mytext[x + 1], sep = ","))
[1] "Viruses/GCF_000820355.1_ViralMultiSegProj14361_genomic.fna.gz,Sclerophthora macrospora virus A"
[2] "Viruses/GCF_000820495.2_ViralMultiSegProj14656_genomic.fna.gz,Influenza B virus RNA"           
[3] "Viruses/GCF_000837105.1_ViralMultiSegProj14079_genomic.fna.gz,Tomato mottle virus"   

Upvotes: 4

potong

Reputation: 58401

This might work for you (GNU sed and paste):

sed 'N;s/\n/,/' file

Append the next line to the current line and replace the newline between them with a comma.

or:

paste -sd',\n' file

Paste the file as one long string, replacing every other newline with a comma.
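
For readers more at home in Python, the delimiter cycling that paste -s performs can be imitated roughly like this. It is a sketch of the idea, not a reimplementation of paste; paste_serial is a hypothetical name and the sample lines are made up.

```python
from itertools import cycle


def paste_serial(lines, delimiters=(",", "\n")):
    """Join lines into one string, cycling through the delimiters."""
    out = []
    for line, delim in zip(lines, cycle(delimiters)):
        out.append(line.rstrip("\n"))
        out.append(delim)
    if out:
        out[-1] = "\n"  # the last line always ends with a newline
    return "".join(out)


# Made-up sample lines standing in for the contents of file
text = paste_serial(["a.fna.gz\n", "Virus A\n", "b.fna.gz\n", "Virus B\n"])
print(text, end="")
```

Because the delimiter list is `,` then `\n`, every first line of a pair is followed by a comma and every second line by a newline.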

Upvotes: 2

Donald Seinen

Reputation: 4419

Another R approach, relying on vector recycling.

t = readLines("txt.txt")
paste0(t[c(T, F)], ",", t[c(F, T)]) |> writeLines("txt.csv")

or for final file names

t = readLines("R/txt.txt")
sub("(?<=\\.\\d).*", "", t, perl = T) |>
  (\(.) paste0(.[c(T, F)], "/", .[c(F, T)], ".fna"))()

#> [1] "Viruses/GCF_000820355.1/Sclerophthora macrospora virus A.fna"
#> [2] "Viruses/GCF_000820495.2/Influenza B virus RNA.fna"           
#> [3] "Viruses/GCF_000837105.1/Tomato mottle virus.fna"  
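
A rough Python counterpart of the file-name step, for comparison. This is a sketch under the assumption that every path looks like the examples in the question, that is, a directory and an accession ending in a dot plus version digits; final_name is a hypothetical helper, and the regex is a swapped-in alternative to the lookbehind used in the R code.

```python
import re


def final_name(path, virus_name):
    """Build 'dir/accession.version/virus name.fna' from one pair."""
    # Keep everything up to the first '.<digits>' (the accession version);
    # this assumes paths shaped like the examples in the question.
    match = re.match(r".*?\.\d+", path)
    if match is None:
        raise ValueError(f"no versioned accession found in {path!r}")
    return f"{match.group(0)}/{virus_name}.fna"


name = final_name(
    "Viruses/GCF_000837105.1_ViralMultiSegProj14079_genomic.fna.gz",
    "Tomato mottle virus",
)
print(name)  # Viruses/GCF_000837105.1/Tomato mottle virus.fna
```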

Upvotes: 1

Ed Morton

Reputation: 203493

Using any awk in any shell on every Unix box and only storing 1 line at a time in memory so it'll work no matter how large your input file is:

$ awk '{ORS=(NR%2 ? "," : RS)} 1' file
Viruses/GCF_000820355.1_ViralMultiSegProj14361_genomic.fna.gz,Sclerophthora macrospora virus A
Viruses/GCF_000820495.2_ViralMultiSegProj14656_genomic.fna.gz,Influenza B virus RNA
Viruses/GCF_000837105.1_ViralMultiSegProj14079_genomic.fna.gz,Tomato mottle virus

There's a lot happening in a small amount of code above so here's an explanation:

  • ORS is the builtin variable containing the string to be printed at the end of each output record (record = line in this case), a newline by default.
  • RS is the builtin variable containing the string (or regexp) that separates each input record, a newline by default.
  • NR is the builtin variable containing the current record/line number so NR%2 is 1 for odd numbered records and 0 for even numbered.
  • NR%2 ? "," : RS is a ternary expression resulting in , for odd numbered lines, \n (or whatever else you have set RS to, e.g. \r\n) for even numbered.
  • 1 is a true condition which causes the default action of printing the current record to be executed.

So the above script says "if the current line number is odd print it with a , at the end, otherwise print it with a newline at the end", hence it's joining every pair of lines with a , between.

Upvotes: 5

sseLtaH

Reputation: 11217

Using sed

$ sed '/^Viruses/{N;s/\n\(.*\)/,\1/}' multi_headers
Viruses/GCF_000820355.1_ViralMultiSegProj14361_genomic.fna.gz,Sclerophthora macrospora virus A
Viruses/GCF_000820495.2_ViralMultiSegProj14656_genomic.fna.gz,Influenza B virus RNA
Viruses/GCF_000837105.1_ViralMultiSegProj14079_genomic.fna.gz,Tomato mottle virus

  • /^Viruses/ - Match lines starting with the string Viruses

  • {N; - Read/append the next line of input into the pattern space.

  • s/\n\(.*\)/,\1/ - Replace the \n in the pattern space with a comma, keeping the text that follows it
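
The same prefix-anchored joining could be sketched in Python as below. join_after_prefix is a hypothetical name, and the stray first line in the sample is invented purely to show that non-matching lines pass through unchanged.

```python
def join_after_prefix(lines, prefix="Viruses"):
    """Join each line that starts with prefix to the line after it."""
    out = []
    it = iter(lines)
    for line in it:
        line = line.rstrip("\n")
        if line.startswith(prefix):
            following = next(it, None)  # None when the prefix line is last
            if following is not None:
                line = line + "," + following.rstrip("\n")
        out.append(line)
    return out


sample = [
    "stray note\n",  # hypothetical line that should stay untouched
    "Viruses/GCF_000837105.1_ViralMultiSegProj14079_genomic.fna.gz\n",
    "Tomato mottle virus\n",
]
joined = join_after_prefix(sample)
print(joined)
```

Anchoring on the prefix, rather than on line parity, keeps the pairing correct even if an unexpected line shifts the odd/even positions.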

Upvotes: 2

Sharim09

Reputation: 6214

What about this?

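with open('test.txt') as f:
    # splitlines() avoids the empty last element that split('\n') produces
    # when the file ends with a newline
    data = f.read().splitlines()
new_data = []

# stop before an unpaired last line so data[a+1] always exists
for a in range(0, len(data) - 1, 2):
    new_data.append(data[a]+','+data[a+1]+'\n')

with open('result.txt','w') as f:
    f.writelines(new_data)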

or

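with open('test.txt') as f_read, open('result.txt','w') as f_write:
    # splitlines() avoids the empty last element that split('\n') produces
    # when the file ends with a newline
    data = f_read.read().splitlines()
    new_data = []

    # stop before an unpaired last line so data[a+1] always exists
    for a in range(0, len(data) - 1, 2):
        new_data.append(data[a]+','+data[a+1]+'\n')

    f_write.writelines(new_data)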

Upvotes: 1
