from text file to csv using Python

Question

I need help parsing a very long text file that looks like this:

NAME         IMP4   
DESCRIPTION  small nucleolar ribonucleoprotein 
CLASS        Genetic Information Processing
             Translation
             Ribosome biogenesis in eukaryotes
DBLINKS      NCBI-GI: 15529982
             NCBI-GeneID: 92856
             OMIM: 612981
///
NAME         COMMD9
DESCRIPTION  COMM domain containing 9
ORGANISM     H.sapiens
DBLINKS      NCBI-GI: 156416007
             NCBI-GeneID: 29099
             OMIM: 612299
///
.....

I want to obtain a structured csv file, with the same number of columns in every row, so that I can easily extract the information I need.

First I tried this:

for line in a:
    if '///' not in line:
        b.write(''.join(line.replace('\n', '\t')))
    else:
        b.write('\n')

which produces a csv like this:

NAME         IMP4	DESCRIPTION  small nucleolar ribonucleoprotein	CLASS        Genetic Information Processing	             Translation	             Ribosome biogenesis in eukaryotes	DBLINKS      NCBI-GI: 15529982	             NCBI-GeneID: 92856	             OMIM: 612981
NAME         COMMD9	DESCRIPTION  COMM domain containing 9	ORGANISM     H.sapiens	DBLINKS      NCBI-GI: 156416007	             NCBI-GeneID: 29099	             OMIM: 612299

The main problem is that fields such as DBLINKS, which span multiple lines in the original file, end up split across several columns, whereas I need each of them in a single column. Moreover, not every field is present in every record, for instance 'CLASS' and 'ORGANISM' in the example.

The file I'd like to obtain should look like:

NAME         IMP4	DESCRIPTION  small nucleolar ribonucleoprotein	NA	CLASS        Genetic Information Processing; Translation; Ribosome biogenesis in eukaryotes	DBLINKS      NCBI-GI: 15529982; NCBI-GeneID: 92856; OMIM: 612981
NAME         COMMD9	DESCRIPTION  COMM domain containing 9	ORGANISM     H.sapiens	NA	DBLINKS      NCBI-GI: 156416007; NCBI-GeneID: 29099; OMIM: 612299

Could you please help me?

unutbu · Accepted Answer

You could use itertools.groupby, once to collect lines into records, and a second time to collect multi-line fields into an iterator:

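import itertools

def is_end_of_record(line):
    # Records are terminated by a line starting with '///'.
    return line.startswith('///')

class FieldClassifier(object):
    """Stateful key function: returns the name of the field a line belongs to."""
    def __init__(self):
        self.field = ''
    def __call__(self, row):
        # A line starting with a non-space character opens a new field;
        # indented continuation lines keep the previous field name.
        if not row[0].isspace():
            self.field = row.split(' ', 1)[0]
        return self.field

fields = 'NAME DESCRIPTION ORGANISM CLASS DBLINKS'.split()
with open('data', 'r') as f:
    # First groupby: split the file into records at the '///' separators.
    for end_of_record, lines in itertools.groupby(f, is_end_of_record):
        if not end_of_record:
            classifier = FieldClassifier()
            record = {}
            # Second groupby: collect the lines of each (possibly multi-line) field.
            for fieldname, row in itertools.groupby(lines, classifier):
                record[fieldname] = '; '.join(r.strip() for r in row)
            # Missing fields become 'NA' so every row has the same columns.
            print('\t'.join(record.get(fieldname, 'NA') for fieldname in fields))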

yields

NAME         IMP4   DESCRIPTION  small nucleolar ribonucleoprotein  NA  CLASS        Genetic Information Processing; Translation; Ribosome biogenesis in eukaryotes DBLINKS      NCBI-GI: 15529982; NCBI-GeneID: 92856; OMIM: 612981
NAME         COMMD9 DESCRIPTION  COMM domain containing 9   ORGANISM     H.sapiens  NA  DBLINKS      NCBI-GI: 156416007; NCBI-GeneID: 29099; OMIM: 612299

Above is the output as you would see it printed. It matches the desired output you posted, assuming you are showing the repr of that output.
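If you want to write a real tab-delimited file rather than print each row, the same idea can be combined with csv.writer. The following is only a sketch under a few assumptions: the helper name parse_records is hypothetical, the sample is read from an in-memory string instead of your file, and, unlike the code above, it strips the field name from each value and writes a header row instead.

```python
import csv
import io
import itertools

FIELDS = 'NAME DESCRIPTION ORGANISM CLASS DBLINKS'.split()

def parse_records(lines, fields=FIELDS, missing='NA'):
    """Yield one list of column values per '///'-terminated record.

    Hypothetical helper, not part of any library.
    """
    is_end = lambda line: line.startswith('///')
    for end_of_record, group in itertools.groupby(lines, is_end):
        if end_of_record:
            continue
        record = {}
        current = None
        for line in group:
            if not line.strip():
                continue  # ignore blank lines
            if not line[0].isspace():
                # New field: the first word is the name, the rest is the value.
                parts = line.split(None, 1)
                current = parts[0]
                value = parts[1] if len(parts) > 1 else ''
            else:
                # Indented line: continuation of the current field.
                value = line
            record.setdefault(current, []).append(value.strip())
        if record:
            yield ['; '.join(record[f]) if f in record else missing
                   for f in fields]

sample = """\
NAME         IMP4
DESCRIPTION  small nucleolar ribonucleoprotein
CLASS        Genetic Information Processing
             Translation
             Ribosome biogenesis in eukaryotes
DBLINKS      NCBI-GI: 15529982
             NCBI-GeneID: 92856
             OMIM: 612981
///
NAME         COMMD9
DESCRIPTION  COMM domain containing 9
ORGANISM     H.sapiens
DBLINKS      NCBI-GI: 156416007
             NCBI-GeneID: 29099
             OMIM: 612299
///
"""

rows = list(parse_records(io.StringIO(sample)))

# Write to an in-memory buffer here; pass an open file object instead
# to produce a file on disk.
out = io.StringIO()
writer = csv.writer(out, delimiter='\t', lineterminator='\n')
writer.writerow(FIELDS)
writer.writerows(rows)
print(out.getvalue())
```

Putting the field names in a header row keeps each cell limited to its value, which tends to be easier to work with downstream. If you prefer the exact layout from your question, keep the name in the value and skip the header.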

References to tools used:

itertools.groupby
a class with a __call__ method
str.join with a generator expression for which it helps to first understand list comprehension
dict.get method with a default value
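
To see in isolation how a callable object can serve as a stateful key for itertools.groupby, here is a small sketch. The class name FirstWordKey and the four sample lines are made up for illustration; the point is that indented lines inherit the key of the last non-indented line, so consecutive lines of one field end up in the same group.

```python
import itertools

class FirstWordKey:
    """Key function that remembers the last field name it saw."""
    def __init__(self):
        self.field = ''

    def __call__(self, line):
        # Only a line that starts in column 0 changes the remembered name.
        if not line[0].isspace():
            self.field = line.split(' ', 1)[0]
        return self.field

lines = [
    'CLASS        Genetic Information Processing',
    '             Translation',
    'DBLINKS      NCBI-GI: 15529982',
    '             NCBI-GeneID: 92856',
]

# groupby starts a new group each time the key changes between neighbours.
grouped = [(key, [line.strip() for line in group])
           for key, group in itertools.groupby(lines, FirstWordKey())]

for key, values in grouped:
    print(key, '->', '; '.join(values))
```

A fresh key object is needed for every record, otherwise the name remembered from the previous record would leak into the next one.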
