Sonny
Sonny

Reputation: 75

from text file to csv using Python

I need help in parsing a very long text file which looks like:

NAME         IMP4   
DESCRIPTION  small nucleolar ribonucleoprotein 
CLASS        Genetic Information Processing
             Translation
             Ribosome biogenesis in eukaryotes
DBLINKS      NCBI-GI: 15529982
             NCBI-GeneID: 92856
             OMIM: 612981
///
NAME         COMMD9
DESCRIPTION  COMM domain containing 9
ORGANISM     H.sapiens
DBLINKS      NCBI-GI: 156416007
             NCBI-GeneID: 29099
             OMIM: 612299
///
.....

I want to obtain a structured csv file, with the same number of columns in every row, in order to extract easily the information I need.

First I tried in this way:

for line in a:
    if '///' not in line:
        b.write(''.join(line.replace('\n', '\t')))
    else:
    b.write('\n')

obtaining a csv like this:

NAME         IMP4\tDESCRIPTION  small nucleolar ribonucleoprotein\tCLASS        Genetic Information Processing\t             Translation\t             Ribosome biogenesis in eukaryotes\tDBLINKS      NCBI-GI: 15529982\t            NCBI-GeneID: 92856\t
         OMIM: 612981
NAME         COMMD9\tDESCRIPTION  COMM domain containing 9\tORGANISM     H.sapiens\tDBLINKS      NCBI-GI: 156416007\t             NCBI-GeneID: 29099t\             OMIM: 612299

The main problem is given by the fact that fields like DBLINKS, that in the original file are in multiple lines, in this way result split in several fields, while I need to have it all in one. Moreover, not all the fields are present in every line, for instance the fields 'CLASS' and 'ORGANISM' in the example.

The file I'd like to obtain should look like:

NAME         IMP4\tDESCRIPTION  small nucleolar ribonucleoprotein\tNA\tCLASS        Genetic Information Processing; Translation; Ribosome biogenesis in eukaryotes\tDBLINKS      NCBI-GI: 15529982; NCBI-GeneID: 92856; OMIM: 612981
NAME         COMMD9\tDESCRIPTION  COMM domain containing 9\tORGANISM     H.sapiens\tNA\tDBLINKS      NCBI-GI: 156416007; NCBI-GeneID: 29099; OMIM: 612299

Could you please help me?

Upvotes: 3

Views: 3062

Answers (2)

unutbu
unutbu

Reputation: 880767

You could use itertools.groupby, once to collect lines into records, and a second time to collect multi-line fields into an iterator:

import csv
import itertools

def is_end_of_record(line):
    return line.startswith('///')

class FieldClassifier(object):
    def __init__(self):
        self.field=''
    def __call__(self,row):
        if not row[0].isspace():
            self.field=row.split(' ',1)[0]
        return self.field

fields='NAME DESCRIPTION ORGANISM CLASS DBLINKS'.split()
with open('data','r') as f:
    for end_of_record, lines in itertools.groupby(f,is_end_of_record):
        if not end_of_record:
            classifier=FieldClassifier()
            record={}
            for fieldname, row in itertools.groupby(lines,classifier):
                record[fieldname]='; '.join(r.strip() for r in row)
            print('\t'.join(record.get(fieldname,'NA') for fieldname in fields))

yields

NAME         IMP4   DESCRIPTION  small nucleolar ribonucleoprotein  NA  CLASS        Genetic Information Processing; Translation; Ribosome biogenesis in eukaryotes DBLINKS      NCBI-GI: 15529982; NCBI-GeneID: 92856; OMIM: 612981
NAME         COMMD9 DESCRIPTION  COMM domain containing 9   ORGANISM     H.sapiens  NA  DBLINKS      NCBI-GI: 156416007; NCBI-GeneID: 29099; OMIM: 612299

Above is the output as you would see it printed. It matches the desired output you posted, assuming you are showing the repr of that output.


References to tools used:

Upvotes: 5

Marijn van Vliet
Marijn van Vliet

Reputation: 5409

This script will convert your textfile into a valid CSV file (that can be read with Excel for instance):

import sys
from sets import Set

if len(sys.argv) < 2:
    print 'Usage: %s <input-file> <output-file>' % sys.argv[0]
    sys.exit(1)

entries = []
entry = {}

# Read the input file
with open(sys.argv[1]) as input:
    lines = input.readlines()

for line in lines:
    # Check for beginning of new entry
    if line.strip() == '///':
        if len(entry) > 0:
            entries.append(entry)
        entry = {}
        continue

    # Check for presense of key
    possible_key = line[:13].strip()
    if possible_key != '':
        key = possible_key
        entry[key] = []

    # Assemble the value
    if key:
        entry[key].append(line[13:].strip())

# Append the last entry
if len(entry) > 0:
    entries.append(entry)

# 'entries' now contains a list of a dict of a list

# Find out all possible keys
all_keys = Set()
for entry in entries:
    all_keys.union_update(entry.keys())

# Write all entries to the output file
with open(sys.argv[2], 'w') as output:
    # The first line will contain the keys
    output.write(','.join(['"%s"' % key for key in sorted(all_keys)]))
    output.write('\r\n')

    # Write each entry
    for entry in entries:
       output.write(','.join(['"%s"' % ';'.join(entry[key]) if key in entry else '' for key in sorted(all_keys)]))
       output.write('\r\n')

Upvotes: 0

Related Questions