Reputation: 75
I need help in parsing a very long text file which looks like:
NAME IMP4
DESCRIPTION small nucleolar ribonucleoprotein
CLASS Genetic Information Processing
Translation
Ribosome biogenesis in eukaryotes
DBLINKS NCBI-GI: 15529982
NCBI-GeneID: 92856
OMIM: 612981
///
NAME COMMD9
DESCRIPTION COMM domain containing 9
ORGANISM H.sapiens
DBLINKS NCBI-GI: 156416007
NCBI-GeneID: 29099
OMIM: 612299
///
.....
I want to obtain a structured csv file, with the same number of columns in every row, in order to extract easily the information I need.
First I tried in this way:
for line in a:
if '///' not in line:
b.write(''.join(line.replace('\n', '\t')))
else:
b.write('\n')
obtaining a csv like this:
NAME IMP4\tDESCRIPTION small nucleolar ribonucleoprotein\tCLASS Genetic Information Processing\t Translation\t Ribosome biogenesis in eukaryotes\tDBLINKS NCBI-GI: 15529982\t NCBI-GeneID: 92856\t
OMIM: 612981
NAME COMMD9\tDESCRIPTION COMM domain containing 9\tORGANISM H.sapiens\tDBLINKS NCBI-GI: 156416007\t NCBI-GeneID: 29099t\ OMIM: 612299
The main problem is given by the fact that fields like DBLINKS, that in the original file are in multiple lines, in this way result split in several fields, while I need to have it all in one. Moreover, not all the fields are present in every line, for instance the fields 'CLASS' and 'ORGANISM' in the example.
The file I'd like to obtain should look like:
NAME IMP4\tDESCRIPTION small nucleolar ribonucleoprotein\tNA\tCLASS Genetic Information Processing; Translation; Ribosome biogenesis in eukaryotes\tDBLINKS NCBI-GI: 15529982; NCBI-GeneID: 92856; OMIM: 612981
NAME COMMD9\tDESCRIPTION COMM domain containing 9\tORGANISM H.sapiens\tNA\tDBLINKS NCBI-GI: 156416007; NCBI-GeneID: 29099; OMIM: 612299
Could you please help me?
Upvotes: 3
Views: 3062
Reputation: 880767
You could use itertools.groupby, once to collect lines into records, and a second time to collect multi-line fields into an iterator:
import csv
import itertools
def is_end_of_record(line):
return line.startswith('///')
class FieldClassifier(object):
def __init__(self):
self.field=''
def __call__(self,row):
if not row[0].isspace():
self.field=row.split(' ',1)[0]
return self.field
fields='NAME DESCRIPTION ORGANISM CLASS DBLINKS'.split()
with open('data','r') as f:
for end_of_record, lines in itertools.groupby(f,is_end_of_record):
if not end_of_record:
classifier=FieldClassifier()
record={}
for fieldname, row in itertools.groupby(lines,classifier):
record[fieldname]='; '.join(r.strip() for r in row)
print('\t'.join(record.get(fieldname,'NA') for fieldname in fields))
yields
NAME IMP4 DESCRIPTION small nucleolar ribonucleoprotein NA CLASS Genetic Information Processing; Translation; Ribosome biogenesis in eukaryotes DBLINKS NCBI-GI: 15529982; NCBI-GeneID: 92856; OMIM: 612981
NAME COMMD9 DESCRIPTION COMM domain containing 9 ORGANISM H.sapiens NA DBLINKS NCBI-GI: 156416007; NCBI-GeneID: 29099; OMIM: 612299
Above is the output as you would see it printed. It matches the desired output you posted, assuming you are showing the repr
of that output.
References to tools used:
__call__
methodUpvotes: 5
Reputation: 5409
This script will convert your textfile into a valid CSV file (that can be read with Excel for instance):
import sys
from sets import Set
if len(sys.argv) < 2:
print 'Usage: %s <input-file> <output-file>' % sys.argv[0]
sys.exit(1)
entries = []
entry = {}
# Read the input file
with open(sys.argv[1]) as input:
lines = input.readlines()
for line in lines:
# Check for beginning of new entry
if line.strip() == '///':
if len(entry) > 0:
entries.append(entry)
entry = {}
continue
# Check for presense of key
possible_key = line[:13].strip()
if possible_key != '':
key = possible_key
entry[key] = []
# Assemble the value
if key:
entry[key].append(line[13:].strip())
# Append the last entry
if len(entry) > 0:
entries.append(entry)
# 'entries' now contains a list of a dict of a list
# Find out all possible keys
all_keys = Set()
for entry in entries:
all_keys.union_update(entry.keys())
# Write all entries to the output file
with open(sys.argv[2], 'w') as output:
# The first line will contain the keys
output.write(','.join(['"%s"' % key for key in sorted(all_keys)]))
output.write('\r\n')
# Write each entry
for entry in entries:
output.write(','.join(['"%s"' % ';'.join(entry[key]) if key in entry else '' for key in sorted(all_keys)]))
output.write('\r\n')
Upvotes: 0