Reputation: 548
I need to convert a huge number of files in structured text format into excel (csv would work) to be able to merge them with some other data I have. Here is a sample of the text:
FILER:
COMPANY DATA:
COMPANY CONFORMED NAME: NORTHQUEST CAPITAL FUND INC
CENTRAL INDEX KEY: 0001142728
IRS NUMBER: 223772454
STATE OF INCORPORATION: NJ
FISCAL YEAR END: 1231
FILING VALUES:
FORM TYPE: NSAR-A
SEC ACT: 1940 Act
SEC FILE NUMBER: 811-10419
FILM NUMBER: 03805344
BUSINESS ADDRESS:
STREET 1: 16 RIMWOOD LANE
CITY: COLTS NECK
STATE: NJ
ZIP: 07722
BUSINESS PHONE: 7328423504
FORMER COMPANY:
FORMER CONFORMED NAME: NORTHPOINT CAPITAL FUND INC
DATE OF NAME CHANGE: 20010615
</SEC-HEADER>
<DOCUMENT>
<TYPE>NSAR-A
<SEQUENCE>1
<FILENAME>answer.fil
<DESCRIPTION>ANSWER.FIL
<TEXT>
<PAGE> PAGE 1
000 A000000 06/30/2003
000 C000000 0001142728
000 D000000 N
000 E000000 NF
000 F000000 Y
000 G000000 N
000 H000000 N
000 I000000 6.1
000 J000000 A
001 A000000 NORTHQUEST CAPITAL FUND, INC.
001 B000000 811-10493
001 C000000 7328921057
002 A000000 16 RIMWOOD LANE
002 B000000 COLTS NECK
002 C000000 NJ
002 D010000 07722
003 000000 N
004 000000 N
005 000000 N
006 000000 N
007 A000000 N
007 B000000 0
007 C010100 1
007 C010200 2
007 C010300 3
007 C010400 4
007 C010500 5
007 C010600 6
007 C010700 7
007 C010800 8
007 C010900 9
007 C011000 10
008 A000001 EMERALD RESEARCH CORP.
008 B000001 A
008 C000001 801-60455
008 D010001 BRICK
008 D020001 NJ
008 D030001 08724
013 A000001 SANVILLE & COMPANY
013 B010001 ABINGTON
013 B020001 PA
013 B030001 19001
015 A000001 FLEET BANK
015 B000001 C
015 C010001 POINT PLEASANT BEACH
015 C020001 NJ
015 C030001 08742
015 E030001 X
018 000000 Y
019 A000000 N
019 B000000 0
<PAGE> PAGE 2
020 A000001 SCHWAB
020 B000001 94-1737782
020 C000001 0
020 A000002 BESTVEST BROOKERAGE
020 B000002 23-1452837
020 C000002 0
and it continues to page 8 of the same structure. The information about the company's name should go into relative columns and the rest should be like the first two values are the column names and the third value would be the value of the row.
I was trying to work it out with pyparsing but haven't been able to successfully do so. Any comment on the approach would be helpful.
Upvotes: 1
Views: 1304
Reputation: 19770
The way you describe it, these are like key:value pairs for each file. I would handle the parsing part like this:
import sys
import re
import csv
colonseperated = re.compile(' *(.+) *: *(.+) *')
fixedfields = re.compile('(\d{3} \w{7}) +(.*)')
matchers = [colonseperated, fixedfields]
outfile = csv.writer(open('out.csv', 'w'))
outfile.writerow(['Filename', 'Key', 'Value'])
for filename in sys.argv[1:]:
for line in open(filename):
line = line.strip()
for matcher in matchers:
match = matcher.match(line)
if match:
outfile.writerow([filename] + list(match.groups()))
You can call this something like parser.py
and call it with python parser.py *.infile
or whatever your filename convention is. It will create a csv file with three columns: a filename, a key and a value. You can open this in excel and then use a pivot table to get the values into the correct format.
Alternatively you can use this:
import csv
headers = []
rows = {}
filenames = []
outfile = csv.writer(open('flat.csv', 'w'))
infile = csv.reader(open('out.csv'))
infile.next()
for filename, key, value in infile:
if not filename in rows:
rows[filename] = {}
filenames.append(filename)
if key not in headers:
headers.append(key)
rows[filename][key] = value
outfile.writerow(headers)
for filename in filenames:
outfile.writerow([rows[filename].get(header, '') for header in headers])
Upvotes: 2