damorph

Reputation: 1

Parsing a text file and outputting to new file

I am a complete beginner to Python. I have some text files that I need to reformat. I need to take the field data from lines that start with a certain marker and write it to a new file, with all the fields separated by a delimiter of my choosing.

Here is a short example.

; Record 1
@FULLTEXT PAGE
@T R000358
@C ENDDOC# R000358
@C BEGATTACH R000358
@C ENDATTACH R000358
@C MAILSTORE No
@C AUTHOR 
@C BCC 
@C CC 
@C COMMENTS 
@C ATTACH 
@C DATECREATED 11/23/2010
@C DATELASTMOD 07/18/2010
@C DATELASTPRNT 
@C DATERCVD 
@C DATESENT 
@C FILENAME wrangling.wpd
@C LASTAUTHOR 
@C ORGANIZATION 
@C REVISION 
@C SUBJECT 
@C TIMEACCESSED 00:00:00
@C TIMECREATED 15:21:34
@C TIMELASTMOD 09:04:12
@C TIMELASTPRNT 
@C TIMERCVD 
@C TIMESENT 
@C TITLE 
@C TO 
@C FROM 

For each 'Record', the '@C' or '@T' marker starts a field. It is followed by a space, then the field name, another space, and then the field data. I need all the field data delimited in one row rather than in a column as shown above.

I am looking to output each record to a new file, something like this:

"R000358","R000358","R000358","R000358","No",etc, etc. (in one row)

This example is comma-delimited. The delimiter may change, but I figured I would start there.

Any help would be appreciated. Thanks in advance.

Upvotes: 0

Views: 211

Answers (4)

ekhumoro

Reputation: 120608

record = None
records = []

with open('records.dat') as stream:
    for line in stream:
        item = line.strip().split()
        if not item:
            continue  # skip blank lines
        if item[0] == ';':
            # a '; Record N' line starts a new record, identified by N
            record = []
            records.append((item[-1], record))
        elif record is not None:
            if item[0] == '@C' and len(item) <= 2:
                record.append('')  # '@C NAME' with no data: empty field
            elif item[0] in ('@T', '@C'):
                # keeps only the last word, so multi-word values are truncated
                record.append(item[-1])

for identifier, record in records:
    print('[%s]: %s' % (identifier, ', '.join(record)))

Upvotes: 0

hans gruber

Reputation: 1

Start by opening the file:

with open('inputfile','r') as fil:
    pass  # file read-in stuff here

Use the with idiom if you're using Python 2.6 or later (in Python 2.5 it requires from __future__ import with_statement); otherwise do:

fil = open('inputfile','r')  # open outside try: if this fails, there is nothing to close
try:
    pass  # file read-in stuff here
finally:
    fil.close()

To read the file contents into strings, check out file.readline() (reads one line at a time; use for big files) and file.readlines() (reads the entire file into a list, one string per line).
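As a rough illustration of the difference, here is a small sketch that assumes Python 3. It uses io.StringIO as a stand-in for an open file so it runs without any file on disk; the sample text is taken from the question. Iterating over the file object directly is a third option that also reads one line at a time.

```python
import io

# Stand-in for an open file; io.StringIO behaves like a text file object.
sample = io.StringIO("; Record 1\n@T R000358\n@C MAILSTORE No\n")

first_line = sample.readline()   # reads a single line, newline included
remaining = sample.readlines()   # reads everything that is left into a list

sample.seek(0)                   # rewind to the start of the "file"
# Iterating over the file object yields one line at a time.
stripped = [line.rstrip("\n") for line in sample]
```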

To write the file, use the same logic as for reading, but open the file in write mode, like this: open('outputfile','w')

To handle formatting for your output file, look at the string methods. Specifically, take a look at str.split() and str.join(), which let you easily split strings into lists and concatenate list elements into strings by delimiter.
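A minimal sketch of how str.split() and str.join() might be combined for the format in the question, assuming Python 3. The helper name field_value is made up for this example, and it assumes every line passed in starts with '@T' or '@C'. It does not escape quote characters inside values; the standard csv module handles that if it matters.

```python
def field_value(line):
    """Return the data part of an '@T' or '@C' line ('' when the field is empty)."""
    parts = line.split(None, 2)  # split on whitespace, at most twice
    if parts[0] == "@T":
        # '@T' lines carry no field name, e.g. '@T R000358'
        return line.split(None, 1)[1].strip() if len(parts) > 1 else ""
    # '@C' lines look like '@C NAME value'; the value may be missing
    return parts[2].strip() if len(parts) > 2 else ""

lines = ["@T R000358", "@C MAILSTORE No", "@C AUTHOR ", "@C FILENAME wrangling.wpd"]
values = [field_value(line) for line in lines]

# Wrap each value in double quotes and join with the chosen delimiter.
row = ",".join('"%s"' % value for value in values)
```

Changing the delimiter only means changing the string that join() is called on.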

Upvotes: 0

GeneralBecos

Reputation: 2556

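def getRecordRows(file, start_characters, delimiter):
    returnRows = []
    with open(file) as stream:  # closes the file when done
        for line in stream:
            if line.startswith(start_characters):
                # drop the marker, surrounding whitespace and the newline
                returnRows.append(line[len(start_characters):].strip())
    return delimiter.join(returnRows)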

Example usage:

file = '/path/to/file'
getRecordRows(file, '@T', ',')

Upvotes: 0

Raymond Hettinger

Reputation: 226336

It is unclear how the records are delimited and what exactly you would like to do with your output, but here is a simple parser that should get you started:

s = '''\
; Record 1
@FULLTEXT PAGE
@T R000358
@C ENDDOC# R000358
@C BEGATTACH R000358
@C ENDATTACH R000358
@C MAILSTORE No
@C AUTHOR 
@C BCC 
@C CC 
@C COMMENTS 
@C ATTACH 
@C DATECREATED 11/23/2010
@C DATELASTMOD 07/18/2010
@C DATELASTPRNT 
@C DATERCVD 
@C DATESENT 
@C FILENAME wrangling.wpd
@C LASTAUTHOR 
@C ORGANIZATION 
@C REVISION 
@C SUBJECT 
@C TIMEACCESSED 00:00:00
@C TIMECREATED 15:21:34
@C TIMELASTMOD 09:04:12
@C TIMELASTPRNT 
@C TIMERCVD 
@C TIMESENT 
@C TITLE 
@C TO 
@C FROM
'''.splitlines()

records = []
record = {}
for line in s:
    if line.startswith('; Record'):
        record = {}
        records.append(record)
    elif line.startswith(('@T ', '@C ')):
        f = line.split()
        fieldname = f[1]
        i = line.find(fieldname) + len(fieldname)
        fieldvalue = line[i:].lstrip()
        record[fieldname] = fieldvalue

import pprint
pprint.pprint(records)
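Once the records are parsed, one possible way to produce the quoted, delimited rows from the question is the standard csv module. The sketch below assumes Python 3. The records list and the write_rows helper are hypothetical, made up for this example: each record is a list of (field name, value) pairs rather than a dict, so the columns keep file order on any Python version.

```python
import csv
import io

# Hypothetical parsed records: one list of (field name, value) pairs per record.
records = [
    [("T", "R000358"), ("ENDDOC#", "R000358"), ("MAILSTORE", "No"), ("AUTHOR", "")],
]

def write_rows(records, stream, delimiter=","):
    """Write one fully quoted, delimited row per record to an open text stream."""
    writer = csv.writer(stream, delimiter=delimiter, quoting=csv.QUOTE_ALL,
                        lineterminator="\n")
    for record in records:
        writer.writerow([value for _name, value in record])

out = io.StringIO()  # stand-in for open('output.csv', 'w', newline='')
write_rows(records, out)
result = out.getvalue()
```

Passing a different delimiter argument, such as '|' or '\t', changes the separator without touching the parsing code.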

Good luck with Python.

Upvotes: 1
