Reputation: 1
I am a complete beginner to Python. I have text files that I need to reformat. I need to take the field data from lines that start with a certain marker and write it to a new file, with all the fields delimited by a character of my choosing.
Here is a short example.
; Record 1
@FULLTEXT PAGE
@T R000358
@C ENDDOC# R000358
@C BEGATTACH R000358
@C ENDATTACH R000358
@C MAILSTORE No
@C AUTHOR
@C BCC
@C CC
@C COMMENTS
@C ATTACH
@C DATECREATED 11/23/2010
@C DATELASTMOD 07/18/2010
@C DATELASTPRNT
@C DATERCVD
@C DATESENT
@C FILENAME wrangling.wpd
@C LASTAUTHOR
@C ORGANIZATION
@C REVISION
@C SUBJECT
@C TIMEACCESSED 00:00:00
@C TIMECREATED 15:21:34
@C TIMELASTMOD 09:04:12
@C TIMELASTPRNT
@C TIMERCVD
@C TIMESENT
@C TITLE
@C TO
@C FROM
In each 'Record', '@C' or '@T' marks a field: the marker is followed by a space, then the field name, then a space, then the field data. I need all the field data delimited in one row rather than in a column as shown above.
I am looking to output each record to a new file, something like this:
"R000358","R000358","R000358","R000358","No",etc, etc. (in one row)
This example is comma-delimited. The delimiter may change, but I figured I would start there.
Any help would be appreciated. Thanks in advance.
Upvotes: 0
Views: 211
Reputation: 120608
record = None
records = []
with open('records.dat') as stream:
    for line in stream:
        item = line.strip().split()
        if not item:
            continue  # skip blank lines
        if item[0] == ';':
            # '; Record N' starts a new record, labelled by its number
            record = []
            records.append((item[-1], record))
        elif record is not None:
            if item[0] == '@C' and len(item) <= 2:
                # '@C NAME' with no data: keep an empty placeholder
                record.append('')
            elif item[0] in ('@T', '@C'):
                # the last token on the line is taken as the field data
                record.append(item[-1])
for identifier, record in records:
    print('[%s]: %s' % (identifier, ', '.join(record)))
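Note that taking the last token drops part of any value that contains spaces. A minimal sketch of one way around that, assuming '@T' lines carry only a value and '@C' lines carry a field name optionally followed by a value; `parse_field` and the "my file.wpd" value are hypothetical names used for illustration, not part of the code above:

```python
def parse_field(line):
    """Return (name, value) for an '@T' or '@C' line, or None for any other line."""
    # Split into at most three parts so the value keeps its internal spaces.
    parts = line.strip().split(None, 2)
    if not parts or parts[0] not in ("@T", "@C"):
        return None
    if parts[0] == "@T":
        # Assumption: everything after '@T' is the value; label it 'T'.
        return ("T", " ".join(parts[1:]))
    name = parts[1] if len(parts) > 1 else ""
    value = parts[2] if len(parts) > 2 else ""
    return (name, value)


print(parse_field("@C FILENAME my file.wpd"))
print(parse_field("@C AUTHOR"))
```

The returned value could be appended to `record` in place of `item[-1]`.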
Upvotes: 0
Reputation: 1
Start by opening the file:
with open('inputfile', 'r') as fil:
    # file read-in stuff here
    pass
Use the with idiom if you're using Python 2.6 and up (in 2.5 it needs from __future__ import with_statement); otherwise do:
fil = open('inputfile', 'r')
try:
    # file read-in stuff here
    pass
finally:
    fil.close()
To read the file contents into strings, check out file.readline() (reads one line at a time; use for big files) and file.readlines() (reads the entire file into a list, one string per line) in the Python documentation.
To write the file, use the same logic as for reading, but open the file in write mode: open('outputfile', 'w').
To handle formatting for your output file, look at the string methods in the Python documentation. Specifically, take a look at str.split() and str.join(), which let you easily split strings into lists and join list elements into a string with a delimiter.
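To tie the pieces together, here is a minimal sketch of the read, split, join, write cycle. It assumes Python 3 and uses io.StringIO objects as in-memory stand-ins for files opened with open(..., 'r') and open(..., 'w'); `reformat_line` and `convert` are hypothetical helper names:

```python
import io


def reformat_line(line, delimiter=","):
    """Split a line on whitespace and re-join its tokens with the delimiter."""
    return delimiter.join(line.split())


def convert(src, dst, delimiter=","):
    """Copy every non-blank line from src to dst, re-delimited."""
    for line in src:
        if line.strip():
            dst.write(reformat_line(line, delimiter) + "\n")


# In-memory stand-ins for the input and output files.
src = io.StringIO("@C MAILSTORE No\n\n@C FILENAME wrangling.wpd\n")
dst = io.StringIO()
convert(src, dst, "|")
print(dst.getvalue())
```

With real files, src and dst would be the objects returned by open() inside a with block.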
Upvotes: 0
Reputation: 2556
def getRecordRows(file, start_characters, delimiter):
    returnRows = []
    with open(file) as stream:
        for line in stream:
            if line.startswith(start_characters):
                # keep the text after the marker, without the leading space and newline
                returnRows.append(line[len(start_characters):].strip())
    return delimiter.join(returnRows)
Example usage:
file = '/path/to/file'
print(getRecordRows(file, '@T', ','))
Upvotes: 0
Reputation: 226336
It is unclear how the records are delimited and what exactly you would like to do with your output, but here is a simple parser that should get you started:
s = '''\
; Record 1
@FULLTEXT PAGE
@T R000358
@C ENDDOC# R000358
@C BEGATTACH R000358
@C ENDATTACH R000358
@C MAILSTORE No
@C AUTHOR
@C BCC
@C CC
@C COMMENTS
@C ATTACH
@C DATECREATED 11/23/2010
@C DATELASTMOD 07/18/2010
@C DATELASTPRNT
@C DATERCVD
@C DATESENT
@C FILENAME wrangling.wpd
@C LASTAUTHOR
@C ORGANIZATION
@C REVISION
@C SUBJECT
@C TIMEACCESSED 00:00:00
@C TIMECREATED 15:21:34
@C TIMELASTMOD 09:04:12
@C TIMELASTPRNT
@C TIMERCVD
@C TIMESENT
@C TITLE
@C TO
@C FROM
'''.splitlines()
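import pprint

records = []
record = {}
for line in s:
    if line.startswith('; Record'):
        # a '; Record N' line starts a new record
        record = {}
        records.append(record)
    elif line.startswith(('@T ', '@C ')):
        f = line.split()
        # the token after the marker is used as the field name, so for
        # '@T R000358' the name is 'R000358' and the value is empty
        fieldname = f[1]
        # everything after the field name is the value (may contain spaces)
        i = line.find(fieldname) + len(fieldname)
        fieldvalue = line[i:].lstrip()
        record[fieldname] = fieldvalue
pprint.pprint(records)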
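If the goal is one quoted, delimited row per record, the standard csv module can handle the quoting. This is only a sketch, assuming Python 3 and a known field order; `write_rows`, the `fieldnames` list, and the sample record are made up for illustration:

```python
import csv
import io


def write_rows(records, fieldnames, out, delimiter=","):
    """Write one fully quoted row per record, with fields in the given order."""
    writer = csv.writer(out, delimiter=delimiter,
                        quoting=csv.QUOTE_ALL, lineterminator="\n")
    for record in records:
        # Fields missing from a record are written as empty strings.
        writer.writerow([record.get(name, "") for name in fieldnames])


# Hypothetical sample in the same shape as the parsed records.
records = [{"ENDDOC#": "R000358", "MAILSTORE": "No",
            "AUTHOR": "", "FILENAME": "wrangling.wpd"}]
fieldnames = ["ENDDOC#", "MAILSTORE", "AUTHOR", "FILENAME"]

out = io.StringIO()  # stand-in for open('outputfile', 'w', newline='')
write_rows(records, fieldnames, out)
print(out.getvalue())
```

Changing the delimiter argument (for example to "|" or "\t") switches the separator without touching the rest of the code.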
Good luck with Python.
Upvotes: 1