Reputation: 341
I have built a contact form which sends me email for every user registration My question is more related to parsing some text data into csv format. and I have received multiple users information in my mail box which I had copied into a text file. The data looks like below.
Name: testuser2
Email: [email protected]
Cluster Name: o b
Contact No.: 12346971239
Coming: Yes
Name: testuser3
Email: [email protected]
Cluster Name: Mediternea
Contact No.: 9121319107
Coming: Yes
Name: testuser4
Email: [email protected]
Cluster Name: Mediterranea
Contact No.: 7892174896
Coming: Yes
Name: tuser5
Email: [email protected]
Cluster Name: River Retreat A
Contact No.: 7583450912
Coming: Yes
Members Participating: 2
Name: Test User
Email: [email protected]
Cluster Name: RD
Contact No.: 09833123445
Coming: Yes
Members Participating: 2
As can see the data contains some common fields and some fields which are not present, I am looking for solution/suggestion on how I can parse this data so under the heading "Name", I will collect the name information under that column, and similarly for others. For the data with title "Members Participating" I can just pick the numbers and add it into Excel sheet under the same heading, in case this information is not present for the user, it can just be blank.
Upvotes: 0
Views: 3078
Reputation: 46789
The following will convert your input text file automatically to a CSV file. The headings are automatically generated based on the longest entry.
import csv, re
with open("input.txt", "r") as f_input, open("output.csv", "wb") as f_output:
csv_output = csv.writer(f_output)
entries = re.findall("^(Name: .*?)(?:\n\n|\Z)", f_input.read(), re.M+re.S)
# Determine the entry with the most fields for the CSV headers
headings = []
for entry in entries:
headings = max(headings, [line.split(":")[0] for line in entry.split("\n")], key=len)
csv_output.writerow(headings)
# Write the entries
for entry in entries:
csv_output.writerow([line.split(":")[1].strip() for line in entry.split("\n")])
This produces a CSV text file that can be opened in Excel as follows:
Name,Email,Cluster Name,Contact No.,Coming,Members Participating
testuser2,[email protected],o b,12346971239,Yes
testuser3,[email protected],Mediternea,9121319107,Yes
testuser4,[email protected],Mediterranea,7892174896,Yes
tuser5,[email protected],River Retreat A,7583450912,Yes,2
Test User,[email protected],RD,09833123445,Yes,2
Upvotes: 0
Reputation: 87134
You can use the the empty line between records to signify end of record. Then process the input file line-by-line and construct a list of dictionaries. Finally write the dictionaries out to a CSV file.
from csv import DictWriter
from collections import OrderedDict
with open('input') as infile:
registrations = []
fields = OrderedDict()
d = {}
for line in infile:
line = line.strip()
if line:
key, value = [s.strip() for s in line.split(':', 1)]
d[key] = value
fields[key] = None
else:
if d:
registrations.append(d)
d = {}
else:
if d: # handle EOF
registrations.append(d)
# fieldnames = ['Name', 'Email', 'Cluster Name', 'Contact No.', 'Coming', 'Members Participating']
fieldnames = fields.keys()
with open('registrations.csv', 'w') as outfile:
writer = DictWriter(outfile, fieldnames=fields)
writer.writeheader()
writer.writerows(registrations)
This code attempts to automate the collection of field names, and will use the same order as unique keys are first seen in the input. If you require a specific field order in the output you can nail it up by uncommenting the appropriate line.
Running this code on your sample input produces this:
Name,Email,Cluster Name,Contact No.,Coming,Members Participating testuser2,[email protected],o b,12346971239,Yes, testuser3,[email protected],Mediternea,9121319107,Yes, testuser4,[email protected],Mediterranea,7892174896,Yes, tuser5,[email protected],River Retreat A,7583450912,Yes,2 Test User,[email protected],RD,09833123445,Yes,2
Upvotes: 1
Reputation: 14669
Let's decompose the problem into smaller subproblems:
First, let's break the blocks of registration data into different elements:
DATA = '''
Name: testuser2
Email: [email protected]
Cluster Name: o b
Contact No.: 12346971239
Coming: Yes
Name: testuser3
Email: [email protected]
Cluster Name: Mediternea
Contact No.: 9121319107
Coming: Yes
'''
def parse_registrations(data):
data = data.strip()
return data.split('\n\n')
This function gives us a list of each registration:
>>> regs = parse_registrations(DATA)
>>> regs
['Name: testuser2\nEmail: [email protected]\nCluster Name: o b\nContact No.: 12346971239\nComing: Yes', 'Name: testuser3\nEmail: [email protected]\nCluster Name: Mediternea\nContact No.: 9121319107\nComing: Yes']
>>> regs[0]
'Name: testuser2\nEmail: [email protected]\nCluster Name: o b\nContact No.: 12346971239\nComing: Yes'
>>> regs[1]
'Name: testuser3\nEmail: [email protected]\nCluster Name: Mediternea\nContact No.: 9121319107\nComing: Yes'
Next, we can convert those substrings to a list of (key, value) pairs:
>>> [field.split(': ', 1) for field in regs[0].split('\n')]
[['Name', 'testuser2'], ['Email', '[email protected]'], ['Cluster Name', 'o b'], ['Contact No.', '12346971239'], ['Coming', 'Yes']]
The dict()
function can convert a list of (key, value) pairs into a dictionary:
>>> dict(field.split(': ', 1) for field in regs[0].split('\n'))
{'Coming': 'Yes', 'Cluster Name': 'o b', 'Name': 'testuser2', 'Contact No.': '12346971239', 'Email': '[email protected]'}
We can pass these dictionaries into a csv.DictWriter to write the records as CSV with defaults for any missing values.
>>> w = csv.DictWriter(open("/tmp/foo.csv", "w"), fieldnames=["Name", "Email", "Cluster Name", "Contact No.", "Coming", "Members Participating"])
>>> w.writeheader()
>>> w.writerow({'Name': 'Steve'})
12
Now, let's combine these all together!
import csv
DATA = '''
Name: testuser2
Email: [email protected]
Cluster Name: o b
Contact No.: 12346971239
Coming: Yes
Name: tuser5
Email: [email protected]
Cluster Name: River Retreat A
Contact No.: 7583450912
Coming: Yes
Members Participating: 2
'''
COLUMNS = ["Name", "Email", "Cluster Name", "Contact No.", "Coming", "Members Participating"]
def parse_registration(reg):
return dict(field.split(': ', 1) for field in reg.split('\n'))
def parse_registrations(data):
data = data.strip()
regs = data.split('\n\n')
return [parse_registration(r) for r in regs]
def write_csv(data, filename):
regs = parse_registrations(data)
with open(filename, 'w') as f:
writer = csv.DictWriter(f, fieldnames=COLUMNS)
writer.writeheader()
writer.writerows(regs)
if __name__ == '__main__':
write_csv(DATA, "/tmp/test.csv")
Output:
$ python3 write_csv.py
$ cat /tmp/test.csv
Name,Email,Cluster Name,Contact No.,Coming,Members Participating
testuser2,[email protected],o b,12346971239,Yes,
tuser5,[email protected],River Retreat A,7583450912,Yes,2
Upvotes: 2
Reputation: 168876
The program below might satisfy your requirement. The general strategy:
csv.DictWriter.writerows()
.
import sys
import pprint
import csv
# Usage:
# python cfg2csv.py input1.cfg input2.cfg ...
# The data is combined and written to 'output.csv'
def parse_file(data):
total_result = []
single_result = []
for line in data:
line = line.strip()
if line:
single_result.append([item.strip() for item in line.split(':', 1)])
else:
if single_result:
total_result.append(dict(single_result))
single_result = []
if single_result:
total_result.append(dict(single_result))
return total_result
def read_file(filename):
with open(filename) as fp:
return parse_file(fp)
# First parse the data:
data = sum((read_file(filename) for filename in sys.argv[1:]), [])
keys = set().union(*data)
# Next write the data to a CSV file
with open('output.csv', 'w') as fp:
writer = csv.DictWriter(fp, sorted(keys))
writer.writeheader()
writer.writerows(data)
Upvotes: 1