OmPS
OmPS

Reputation: 341

Parsing data from a text file

I have built a contact form which sends me email for every user registration My question is more related to parsing some text data into csv format. and I have received multiple users information in my mail box which I had copied into a text file. The data looks like below.

Name: testuser2
Email: [email protected]
Cluster Name: o  b
Contact No.: 12346971239
Coming: Yes

Name: testuser3
Email: [email protected]
Cluster Name: Mediternea
Contact No.: 9121319107
Coming: Yes

Name: testuser4
Email: [email protected]
Cluster Name: Mediterranea
Contact No.: 7892174896
Coming: Yes

Name: tuser5
Email: [email protected]
Cluster Name: River Retreat A
Contact No.: 7583450912
Coming: Yes
Members Participating: 2

Name: Test User
Email: [email protected]
Cluster Name: RD
Contact No.: 09833123445
Coming: Yes
Members Participating: 2

As can see the data contains some common fields and some fields which are not present, I am looking for solution/suggestion on how I can parse this data so under the heading "Name", I will collect the name information under that column, and similarly for others. For the data with title "Members Participating" I can just pick the numbers and add it into Excel sheet under the same heading, in case this information is not present for the user, it can just be blank.

Upvotes: 0

Views: 3078

Answers (4)

Martin Evans
Martin Evans

Reputation: 46789

The following will convert your input text file automatically to a CSV file. The headings are automatically generated based on the longest entry.

import csv, re

with open("input.txt", "r") as f_input, open("output.csv", "wb") as f_output:
    csv_output = csv.writer(f_output)
    entries = re.findall("^(Name: .*?)(?:\n\n|\Z)", f_input.read(), re.M+re.S)

    # Determine the entry with the most fields for the CSV headers
    headings = []
    for entry in entries:
        headings = max(headings, [line.split(":")[0] for line in entry.split("\n")], key=len)
    csv_output.writerow(headings)

    # Write the entries
    for entry in entries:
        csv_output.writerow([line.split(":")[1].strip() for line in entry.split("\n")])

This produces a CSV text file that can be opened in Excel as follows:

Name,Email,Cluster Name,Contact No.,Coming,Members Participating
testuser2,[email protected],o  b,12346971239,Yes
testuser3,[email protected],Mediternea,9121319107,Yes
testuser4,[email protected],Mediterranea,7892174896,Yes
tuser5,[email protected],River Retreat A,7583450912,Yes,2
Test User,[email protected],RD,09833123445,Yes,2

Upvotes: 0

mhawke
mhawke

Reputation: 87134

You can use the the empty line between records to signify end of record. Then process the input file line-by-line and construct a list of dictionaries. Finally write the dictionaries out to a CSV file.

from csv import DictWriter
from collections import OrderedDict

with open('input') as infile:
    registrations = []
    fields = OrderedDict()
    d = {}
    for line in infile:
        line = line.strip()
        if line:
            key, value = [s.strip() for s in line.split(':', 1)]
            d[key] = value
            fields[key] = None
        else:
            if d:
                registrations.append(d)
                d = {}
    else:
        if d:    # handle EOF
            registrations.append(d)


# fieldnames = ['Name', 'Email', 'Cluster Name', 'Contact No.', 'Coming', 'Members Participating']
fieldnames = fields.keys()

with open('registrations.csv', 'w') as outfile:
    writer = DictWriter(outfile, fieldnames=fields)
    writer.writeheader()
    writer.writerows(registrations)

This code attempts to automate the collection of field names, and will use the same order as unique keys are first seen in the input. If you require a specific field order in the output you can nail it up by uncommenting the appropriate line.

Running this code on your sample input produces this:

Name,Email,Cluster Name,Contact No.,Coming,Members Participating
testuser2,[email protected],o  b,12346971239,Yes,
testuser3,[email protected],Mediternea,9121319107,Yes,
testuser4,[email protected],Mediterranea,7892174896,Yes,
tuser5,[email protected],River Retreat A,7583450912,Yes,2
Test User,[email protected],RD,09833123445,Yes,2

Upvotes: 1

Steven Kryskalla
Steven Kryskalla

Reputation: 14669

Let's decompose the problem into smaller subproblems:

  1. Split the large block of text into separate registrations
  2. Convert each of those registrations to a dictionary
  3. Write the list of dictionaries to CSV

First, let's break the blocks of registration data into different elements:

DATA = '''
Name: testuser2
Email: [email protected]
Cluster Name: o  b
Contact No.: 12346971239
Coming: Yes

Name: testuser3
Email: [email protected]
Cluster Name: Mediternea
Contact No.: 9121319107
Coming: Yes
'''

def parse_registrations(data):
    data = data.strip()
    return data.split('\n\n')

This function gives us a list of each registration:

>>> regs = parse_registrations(DATA)
>>> regs
['Name: testuser2\nEmail: [email protected]\nCluster Name: o  b\nContact No.: 12346971239\nComing: Yes', 'Name: testuser3\nEmail: [email protected]\nCluster Name: Mediternea\nContact No.: 9121319107\nComing: Yes']
>>> regs[0]
'Name: testuser2\nEmail: [email protected]\nCluster Name: o  b\nContact No.: 12346971239\nComing: Yes'
>>> regs[1]
'Name: testuser3\nEmail: [email protected]\nCluster Name: Mediternea\nContact No.: 9121319107\nComing: Yes'

Next, we can convert those substrings to a list of (key, value) pairs:

>>> [field.split(': ', 1) for field in regs[0].split('\n')]
[['Name', 'testuser2'], ['Email', '[email protected]'], ['Cluster Name', 'o  b'], ['Contact No.', '12346971239'], ['Coming', 'Yes']]

The dict() function can convert a list of (key, value) pairs into a dictionary:

>>> dict(field.split(': ', 1) for field in regs[0].split('\n'))
{'Coming': 'Yes', 'Cluster Name': 'o  b', 'Name': 'testuser2', 'Contact No.': '12346971239', 'Email': '[email protected]'}

We can pass these dictionaries into a csv.DictWriter to write the records as CSV with defaults for any missing values.

>>> w = csv.DictWriter(open("/tmp/foo.csv", "w"), fieldnames=["Name", "Email", "Cluster Name", "Contact No.", "Coming", "Members Participating"])
>>> w.writeheader()
>>> w.writerow({'Name': 'Steve'})
12

Now, let's combine these all together!

import csv

DATA = '''
Name: testuser2
Email: [email protected]
Cluster Name: o  b
Contact No.: 12346971239
Coming: Yes

Name: tuser5
Email: [email protected]
Cluster Name: River Retreat A
Contact No.: 7583450912
Coming: Yes
Members Participating: 2
'''

COLUMNS = ["Name", "Email", "Cluster Name", "Contact No.", "Coming", "Members Participating"]

def parse_registration(reg):
    return dict(field.split(': ', 1) for field in reg.split('\n'))

def parse_registrations(data):
    data = data.strip()
    regs = data.split('\n\n')
    return [parse_registration(r) for r in regs]

def write_csv(data, filename):
    regs = parse_registrations(data)
    with open(filename, 'w') as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(regs)

if __name__ == '__main__':
    write_csv(DATA, "/tmp/test.csv")

Output:

$ python3 write_csv.py

$ cat /tmp/test.csv
Name,Email,Cluster Name,Contact No.,Coming,Members Participating
testuser2,[email protected],o  b,12346971239,Yes,
tuser5,[email protected],River Retreat A,7583450912,Yes,2

Upvotes: 2

Robᵩ
Robᵩ

Reputation: 168876

The program below might satisfy your requirement. The general strategy:

  • First read in all of the email files, parsing the data "by hand", and
  • Second write the data to a CSV file, using csv.DictWriter.writerows().

 

import sys
import pprint
import csv

# Usage:
# python cfg2csv.py input1.cfg input2.cfg ...
# The data is combined and written to 'output.csv'

def parse_file(data):
    total_result = []
    single_result = []
    for line in data:
        line = line.strip()
        if line:
            single_result.append([item.strip() for item in line.split(':', 1)])
        else:
            if single_result:
                total_result.append(dict(single_result))
            single_result = []
    if single_result:
        total_result.append(dict(single_result))
    return total_result

def read_file(filename):
    with open(filename) as fp:
        return parse_file(fp)

# First parse the data:
data = sum((read_file(filename) for filename in sys.argv[1:]), [])
keys = set().union(*data)

# Next write the data to a CSV file
with open('output.csv', 'w') as fp:
    writer = csv.DictWriter(fp, sorted(keys))
    writer.writeheader()
    writer.writerows(data)

Upvotes: 1

Related Questions