Reputation: 111

Parse text from a .txt file using csv module

I receive an email every day whose format is always the same; only some of the data changes. I wrote a VBA macro that exports the email to a text file. Now that it is a text file, I want to parse the data so that I get only the new information.

The format of the email is like this:

> Irrelevant data  
> Irrelevant data
> Type:     YUK
> Status:   OPEN
> Date:     6/22/2015
> Description: ----
> 
> Description blah blah blah
> Thank you

I want to capture only the relevant data. In this case, I would like to capture only YUK, OPEN, 6/22/2015, and the description text "Description blah blah blah". I tried using the csv module to go line by line and print out the lines, but I can't find a way to parse that information.

This is what I have so far. It only prints out the lines, though.

import os
import glob
import csv

path = "emailtxt/"
glob = max(glob.iglob(path + '*.txt'), key=os.path.getctime)#most recent file located in the emailtxt
newestFile = os.path.basename(glob)#removes the emailtxt\ from its name

f = open(path+newestFile)

read = csv.reader(f)
for row in read:
    print row

f.close()

How would I parse through the text file?

Upvotes: 1

Answers (4)

jhoepken

Reputation: 1858

I don't think the csv module is the right tool here. If you are just going for a simple search, use string comparisons and split the lines on characteristic characters. If it's more sophisticated, go for regular expressions.

with open("email.txt") as f:
    for line in f:
        # Strip the "> " quote marker, then split "Label: value" at the first colon
        s = line.replace("> ", "").split(":", 1)
        if len(s) > 1:
            print(s[1].strip())
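
If you want the values keyed by their label rather than just printed, here is a minimal sketch of the same idea. It assumes the exported file looks like the sample in the question; `SAMPLE`, `WANTED` and `parse_fields` are names made up for illustration. It uses `str.partition`, which splits at the first colon only, so a value that itself contains a colon stays intact.

```python
# Stand-in for the exported email text (assumed to match the question's format).
SAMPLE = """Irrelevant data
Irrelevant data
Type:     YUK
Status:   OPEN
Date:     6/22/2015
Description: ----

Description blah blah blah
Thank you
"""

WANTED = ("Type", "Status", "Date")  # hypothetical: the labels worth keeping


def parse_fields(text):
    """Return a dict of the wanted 'Label: value' pairs found in text."""
    fields = {}
    for line in text.splitlines():
        # Drop a leading "> " quote marker if the export kept it.
        line = line.lstrip("> ")
        label, sep, value = line.partition(":")
        # sep is empty when the line has no colon at all.
        if sep and label.strip() in WANTED:
            fields[label.strip()] = value.strip()
    return fields


print(parse_fields(SAMPLE))
```

The "Description:" line is left out on purpose, because in the sample the real description sits a couple of lines further down and needs its own handling.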

Upvotes: 1

Kavin Eswaramoorthy

Reputation: 1625

How about using regular expressions?

import re


def get_info(string_to_search):
    res_dict = {}

    # Each pattern matches the label, optional whitespace, then the value
    find_type = re.compile(r"Type:\s*\w*")
    res = find_type.search(string_to_search)
    res_dict["Type"] = res.group(0).split(":")[1].strip()

    find_status = re.compile(r"Status:\s*\w*")
    res = find_status.search(string_to_search)
    res_dict["Status"] = res.group(0).split(":")[1].strip()

    find_date = re.compile(r"Date:\s*[/0-9]*")
    res = find_date.search(string_to_search)
    res_dict["Date"] = res.group(0).split(":")[1].strip()

    # Everything after "Description:", minus the sign-off
    res_dict["description"] = string_to_search.split("Description:")[1].replace("Thank you", "")
    return res_dict


search_string = """> Irrelevant data
> Irrelevant data
> Type:     YUK
> Status:   OPEN
> Date:     6/22/2015
> Description: ----
>
> Description blah blah blah
> Thank you
"""
info = get_info(search_string)

print(info)
print(info["Type"])
print(info["Status"])
print(info["Date"])
print(info["description"])

Output:

{'Status': 'OPEN', 'Date': '6/22/2015', 'Type': 'YUK', 'description': ' ----\n>\n> Description blah blah blah\n> \n'}
YUK
OPEN
6/22/2015
 ----
>
> Description blah blah blah
>
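
As the output shows, the description still carries the "----" placeholder and the ">" quote markers. One possible way to tidy it up is sketched below; `clean_description` is a hypothetical helper, and it assumes that any line made up only of dashes, quote markers or whitespace is noise.

```python
def clean_description(raw):
    """Strip quote markers and drop lines that are blank or only dashes."""
    kept = []
    for line in raw.splitlines():
        # Remove the leading "> " marker and surrounding whitespace.
        text = line.lstrip("> ").strip()
        # After removing dashes, noise lines such as "----" become empty.
        if not text.strip("-"):
            continue
        kept.append(text)
    return "\n".join(kept)


raw = ' ----\n>\n> Description blah blah blah\n> \n'
print(clean_description(raw))
```

For the description value shown above this leaves just the line "Description blah blah blah".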

Upvotes: 1

sammysignal

Reputation: 11

If you can print out the rows individually, parsing them is a matter of breaking apart the rows, which are represented as strings. Assuming there is a colon, or some space, after each item descriptor, you can split there and keep whatever comes after it. See Python's common string operations for ways to split the row at useful points.

In terms of actually parsing the data, you could use a series of if statements to catch each status or type. For the date, try datetime.datetime.strptime to convert the string to a datetime object (time.strptime returns a struct_time instead). All you have to do is match the format of the date, which in your case seems to be "%m/%d/%Y" (capital %Y, since the year has four digits).
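
A minimal sketch of the date step, assuming the line looks like the "Date:" line in the question; `parse_date` is a name made up for illustration. It uses `datetime.strptime` with a four-digit-year format, since the sample date is 6/22/2015.

```python
from datetime import datetime


def parse_date(line):
    """Turn a 'Date:     6/22/2015' line into a datetime object."""
    # Drop a leading "> " quote marker, then split at the first colon.
    label, sep, value = line.lstrip("> ").partition(":")
    if not sep or label.strip() != "Date":
        raise ValueError("not a Date line: %r" % line)
    # %Y is the four-digit year; %m and %d accept 6 as well as 06.
    return datetime.strptime(value.strip(), "%m/%d/%Y")


print(parse_date("Date:     6/22/2015"))
```

strptime raises a ValueError if the text does not match the format, which is a convenient signal that the email layout has changed.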

Upvotes: 1

Iron Fist

Reputation: 10951

I don't think you need the csv module here at all; regular file I/O will do what you want. That is, read the file line by line, extract the data you need from each line, and store it in a list, for example:

import os
import glob

path = "emailtxt/"
newest = max(glob.iglob(path + '*.txt'), key=os.path.getctime)  # most recent file in emailtxt
newestFile = os.path.basename(newest)        # removes the emailtxt/ prefix from its name

capture_list = []                            # list to hold captured words
with open(path + newestFile, 'r') as f:      # open the file for reading
    for line in f:                           # go line by line
        words = line.split()
        if words:                            # skip blank lines
            capture_list.append(words[-1])   # add the last word to the list

Upvotes: 1
