Jay Gattuso

Reputation: 4130

Python - splitting up many small txt files into a data structure to feed into MySQL

I have a few hundred thousand txt files that are a pretty standard shape (they all have common elements: ID, Date, To, From, Subject, Body).

These are not in a structured format such as a multipart email message.

I want to strip these into their constituent parts and feed the whole lot into a db. There are lots of them, so I want to make sure the approach will work.

There are a number of key issues I am pondering (and I am no coder - this is learning / hobby stuff).

1) Is there a structured data type I can use to keep the bits together in a sensible way? I was thinking it would be logical to have a file.ID, file.Date, etc. type of deal that holds the whole file in a structured way so it can later be ingested into the db. Is this Pythonic, or a hangover from my tinkering with Matlab?

2) The body section can be several KB or a single sentence. (1) Is this better as a blob? I would lose the searching, which is kind of the point of doing this. And (2) how do I make sure I can construct a field large enough in my MySQL database come ingest time? I won't know the longest size of each element unless I run some kind of counter in the message splitter that tracks the max value seen per message.

3) I figure I would start with a walk, get the file list from the walk, then pull each file line by line. I'll use line position to infer some known locations (ID, Date) and then some regex or feature-based patterns to split the rest. Once I have split the files up, I plan to ingest them. However, I wonder if it would be more logical to connect to the db at the end of each message and dump the parts into their own record one by one.
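Something like this is what I have in mind (an untested sketch; the line positions and the header pattern are just placeholder guesses until I know the real layout):

    import os
    import re

    HEADER = re.compile(r'^(To|From|Subject):\s*(.*)$')

    def parse_message(path):
        """Split one txt file into its parts (guessing at the layout)."""
        with open(path) as f:
            lines = f.read().splitlines()
        msg = {'id': lines[0],    # assuming ID is always the first line
               'date': lines[1]}  # assuming Date is always the second line
        body_start = 2
        for i, line in enumerate(lines[2:], start=2):
            m = HEADER.match(line)
            if not m:
                break
            msg[m.group(1).lower()] = m.group(2)
            body_start = i + 1
        msg['body'] = '\n'.join(lines[body_start:])
        return msg

    for root, dirs, files in os.walk('/path/to/messages'):
        for name in files:
            if name.endswith('.txt'):
                record = parse_message(os.path.join(root, name))
                # ... hand record off to the db step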

Time is no drama; it can run for a week for all it matters. I have about 8 GB of RAM on an i7, so again I'm not particularly resource-constrained, and I'm happy to let it grind its way through.

Does this sound logical? Have I missed a core step?

Thanks.

Upvotes: 2

Views: 282

Answers (1)

MartinStettner

Reputation: 29174

Ad 1)

I think the most 'pythonic' way to store this structured data would be to use a dict. The other solution would be to declare a class, but since you don't plan to do further processing (i.e. you wouldn't need any methods for your datatype), you should stick with the simplest possible solution (imo).

Just use

data = {}
# Parse file
data['id'] = id_value
data['date'] = date_value
# ... and so on

to store the data from each file in the dict.
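If you do want the file.ID / file.Date attribute style from the question, a collections.namedtuple is a lightweight middle ground between a dict and a full class (a sketch; the field values here are made up):

    from collections import namedtuple

    # 'from' is a Python keyword, hence the trailing underscore
    Message = namedtuple('Message',
                         ['id', 'date', 'to', 'from_', 'subject', 'body'])

    msg = Message('42', '2011-06-03', 'a@example.com', 'b@example.com',
                  'Hello', 'Body text ...')
    print(msg.id, msg.date)  # attribute access, like file.ID in the question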

Ad 2)

On the Python side you can just use strings (i.e. str, or unicode if you're on Python < 3.0). Strings in Python have no size limit (besides your architecture's memory limit, but on a 64-bit machine that's not really a problem ...).

On the MySQL side, I would use TEXT as the datatype for the body section. You could also use a VARCHAR, but you would need to give a maximal length.
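For example (a sketch using the mysql-connector-python driver; the table and column names are made up, and note that TEXT tops out at 65,535 bytes, so switch to MEDIUMTEXT if bodies can exceed that):

    import mysql.connector  # assuming the mysql-connector-python driver

    conn = mysql.connector.connect(user='user', password='secret',
                                   host='localhost', database='mail')
    cur = conn.cursor()

    # TEXT holds up to 65,535 bytes; MEDIUMTEXT holds up to 16 MB
    cur.execute("""
        CREATE TABLE IF NOT EXISTS message (
            id        VARCHAR(64)  PRIMARY KEY,
            date      VARCHAR(64),
            to_addr   VARCHAR(255),
            from_addr VARCHAR(255),
            subject   VARCHAR(255),
            body      TEXT
        )
    """)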

Ad 3)

I would recommend processing each file independently, i.e. parsing it and writing it to the db immediately afterwards. Imo there is no reason not to do so. There's no need to fill memory with all the data (or to risk a crash just before the last file is read, without anything having been written to the db). I would probably use some mechanism to mark processed files (move them to another tree, rename them): if I need to restart the program for some reason, this prevents processing the same file twice.
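A minimal sketch of that loop, assuming the message table and the cur/conn connection from the sketch above, a parse_message() splitter like the one sketched in the question, and a made-up "done" directory:

    import os
    import shutil

    SRC_DIR = '/path/to/messages'
    DONE_DIR = '/path/to/processed'  # hypothetical "done" tree

    for root, dirs, files in os.walk(SRC_DIR):
        for name in files:
            path = os.path.join(root, name)
            data = parse_message(path)  # hypothetical splitter, as above
            # Parameterized insert -- never build SQL by string concatenation
            cur.execute(
                "INSERT INTO message (id, date, to_addr, from_addr,"
                " subject, body) VALUES (%s, %s, %s, %s, %s, %s)",
                (data['id'], data['date'], data['to'], data['from'],
                 data['subject'], data['body']))
            conn.commit()  # commit per file: a crash costs at most one message
            shutil.move(path, os.path.join(DONE_DIR, name))  # mark as done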

Upvotes: 1
