user1097214

Reputation:

Python - creating a dictionary for each key in a given file

I'm working on a text-parsing algorithm (open-source side project), and I'd appreciate any advice.

I have a tab-delimited txt file which is sorted by the first column (sample dataset below). Duplicate entries exist within this column. Ultimately, I would like to use a hash that points to all values which share the same key (first-column value). When a new key comes along, the contents of the hash are to be serialized, saved, etc., and then cleared so that the new key can populate it. As a result, my goal is to have only one key present in the hash at any time. Therefore, if I have N unique keys, I wish to build N hashes, each pointing to its respective entries. The datasets, though, are GBs in size, so holding everything in memory at once won't work, hence my reasoning for creating one hash per key and processing each individually.

SAMPLE DATASET

A    ...    23.4421
A    ...   -23.442
A    ...    76.2224
B    ...    32.1232
B    ...    -23.001
C    ...    652.123
...

So in the above dataset snippet, I wish to have a hash for 'A' (pointing to its 3 respective entries). When 'B' is read, serialize the 'A' hash and clear its contents. Repeat for 'B', and so on until the end of the dataset.

My pseudocode is as follows:

groups = {}  # the in-memory hash, holding entries for the current key only

for line in dataset:  # dataset is the open tab-delimited file
    fields = line.rstrip("\n").split("\t")
    key, values = fields[0], fields[1:]
    if key not in groups:
        if groups:  # a new key was read while a different key is present
            serialize(groups)  # save the finished group, then clear it
            groups.clear()
        groups[key] = [values]  # first entry for this key
    else:
        groups[key].append(values)  # key already there, so append to it

if groups:
    serialize(groups)  # the last group still needs to be saved
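(For concreteness, serialize above could be something as simple as pickling each finished group out to its own file; the naming scheme below is just a placeholder, not something I'm committed to.)

import pickle

def serialize(groups):
    # groups holds exactly one key at this point; write it out
    # before the hash is cleared for the next key
    for key, values in groups.items():
        with open("group_%s.pkl" % key, "wb") as out:
            pickle.dump(values, out)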

If you have any suggestions on how to implement the above algorithm efficiently, I'd be very appreciative. Also, if my hash-based approach is not efficient, or if you can think of improvements, I'd be very thankful. My goal is ultimately to keep only one in-memory hash at a time, until a new key comes along.

Thank you,

p.

Upvotes: 0

Views: 425

Answers (3)

PaulMcG

Reputation: 63782

Use itertools.groupby, passing it the file as an iterator:

from itertools import groupby
from cStringIO import StringIO

sourcedata = StringIO("""\
A    ...    23.4421
A    ...   -23.442
A    ...    76.2224
B    ...    32.1232
B    ...    -23.001
C    ...    652.123""")

# or sourcedata = open("zillion_gigabyte_file.dat")

for key,lines in groupby(sourcedata, key=lambda s:s.split()[0]):
    accum = [float(s.split()[2]) for s in lines]
    print key, accum

groupby is very smart and efficient, keeping very little data in memory at a time, and keeping things purely in the form of iterators until the last possible moment. What you describe with hashes, and keeping only one in memory at a time, and all that, is already done for you by groupby.
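For example, if each finished group needs to be written out before the next one is read, you could serialize it as it comes off the iterator. A minimal sketch (the per-key filename pattern and the choice of pickle here are just assumptions; any per-key sink would do):

import pickle
from itertools import groupby

with open("zillion_gigabyte_file.dat") as sourcedata:
    for key, lines in groupby(sourcedata, key=lambda s: s.split()[0]):
        accum = [float(s.split()[2]) for s in lines]
        # hypothetical layout: one pickle file per key
        with open("group_%s.pkl" % key, "wb") as out:
            pickle.dump(accum, out)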

Upvotes: 1

holdenweb

Reputation: 37163

You don't make it explicit whether the sorted data you give is typical or whether the same key can be interspersed with other keys throughout the file, and that does make a fundamental difference to the algorithm. I deduce from your sample code that they will appear in arbitrary order.

Neither do you say to what use you intend to put the data so extracted. This can matter a lot: there are many different ways to store data, and the application can be a crucial factor in determining access times. So you may want to consider various different storage types. Without knowing how you propose to use the resulting structure, the following suggestion may be inappropriate.

Since the data are floating-point numbers, you may want to consider using the shelve module to maintain simple lists of the floating-point numbers keyed against the alphamerics. This has the advantage that all pickling and unpickling to/from external storage is handled automatically. If you need an increase in speed, consider using a more efficient pickling protocol (one of the unused arguments to shelve.open()).

# Transform the data:
# note it's actually more efficient to process line-by-line
# as you read it from a file - obviously it's best to try
# to avoid reading the whole data set into memory at once.
data = """\
A    ...    23.4421
A    ...   -23.442
A    ...    76.2224
B    ...    32.1232
B    ...    -23.001
C    ...    652.123"""

data = [(k, float(v))
          for (k, _, v) in
          [_.split() for _ in data.splitlines()]]

# create a shelve
import shelve
shelf = shelve.open("myshelf", "c")

# process data
for (k, v) in data:
    if k in shelf:
        # see note below for rationale
        tmp = shelf[k]
        tmp.append(v)
        shelf[k] = tmp
    else:
        shelf[k] = [v]

# verify results
for k in shelf.keys():
    print k, shelf[k]

You may be wondering why I didn't just use shelf[k].append(v) in the case where a key has already been seen. This is because it's only the operation of key assignment that triggers detection of the value change. You can read the shelve module docs for more detail, and to learn how to use the binary pickle format.

Note also that this program re-creates the shelf each time it is run due to the "c" argument to shelve.open().
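Two shelve options are worth knowing about here (a sketch, reusing the same "myshelf" file as above): the protocol argument mentioned earlier selects the binary pickle format, and writeback=True makes in-place mutation such as shelf[k].append(v) work, at the cost of shelve caching every entry you touch until sync() or close().

import shelve

# binary pickle protocol; more compact and faster than the default
shelf = shelve.open("myshelf", "c", protocol=2)
shelf.close()

# writeback=True detects mutations made through shelf[k] directly,
# but caches all accessed entries in memory until sync()/close()
shelf = shelve.open("myshelf", "c", writeback=True)
shelf["A"] = []
shelf["A"].append(23.4421)   # works; no temporary list needed
shelf.close()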

Upvotes: 0

dstromberg

Reputation: 7187

You could open an anydbm (2.x) or dbm (3.x) for each key in your first column, named by the value of the column. This is pretty trivial - I'm not sure what the question is.

You could also use something like my cachedb module, to let it figure out whether something is "new" or not: http://stromberg.dnsalias.org/~strombrg/cachedb.html I've used it in two projects, both with good results.

Either way, you could probably make your values just lists of ASCII floats separated by newlines or nulls or something.
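A minimal sketch of that last idea, reading it as one dbm file with the first-column values as keys and newline-separated float strings as values (the database name and helper function are made up for illustration):

import anydbm  # on Python 3, use the dbm module instead

db = anydbm.open("groups_db", "c")  # hypothetical database name

def add_value(db, key, value):
    # append a float to the newline-separated string stored under key
    text = str(value)
    try:
        db[key] = db[key] + "\n" + text
    except KeyError:
        db[key] = text

add_value(db, "A", 23.4421)
add_value(db, "A", -23.442)
print(db["A"].split("\n"))  # ['23.4421', '-23.442']
db.close()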

Upvotes: 0
