Dominik Neise

Reputation: 1249

insert fields of numpy structured array into mongodb

I'm currently investigating if it is possible to use structured numpy arrays more or less directly as documents for mongodb insert operations.

In all examples I have found

db.collection.insert(doc)

Here doc is always a Python dict, but I wonder whether any object that provides the mapping interface could be used for insert operations.

I was thinking of subclassing np.ndarray using DictMixin or MutableMapping, so that it really provides a dict interface, and then doing something like this:

import numpy as np

structured_array = np.zeros((5,), dtype=[('i', '<i4'), ('f', '<f4')])
structured_array['i'] = np.random.randint(42, size=5)
structured_array['f'] = np.random.rand(5)

for row in structured_array:
    # row is of type: np.void
    # so in order to let pymongo insert it into the DB, I create a 
    # view of row, which provides the dict-like interface
    row_dict_like = row.view(np_array_subclass_providing_dict_interface)
    db.collection.insert(row_dict_like)

Since I am a complete beginner, have never subclassed np.ndarray, and fear I might sink many hours into this only to learn later that the whole approach was not very smart, my question is: Do you see major problems with this approach? Is it Pythonic? Is my assumption that any class providing the mapping interface can be used for mongodb insert operations correct at all?

Upvotes: 2

Views: 367

Answers (1)

shx2

Reputation: 64298

No doubt your question deserves a "pure" python/numpy-only answer, which I'm sure others will provide. But:

I'd like to point out that in many of the cases where you find numpy's interface cumbersome and/or unintuitive, using pandas can make your life easier.

In your example, one way to leverage pandas is to create a DataFrame, and iterate over its rows using iterrows(). Each row is a (more or less) dict-like object.

import pandas as pd

df = pd.DataFrame.from_records(structured_array)
for i, row in df.iterrows():
    # row is a pandas Series; items() yields (field, value) pairs
    print(list(row.items()))
[('i', 14.0), ('f', 0.099248834)]
[('i', 31.0), ('f', 0.69031882)]
[('i', 32.0), ('f', 0.85714084)]
[('i', 14.0), ('f', 0.64561093)]
[('i', 8.0), ('f', 0.18835814)]

for i, row in df.iterrows():
    # a Series can be turned into a plain dict directly
    print(dict(row))
{'i': 14.0, 'f': 0.099248834}
{'i': 31.0, 'f': 0.69031882}
{'i': 32.0, 'f': 0.85714084}
{'i': 14.0, 'f': 0.64561093}
{'i': 8.0, 'f': 0.18835814}
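
If all you need is a list of plain dicts, a shorter variant might be to let pandas build them in one call with to_dict. This is only a minimal sketch: it assumes the same dtype as in the question, and the field values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with the same dtype as in the question.
structured_array = np.zeros((3,), dtype=[('i', '<i4'), ('f', '<f4')])
structured_array['i'] = [14, 31, 8]
structured_array['f'] = [0.25, 0.5, 0.75]

df = pd.DataFrame.from_records(structured_array)

# One dict per row, keyed by field name.
docs = df.to_dict(orient='records')
print(docs)
```

Note that iterrows() funnels each row through a single Series, which is why the integer field i shows up as a float in the output above. In recent pandas versions to_dict should avoid that upcast, but whether the values come back as native Python numbers or NumPy scalars depends on the pandas version, so it is worth checking before handing them to pymongo.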

However, you might want to consider refactoring your code to work with DataFrames to begin with, which are far more intuitive than recarrays.

Of course, this requires that you install pandas, which is highly recommended in general.
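
For completeness, here is a rough pandas-free sketch of the same conversion. It assumes that what pymongo ultimately needs is a plain dict of native Python values (as far as I know, BSON encoding does not accept NumPy scalar types such as np.int32). The helper name row_to_doc is made up for illustration, and the insert call is commented out because it needs a live database.

```python
import numpy as np

def row_to_doc(row):
    """Convert one structured-array row (np.void) into a plain dict."""
    # row.item() returns a tuple of native Python scalars, one per field
    return dict(zip(row.dtype.names, row.item()))

structured_array = np.zeros((5,), dtype=[('i', '<i4'), ('f', '<f4')])
structured_array['i'] = np.random.randint(42, size=5)
structured_array['f'] = np.random.rand(5)

docs = [row_to_doc(row) for row in structured_array]
print(docs[0])

# for doc in docs:
#     db.collection.insert(doc)  # `db` as in the question
```

This sidesteps subclassing np.ndarray entirely: each row becomes an ordinary dict before it gets anywhere near the database.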

Upvotes: 1
