Femto Trader

Reputation: 2014

Insert a NumPy rec.array to MongoDB using PyMongo

In another question (Insert a Pandas Dataframe into mongodb using PyMongo), some people are trying to insert a Pandas DataFrame into MongoDB by first converting it to built-in Python structures (dict, list).

I wonder whether we could instead insert a NumPy record array (numpy.recarray) into MongoDB using PyMongo.

That should probably be more efficient, because pandas.DataFrame.to_dict uses for loops, which take a very long time on large volumes of data.

see https://github.com/pydata/pandas/blob/c45dc762655d7109362fecea05584c72351fdc83/pandas/core/frame.py#L854

In [1]: import pandas as pd
In [2]: import pymongo
In [3]: client = pymongo.MongoClient()
In [4]: collection = client['db_name']['collection_name']
In [5]: df = pd.DataFrame([[1,2,3],[4,5,6]], columns=['a', 'b', 'c'])
In [6]: df
Out[6]:
   a  b  c
0  1  2  3
1  4  5  6
In [7]: rec = df.to_records()
In [8]: rec
Out[8]:
rec.array([(0, 1, 2, 3), (1, 4, 5, 6)],
          dtype=[('index', '<i8'), ('a', '<i8'), ('b', '<i8'), ('c', '<i8')])
In [9]: type(rec)
Out[9]: numpy.recarray

But I ran into errors when inserting:

In [10]: collection.insert(rec)

raised

ValueError: no field of name _id

this

In [11]: collection.insert_many(rec)

raised

TypeError: documents must be a non-empty list

this

In [12]: collection.insert_one(rec)

raised

TypeError: document must be an instance of dict, bson.son.SON, or other type that inherits from collections.MutableMapping

Any ideas?

Upvotes: 1

Views: 1636

Answers (1)

MRocklin

Reputation: 57271

Odo can do this:

In [1]: import pandas as pd
In [2]: import pymongo
In [3]: client = pymongo.MongoClient()
In [4]: collection = client['db_name']['collection_name']

In [5]: df = pd.DataFrame([[1,2,3],[4,5,6]], columns=['a', 'b', 'c'])
In [6]: rec = df.to_records(index=False)

In [7]: from odo import odo
In [8]: odo(rec, collection)  # migrate recarray into collection
Out[8]: Collection(Database(MongoClient('localhost', 27017), 'db_name'), 'collection_name')

In [9]: list(collection.find())
Out[9]: 
[{'_id': ObjectId('56801e0bfb5d1b19ff9b9dd3'), 'a': 1, 'b': 2, 'c': 3},
 {'_id': ObjectId('56801e0bfb5d1b19ff9b9dd4'), 'a': 4, 'b': 5, 'c': 6}]

However, it just goes through an iterator of dictionaries (and so is as inefficient as the other solutions in this regard). If you really want to send binary data over efficiently, you should look at Monary.
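For illustration, here is a minimal sketch of what "an iterator of dictionaries" can look like for a record array. This is not Odo's actual implementation; the helper name recarray_to_documents is hypothetical, and the sample array mirrors the one in the question. The insert call is left as a comment because it needs a running MongoDB server.

```python
import numpy as np


def recarray_to_documents(rec):
    """Yield one plain dict per row of a NumPy record array.

    Hypothetical helper for illustration, not part of Odo or PyMongo.
    """
    names = rec.dtype.names
    # tolist() turns each row into a tuple of native Python values
    # (int, float, ...) instead of NumPy scalars such as numpy.int64.
    for row in rec.tolist():
        yield dict(zip(names, row))


rec = np.rec.array(
    [(1, 2, 3), (4, 5, 6)],
    dtype=[('a', '<i8'), ('b', '<i8'), ('c', '<i8')],
)

# insert_many wants a list of dict-like documents, so materialize the rows.
documents = list(recarray_to_documents(rec))
print(documents)

# With a live server and the collection object from the question:
# collection.insert_many(documents)
```

Going through tolist() first matters because the values need to be plain Python types by the time they reach the BSON encoder; the per-row loop itself is still ordinary Python.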

But for loops aren't necessarily the bottleneck here. I highly recommend doing some simple benchmarking to verify that converting to Python data structures here is the bottleneck of your application. You may be optimizing prematurely.
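One rough way to run such a benchmark is to time the conversion step on its own with timeit. The sketch below is only an assumption about how you might set it up: the row count, the column names, and the to_documents helper are all made up for the example, and the database side is left as a comment because it needs a running server.

```python
import timeit

import numpy as np

n_rows = 100_000  # hypothetical size; use something close to your real data

# Build a three-column record array of integers as stand-in data.
rec = np.rec.fromarrays(
    [np.arange(n_rows), np.arange(n_rows), np.arange(n_rows)],
    names='a,b,c',
)


def to_documents(rec):
    """Convert a record array into a list of plain dicts (illustrative helper)."""
    names = rec.dtype.names
    return [dict(zip(names, row)) for row in rec.tolist()]


# Best of three single runs, in seconds, for the conversion alone.
conversion_seconds = min(
    timeit.repeat(lambda: to_documents(rec), number=1, repeat=3)
)
print(f"conversion of {n_rows} rows: {conversion_seconds:.3f} s")

# To compare, time the insert the same way against a live server, e.g.:
# documents = to_documents(rec)
# timeit.repeat(lambda: collection.insert_many(documents), number=1, repeat=1)
```

If the conversion time turns out to be small next to the insert time, the Python loop is not the part worth optimizing.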

Upvotes: 3
