Reputation: 2014
In another question, some people are trying to insert a Pandas DataFrame into MongoDB using Python built-in structures (dict, list):
Insert a Pandas Dataframe into mongodb using PyMongo
I wonder whether we could instead insert a NumPy rec.array (numpy.recarray) into MongoDB using PyMongo.
That should probably be more efficient, because pandas.DataFrame.to_dict uses for loops, which take a very long time on large volumes of data.
In [1]: import pandas as pd
In [2]: import pymongo
In [3]: client = pymongo.MongoClient()
In [4]: collection = client['db_name']['collection_name']
In [5]: df = pd.DataFrame([[1,2,3],[4,5,6]], columns=['a', 'b', 'c'])
In [6]: df
Out[6]:
   a  b  c
0  1  2  3
1  4  5  6
In [7]: rec = df.to_records()
In [8]: rec
Out[8]:
rec.array([(0, 1, 2, 3), (1, 4, 5, 6)],
dtype=[('index', '<i8'), ('a', '<i8'), ('b', '<i8'), ('c', '<i8')])
In [9]: type(rec)
Out[9]: numpy.recarray
However, every insert attempt failed. This:
In [10]: collection.insert(rec)
raised
ValueError: no field of name _id
This:
In [11]: collection.insert_many(rec)
raised
TypeError: documents must be a non-empty list
And this:
In [12]: collection.insert_one(rec)
raised
TypeError: document must be an instance of dict, bson.son.SON, or other type that inherits from collections.MutableMapping
Any ideas?
Upvotes: 1
Views: 1636
Reputation: 57271
Odo can do this:
In [1]: import pandas as pd
In [2]: import pymongo
In [3]: client = pymongo.MongoClient()
In [4]: collection = client['db_name']['collection_name']
In [5]: df = pd.DataFrame([[1,2,3],[4,5,6]], columns=['a', 'b', 'c'])
In [6]: rec = df.to_records(index=False)
In [7]: from odo import odo
In [8]: odo(rec, collection) # migrate recarray into collection
Out[8]: Collection(Database(MongoClient('localhost', 27017), 'db_name'), 'collection_name')
In [9]: list(collection.find())
Out[9]:
[{'_id': ObjectId('56801e0bfb5d1b19ff9b9dd3'), 'a': 1, 'b': 2, 'c': 3},
{'_id': ObjectId('56801e0bfb5d1b19ff9b9dd4'), 'a': 4, 'b': 5, 'c': 6}]
However, it just goes through an iterator of dictionaries, so it is as inefficient as the other solutions in this regard. If you really want to send binary data over efficiently, you should look at monary.
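For reference, here is a minimal sketch of what "going through an iterator of dictionaries" looks like if you do it by hand, without Odo. The helper names `records_to_documents` and `insert_records` are made up for illustration, and the `collection` argument is assumed to be a PyMongo collection like the one created above. The record array is built directly with NumPy so the snippet runs without a MongoDB server.

```python
import numpy as np


def records_to_documents(rec):
    """Turn a NumPy record array into a list of plain dicts (one per row)."""
    names = rec.dtype.names
    # tolist() converts each row to a tuple of native Python scalars
    # instead of NumPy scalar types.
    return [dict(zip(names, row)) for row in rec.tolist()]


def insert_records(collection, rec):
    """Hypothetical helper: insert every row of `rec` into a PyMongo collection."""
    return collection.insert_many(records_to_documents(rec))


# Same data as in the question, built directly as a record array.
rec = np.array(
    [(1, 2, 3), (4, 5, 6)],
    dtype=[("a", "<i8"), ("b", "<i8"), ("c", "<i8")],
).view(np.recarray)

docs = records_to_documents(rec)
print(docs)
```

With a running server, `insert_records(collection, rec)` would pass that list of dicts to `insert_many`, which is the dict-like document form the error messages in the question ask for.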
But for loops aren't necessarily the bottleneck here. I highly recommend doing some simple benchmarking to verify that converting to Python data structures here is the bottleneck of your application. You may be optimizing prematurely.
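A rough way to check this is to time the conversion step on its own with `timeit`. The sketch below uses made-up test data (10,000 rows) and two hypothetical helper functions, `via_to_dict` and `via_records`. It measures only the conversion to Python dictionaries, not the insert itself.

```python
import timeit

import pandas as pd

# Hypothetical test data: 10,000 rows with the same three columns.
n_rows = 10_000
df = pd.DataFrame({"a": range(n_rows), "b": range(n_rows), "c": range(n_rows)})


def via_to_dict(frame):
    """Convert with pandas directly."""
    return frame.to_dict("records")


def via_records(frame):
    """Convert through a NumPy record array."""
    rec = frame.to_records(index=False)
    names = rec.dtype.names
    return [dict(zip(names, row)) for row in rec.tolist()]


# Time five runs of each conversion.
t_to_dict = timeit.timeit(lambda: via_to_dict(df), number=5)
t_records = timeit.timeit(lambda: via_records(df), number=5)
print(f"to_dict('records'):  {t_to_dict:.3f} s for 5 runs")
print(f"to_records + tolist: {t_records:.3f} s for 5 runs")
```

Timing `collection.insert_many(...)` the same way against your real server should show whether the conversion or the insert dominates for your data.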
Upvotes: 3