Reputation: 71
I'm new to Python and working on a program as a useful tool for my work. I have a messy, massive amount of data from different sources, and it would save me an enormous amount of time if I could store the data sets as I collect them. So I'm looking to put it together for personal use as quickly as possible, then continue working on it to improve the code and open it up to my colleagues once I can work out effective and secure data sharing. In all likelihood, the code isn't going to start off very efficient.
The program should write and read (i.e. search for objects in) a dictionary of 6 arrays. Ideally, the program will also format and write the data to a fixed-layout document that may be printed. A quick estimate is that a "complete" dictionary would hold between 300,000 and 400,000 items.
Considering the mutability of the dictionary and its size, is JSON the best way to store it? And considering that anybody using the program would, in most instances, not be on a particularly high-performing computer, would this overload the client?
Input string:
citation value source origin stem equivalence
Desired output:
- ORIGIN1
Stem1
Source1: citation1 value1 equivalence1, citation3 value3 equivalence3;
Source2: citation2 value2 equivalence2
Stem2
etc
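To make the question concrete, here is a minimal sketch of what I have in mind (field names are taken from my input string; the file name and sample values are made up): each line becomes a dict of six fields, the whole collection is dumped to JSON, and the records are regrouped origin → stem → source for the printed layout.

```python
import json
from collections import defaultdict

# Each record holds the six fields from one input line (sample data, made up).
records = [
    {"citation": "citation1", "value": "value1", "source": "Source1",
     "origin": "ORIGIN1", "stem": "Stem1", "equivalence": "equivalence1"},
    {"citation": "citation2", "value": "value2", "source": "Source2",
     "origin": "ORIGIN1", "stem": "Stem1", "equivalence": "equivalence2"},
]

# Persist and reload with json; simple, but the whole file is rewritten
# on every save, which may matter at 300,000+ items.
with open("records.json", "w") as f:
    json.dump(records, f)
with open("records.json") as f:
    records = json.load(f)

# Group origin -> stem -> source for the fixed-layout printout.
grouped = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
for r in records:
    grouped[r["origin"]][r["stem"]][r["source"]].append(
        f'{r["citation"]} {r["value"]} {r["equivalence"]}')

for origin, stems in grouped.items():
    print(f"- {origin}")
    for stem, sources in stems.items():
        print(f"  {stem}")
        for source, entries in sources.items():
            print(f"    {source}: " + "; ".join(entries))
```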
Upvotes: 0
Views: 69
Reputation: 1790
I think dict is not the tool you should use.
It is used to store a mapping. The amount of data is not the issue here: 300,000-400,000 items is fair but not huge (if your data is mainly text, your dict's size would be less than that of a 720p movie).
But if your data should in the end be structured, so that it can be queried and manipulated, then specific and really well-designed tools already exist out there.
Specifically these two modules, both included with the Anaconda installation:
sqlite3
to store the data in a database, when the data already has a fixed schema
pandas
and its dataframes. It can handle data less structured than sqlite3 requires, can read from and write to sqlite3 databases, and has a lot of great utility functions for data cleaning.
As you still seem to be unsure about the final schema of your data, I would go for pandas if I were you, but it is less simple than a plain dict.
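As a rough sketch of what the pandas route looks like (column names come from your input string; the table and file names are made up), you can keep the data in a DataFrame, persist it in a sqlite3 database, and query only what you need, so the client never loads the full set at once:

```python
import sqlite3
import pandas as pd

# Build a DataFrame with the six columns from the input string (sample rows).
df = pd.DataFrame(
    [("citation1", "value1", "Source1", "ORIGIN1", "Stem1", "equivalence1"),
     ("citation2", "value2", "Source2", "ORIGIN1", "Stem1", "equivalence2")],
    columns=["citation", "value", "source", "origin", "stem", "equivalence"],
)

# Persist to a sqlite3 database, then read back only the rows you want;
# this scales to hundreds of thousands of rows without loading everything.
with sqlite3.connect("collected.db") as conn:
    df.to_sql("records", conn, if_exists="replace", index=False)
    back = pd.read_sql("SELECT * FROM records WHERE origin = 'ORIGIN1'", conn)

# Grouping for the printed layout is then a one-liner per level.
for (origin, stem), block in back.groupby(["origin", "stem"]):
    print(origin, stem, len(block))
```

The same `to_sql`/`read_sql` pair also gives you a migration path: once the schema settles, you can query the sqlite3 file directly without pandas at all.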
Upvotes: 1