randomrabbit

Reputation: 322

Large Python dictionary: storing, loading, and writing to it

I have a large Python dictionary of values (around 50 GB), and I've stored it as a JSON file. I am having efficiency issues opening the file and writing to it. I know you can use ijson to read the file efficiently, but how can I write to it efficiently?

Should I even be using a Python dictionary to store my data? Is there a limit to how large a Python dictionary can be? (The dictionary will keep growing.)

The data basically stores the path length between nodes in a large graph. I can't store the data as a graph because searching for a connection between two nodes takes too long.

Any help would be much appreciated. Thank you!
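
For reference, the kind of streaming read I mean with ijson looks roughly like this (the file name is illustrative, and kvitems is only available in recent ijson versions):

    import ijson

    # Streams top-level key/value pairs one at a time instead of loading
    # the whole 50 GB dictionary into memory (file name is illustrative).
    with open("path_lengths.json", "rb") as f:
        for key, value in ijson.kvitems(f, ""):
            ...  # inspect or filter a single entry here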

Upvotes: 7

Views: 2520

Answers (2)

frankegoesdown

Reputation: 1924

Try using pandas: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html

pandas.read_json(path_or_buf=None, orient=None, typ='frame', dtype=True, convert_axes=True, convert_dates=True, keep_default_dates=True, numpy=False, precise_float=False, date_unit=None, encoding=None, lines=False, chunksize=None, compression='infer')
Convert a JSON string to pandas object

It's a lightweight and useful library for working with large data.
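
A minimal sketch of chunked reading, assuming the data were first re-exported as line-delimited JSON (the file name and column layout are illustrative):

    import pandas as pd

    # Illustrative: assumes one record per line, e.g.
    # {"source": "a", "target": "b", "path_length": 3}
    reader = pd.read_json("path_lengths.jsonl", lines=True, chunksize=100_000)

    for chunk in reader:                              # each chunk is a DataFrame
        short = chunk[chunk["path_length"] < 10]      # work on one chunk at a time
        print(len(short))

Note that chunksize only takes effect together with lines=True, so the single 50 GB JSON object would need to be rewritten as one record per line first.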

Upvotes: 1

HakunaMaData

Reputation: 1321

Although it will depend on what operations you want to perform on your network dataset, you might want to consider storing this as a pandas DataFrame and then writing it to disk using Parquet or Arrow.

That data could then be loaded into networkx or even into Spark (GraphX) for any network-related operations.

Parquet is compressed and columnar, which makes reading and writing files much faster, especially for large datasets.

From the Pandas Doc:

Apache Parquet provides a partitioned binary columnar serialization for data frames. It is designed to make reading and writing data frames efficient, and to make sharing data across data analysis languages easy. Parquet can use a variety of compression techniques to shrink the file size as much as possible while still maintaining good read performance.

Parquet is designed to faithfully serialize and de-serialize DataFrames, supporting all of the pandas dtypes, including extension dtypes such as datetime with tz.

Read further here: Pandas Parquet
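
As a rough sketch of that workflow (column names and file name are illustrative, and to_parquet needs pyarrow or fastparquet installed):

    import pandas as pd

    # Illustrative in-memory dictionary: (source, target) -> path length
    path_lengths = {("a", "b"): 3, ("a", "c"): 5, ("b", "c"): 2}

    # Flatten the dictionary into a columnar table
    df = pd.DataFrame(
        [(s, t, d) for (s, t), d in path_lengths.items()],
        columns=["source", "target", "path_length"],
    )

    # Write to / read back from Parquet
    df.to_parquet("path_lengths.parquet", index=False)
    df = pd.read_parquet("path_lengths.parquet")

    # Columnar filtering is fast, e.g. all paths starting at "a"
    print(df[df["source"] == "a"])

If you ever need graph operations again, networkx can build a graph directly from such a DataFrame with nx.from_pandas_edgelist.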

Upvotes: 2
