How to store properly huge amount of data in an ordered way?

Question

I have a huge amount of data that I need to organize. I think one of the good solutions might be to use a pandas DF and then to pickle the DF to save it. Only problem, I still don't understand how pandas works, and I don't know what approach should be considered in regard of my data.

Data: list of the form ((int, int), float, float) but the first tuple of integer can be of different size.

Example 1:

[((78, 104), 1.55, 0.25),
((78, 104), 1.56, 0.25),
((78, 104), 1.57, 0.25),
((78, 104), 1.58, 0.25),
((75, 100), 5.02, 0.25),
((75, 100), 5.03, 0.25),
((75, 100), 5.04, 0.25),
((75, 100), 5.05, 0.25),
((78, 104), 1.25, 0.333),
((78, 104), 1.26, 0.333)]

Example 2:

[((20, 78, 104), 1.55, 0.25),
((20, 78, 104), 1.56, 0.25),
((21, 78, 104), 1.57, 0.25),
((21, 78, 104), 1.58, 0.25),
((18, 75, 100), 5.02, 0.25),
((18, 75, 100), 5.03, 0.25),
((18, 75, 100), 5.04, 0.25),
((18, 75, 100), 5.05, 0.25),
((20, 78, 104), 1.25, 0.333),
((20, 78, 104), 1.26, 0.333)]

These are just extracts. At the moment I wrote the data in .txt files and then parse it back with string methods when I read a file. As you can see, the tuple can be common to a lot of value (hundreds or thousands) and have a len of 2 or more.

Another way to represent this data is a dictionary of the form: Dict[tuple] = ([list of column 1], [list of column 2]):

data_dict = dict()
for elt in data_list:
    if elt[0] not in data_dict.keys():
        data_dict[elt[0]] = ([elt[1]], [elt[2]])
    else:
        data_dict[elt[0]][0].append(elt[1])
        data_dict[elt[0]][1].append(elt[2])

Example 1:

data_list = [((78, 104), 1.55, 0.25), ((78, 104), 1.56, 0.25), ((78, 104), 1.57, 0.25), ((78, 104), 1.58, 0.25),
((75, 100), 5.02, 0.25), ((75, 100), 5.03, 0.25), ((75, 100), 5.04, 0.25), ((75, 100), 5.05, 0.25), ((78, 104), 1.25, 0.333),
((78, 104), 1.26, 0.333)]

Output:
{(78, 104): ([1.55, 1.56, 1.57, 1.58, 1.25, 1.26], [0.25, 0.25, 0.25, 0.25, 0.333, 0.333]), 
(75, 100): ([5.02, 5.03, 5.04, 5.05], [0.25, 0.25, 0.25, 0.25])}

Example 2:

data_list = [((20, 78, 104), 1.55, 0.25), ((20, 78, 104), 1.56, 0.25), ((21, 78, 104), 1.57, 0.25), ((21, 78, 104), 1.58, 0.25),
((18, 75, 100), 5.02, 0.25), ((18, 75, 100), 5.03, 0.25), ((18, 75, 100), 5.04, 0.25), ((18, 75, 100), 5.05, 0.25), ((20, 78, 104), 1.25, 0.333),
((20, 78, 104), 1.26, 0.333)]

Output:
{(20, 78, 104): ([1.55, 1.56, 1.25, 1.26], [0.25, 0.25, 0.333, 0.333]), 
(21, 78, 104): ([1.57, 1.58], [0.25, 0.25]), 
(18, 75, 100): ([5.02, 5.03, 5.04, 5.05], [0.25, 0.25, 0.25, 0.25])}

What are the possible way (and what is the best one) to store this data in a file, and to access it in an efficient way?

A combination is: ((78, 104), 1.57, 0.25) For the access part, I will, for instance, need to look in file 1 for combinations using the tuple (78, 104) and then in file 2 for the combinations using the tuple (20, 78). The final goal is to find every matching combination, i.e the 2 values after (78, 104) in file 1 must be the same as the 2 values after (20, 78) in file 2. Thus I need to access quickly the interesting combinations of each file.

Thanks for any advice, code, and help on how to represent this data and to store it.

If you ask for, I can put the code for the store / read of .txt file, but I think we will all agree that this is not the best approach for this problem.

artona · Accepted Answer

import pandas as pd
ex_first = [((78, 104), 1.55, 0.25),
    ((78, 104), 1.56, 0.25),
    ((78, 104), 1.57, 0.25),
        ((78, 104), 1.58, 0.25),
    ((75, 100), 5.02, 0.25),
    ((75, 100), 5.03, 0.25),
    ((75, 100), 5.04, 0.25),
    ((75, 100), 5.05, 0.25),
    ((78, 104), 1.25, 0.333),
    ((78, 104), 1.26, 0.333)]

data = pd.DataFrame(ex_first)
data.columns = ["tuple1", "name1", "name2"]
# saving
data.to_csv("DataFrame.csv")

If you want to retrieve the values that tuple is "(78, 104)" you simple look up through:

In [67]: results = data[data.tuple1==(78, 104)]

In [68]: results
Out[68]: 
  tuple1  name1  name2
0  (78, 104)   1.55  0.250
1  (78, 104)   1.56  0.250
2  (78, 104)   1.57  0.250
3  (78, 104)   1.58  0.250
8  (78, 104)   1.25  0.333
9  (78, 104)   1.26  0.333

And if the csv file is big -> read how to use chunks

How to store properly huge amount of data in an ordered way?

Answers (1)

Related Questions