Paul

Reputation: 558

large nested lists versus dictionaries

Please could I solicit some general advice regarding Python lists. I know I shouldn't ask 'open' questions on here, but I am worried about setting off on completely the wrong path.

My problem is that I have .csv files that are approximately 600,000 lines long each. Each row of the .csv has 6 fields, of which the first field is a date-time stamp in the format DD/MM/YYYY HH:MM:SS. The next two fields are blank and the last three fields contain float and integer values, so for example:

23/05/2017 16:42:17,  ,   , 1.25545, 1.74733, 12 
23/05/2017 16:42:20,  ,   , 1.93741, 1.52387, 14 
23/05/2017 16:42:23,  ,   , 1.54875, 1.46258, 11

etc

No two values in column 1 (date-time stamp) will ever be the same.

I need to write a program that will do a few basic operations with the data, such as:

  1. read all of the data into a dictionary, list, set (?) etc as appropriate.
  2. search the date-time stamp column for a particular value.
  3. read through the list and do basic calculations on the floats in columns 4 and 5.
  4. write a new list based on the searches/calculations.

My question is - how should I 'handle' the data and am I likely to run into problems due to the length of the dataset?

For example, should I import all of the data into a list, where each element of the list is a sublist containing one row's data? E.g.:

[['23/05/2017 16:42:17', '', '', 1.25545, 1.74733, 12], ['23/05/2017 16:42:20', '', '', 1.93741, 1.52387, 14], ...]

Or would it be better to make each date-time stamp the 'key' in a dictionary and make the dictionary 'value' a list with all the other values, e.g:

{'23/05/2017 16:42:17': ['', '', 1.25545, 1.74733, 12], ...} etc

If I use the list approach, is there a way to get Python to 'search' only the first column for a particular timestamp, rather than searching through 600,000 rows times 6 columns, when we know that only the first column contains timestamps?

I apologize if my query is a little vague, but would appreciate any guidance that anyone can offer.

Upvotes: 1

Views: 2868

Answers (1)

Eric Duminil

Reputation: 54313

600,000 lines aren't that many; your script should run fine with either a list or a dict.

As a test, let's use:

data = [["2017-05-02 17:28:24", 0.85260, 1.16218, 7],
        ["2017-05-04 05:40:07", 0.72118, 0.47710, 15],
        ["2017-05-07 19:27:53", 1.79476, 0.47496, 14],
        ["2017-05-09 01:57:10", 0.44123, 0.13711, 16],
        ["2017-05-11 07:22:57", 0.17481, 0.69468, 0],
        ["2017-05-12 10:11:01", 0.27553, 0.47834, 4],
        ["2017-05-15 05:20:36", 0.01719, 0.51249, 7],
        ["2017-05-17 14:01:13", 0.35977, 0.50052, 7],
        ["2017-05-17 22:05:33", 1.68628, 1.90881, 13],
        ["2017-05-18 14:44:14", 0.32217, 0.96715, 14],
        ["2017-05-18 20:24:23", 0.90819, 0.36773, 5],
        ["2017-05-21 12:15:20", 0.49456, 1.12508, 5],
        ["2017-05-22 07:46:18", 0.59015, 1.04352, 6],
        ["2017-05-26 01:49:38", 0.44455, 0.26669, 13],
        ["2017-05-26 18:55:24", 1.33678, 1.24181, 7]]

dict

If you're looking for exact timestamps, a lookup will be much faster with a dict than with a list. You have to know exactly what you're looking for though: "23/05/2017 16:42:17" has a completely different hash than "23/05/2017 16:42:18".

data_as_dict = {l[0]: l[1:] for l in data}
print(data_as_dict)
# {'2017-05-21 12:15:20': [0.49456, 1.12508, 5], '2017-05-18 14:44:14': [0.32217, 0.96715, 14], '2017-05-04 05:40:07': [0.72118, 0.4771, 15], '2017-05-26 01:49:38': [0.44455, 0.26669, 13], '2017-05-17 14:01:13': [0.35977, 0.50052, 7], '2017-05-15 05:20:36': [0.01719, 0.51249, 7], '2017-05-26 18:55:24': [1.33678, 1.24181, 7], '2017-05-07 19:27:53': [1.79476, 0.47496, 14], '2017-05-17 22:05:33': [1.68628, 1.90881, 13], '2017-05-02 17:28:24': [0.8526, 1.16218, 7], '2017-05-22 07:46:18': [0.59015, 1.04352, 6], '2017-05-11 07:22:57': [0.17481, 0.69468, 0], '2017-05-18 20:24:23': [0.90819, 0.36773, 5], '2017-05-12 10:11:01': [0.27553, 0.47834, 4], '2017-05-09 01:57:10': [0.44123, 0.13711, 16]}

print(data_as_dict.get('2017-05-17 14:01:13'))
# [0.35977, 0.50052, 7]

print(data_as_dict.get('2017-05-17 14:01:10'))
# None

Note that your DD/MM/YYYY HH:MM:SS format isn't very convenient: sorting the cells lexicographically won't sort them by datetime. You'd need to parse the strings with datetime.strptime() first ('%Y-%m-%d %H:%M:%S' for the test data above, '%d/%m/%Y %H:%M:%S' for your format):

from datetime import datetime
data_as_dict = {datetime.strptime(l[0], '%Y-%m-%d %H:%M:%S'): l[1:] for l in data}    
print(data_as_dict.get(datetime(2017,5,17,14,1,13)))
# [0.35977, 0.50052, 7]

print(data_as_dict.get(datetime(2017,5,17,14,1,10)))
# None

list with binary search

If you're looking for timestamp ranges, a dict won't help you much. A binary search (e.g. with the bisect module) on a sorted list of timestamps should be very fast.

import bisect
timestamps = [datetime.strptime(l[0], '%Y-%m-%d %H:%M:%S') for l in data]
# index of the first row whose timestamp is greater than the search value:
i = bisect.bisect(timestamps, datetime(2017, 5, 17, 14, 1, 10))
print(data[i-1])
# ['2017-05-15 05:20:36', 0.01719, 0.51249, 7]
print(data[i])
# ['2017-05-17 14:01:13', 0.35977, 0.50052, 7]
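
The same idea gives you whole ranges: since the list is sorted by timestamp, two bisections delimit a slice (the bounds below are just example values):

# reusing the sorted `timestamps` list from above:
lo = bisect.bisect_left(timestamps, datetime(2017, 5, 10))
hi = bisect.bisect_right(timestamps, datetime(2017, 5, 18))
print(data[lo:hi])
# [['2017-05-11 07:22:57', 0.17481, 0.69468, 0], ['2017-05-12 10:11:01', 0.27553, 0.47834, 4], ['2017-05-15 05:20:36', 0.01719, 0.51249, 7], ['2017-05-17 14:01:13', 0.35977, 0.50052, 7], ['2017-05-17 22:05:33', 1.68628, 1.90881, 13]]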

Database

Before reinventing the wheel, you might want to dump all your CSVs into a small database (SQLite, PostgreSQL, ...) and use the corresponding queries.
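
With the built-in sqlite3 module, a minimal sketch might look like this (the table and file names are placeholders, and timestamps are stored as plain text):

import csv
import sqlite3

conn = sqlite3.connect('measurements.db')  # placeholder filename
conn.execute('CREATE TABLE IF NOT EXISTS readings '
             '(ts TEXT PRIMARY KEY, a REAL, b REAL, c INTEGER)')

with open('mydata.csv', newline='') as f:  # placeholder filename
    rows = ((r[0].strip(), float(r[3]), float(r[4]), int(r[5]))
            for r in csv.reader(f))
    conn.executemany('INSERT INTO readings VALUES (?, ?, ?, ?)', rows)
conn.commit()

# the PRIMARY KEY index makes exact lookups fast:
print(conn.execute('SELECT * FROM readings WHERE ts = ?',
                   ('23/05/2017 16:42:17',)).fetchone())

If you store the timestamps in ISO format (YYYY-MM-DD HH:MM:SS) instead, text order matches chronological order, so BETWEEN range queries work directly on the column.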

Pandas

If you don't want the added complexity of a database but are ready to invest some time learning a new syntax, you should use pandas.DataFrame. It does exactly what you want, and then some.
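
A short sketch of what that could look like (the filename and column names are made up; dayfirst=True handles your DD/MM/YYYY order, and the slicing assumes the rows are in chronological order):

import pandas as pd

df = pd.read_csv('mydata.csv', header=None,  # placeholder filename
                 names=['ts', 'blank1', 'blank2', 'a', 'b', 'c'],
                 parse_dates=['ts'], dayfirst=True, index_col='ts')

print(df.loc['2017-05-23 16:42:17'])      # exact lookup on the datetime index
print(df.loc['2017-05-23':'2017-05-24'])  # whole days, sliced with partial strings
df['ratio'] = df['a'] / df['b']           # vectorized arithmetic on columns 4 and 5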

Upvotes: 2
