I have a dataset in the form of a list of dicts. I would like to loop through it and extract the subset of rows whose values match entries in another list. I am currently doing this with two nested for loops, as shown in the sample code below, but I'm sure this is very inefficient, and it takes an extremely long time with large lists.
Example data in CSV format:
╔══════════════╦══════════════╦═══════════════╦════════════════╦═════════════════════════╗
║ City         ║ State        ║ 2013 Estimate ║ 2013 Land Area ║ 2013 Population Density ║
╠══════════════╬══════════════╬═══════════════╬════════════════╬═════════════════════════╣
║ New York     ║ New York     ║ 8405837       ║ 302.6 sq mi    ║ 27012 per sq mi         ║
║ Los Angeles  ║ California   ║ 3884307       ║ 468.7 sq mi    ║ 8092 per sq mi          ║
║ Chicago      ║ Illinois     ║ 2718782       ║ 227.6 sq mi    ║ 11842 per sq mi         ║
║ Houston      ║ Texas        ║ 2195914       ║ 599.6 sq mi    ║ 3501 per sq mi          ║
║ Philadelphia ║ Pennsylvania ║ 1553165       ║ 134.1 sq mi    ║ 11379 per sq mi         ║
║ Phoenix      ║ Arizona      ║ 1513367       ║ 516.7 sq mi    ║ 2798 per sq mi          ║
║ San Antonio  ║ Texas        ║ 1409019       ║ 460.9 sq mi    ║ 2880 per sq mi          ║
║ San Diego    ║ California   ║ 1355896       ║ 325.2 sq mi    ║ 4020 per sq mi          ║
║ Dallas       ║ Texas        ║ 1257676       ║ 340.5 sq mi    ║ 3518 per sq mi          ║
║ San Jose     ║ California   ║ 998537        ║ 176.5 sq mi    ║ 5359 per sq mi          ║
╚══════════════╩══════════════╩═══════════════╩════════════════╩═════════════════════════╝
Sample code:
# read the CSV into a list of dicts (Python 2, hence the 'rb' mode)
import csv
with open('data.csv', 'rb') as csv_file:
    data = list(csv.DictReader(csv_file))
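# Each element of data ends up as a dict keyed by the header row, with every
# value read in as a string, e.g. for the first row:
#   {'City': 'New York', 'State': 'New York', '2013 Estimate': '8405837',
#    '2013 Land Area': '302.6 sq mi', '2013 Population Density': '27012 per sq mi'}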
# cities of interest to extract from the larger dataset
int_cities = ['New York', 'Houston', 'Phoenix', 'San Jose']

# loop through data and collect every row whose 'City' is in int_cities
int_cities_data = []
for i in data:
    for u in int_cities:
        if i['City'] == u:
            int_cities_data.append(i)
As I said, this currently works, but it takes a very long time when I have to loop through ~2M rows in data and check each one for a match against another 50k entries in int_cities. How can I make this more efficient?
Edit: I forgot to mention that the data is too large to use csv.DictReader, so I have been using the following to read my data into a list of dicts (after removing the header row). I also tried to modify it to filter on int_cities while reading; my first, untested attempt filtered on the raw line (which is still a plain string at that point, so indexing it by 'City' cannot work). The corrected form builds each row's dict before filtering:

header = ['City', 'State', '2013 Estimate', '2013 Land Area', '2013 Population Density']
# build each row's dict first, then filter on it; 'line' itself is just a string
rows = (dict(zip(header, line.strip().split(','))) for line in open('data.csv'))
data = [row for row in rows if row['City'] in int_cities]
Upvotes: 1
Views: 131
Instead of reading all the data in the file into a list, then iterating over that list to search for the cities you want, iterate over the csv file one line at a time, and only add items to the list if they're for the cities you care about. That way you don't need to store the entire file in memory, and you don't need to iterate over it twice (once to build the complete list, then again to pull the entries you care about out of it).
Additionally, store the cities you care about in a set instead of a list, so you can do lookups in O(1) time instead of O(n). This will likely drastically improve performance if you're doing lots of lookups (and it sounds like you are).
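As a rough illustration of that difference, here is a minimal timing sketch (the 50k-entry list here is synthetic, standing in for a large int_cities):

import timeit

names = ['city%d' % i for i in range(50000)]  # synthetic stand-in for 50k city names
as_list = names
as_set = set(names)

# worst case for the list: the value we look up sits at the very end
print(timeit.timeit(lambda: 'city49999' in as_list, number=1000))  # O(n) linear scan
print(timeit.timeit(lambda: 'city49999' in as_set, number=1000))   # O(1) hash lookup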
# read only the matching rows into a list of dicts
import csv

int_cities = set(['New York', 'Houston', 'Phoenix', 'San Jose'])
int_cities_data = []
with open('data.csv', 'rb') as csv_file:
    for line in csv.DictReader(csv_file):
        if line['City'] in int_cities:
            int_cities_data.append(line)
Or as a list comprehension:
with open('data.csv', 'rb') as csv_file:
    int_cities_data = [line for line in csv.DictReader(csv_file)
                       if line['City'] in int_cities]
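One note if you're on Python 3: the 'rb' mode above is for Python 2's csv module. Python 3's csv expects the file opened in text mode with newline='' (per the csv docs), so the equivalent would be this sketch, using the same int_cities set as above:

with open('data.csv', newline='') as csv_file:
    int_cities_data = [line for line in csv.DictReader(csv_file)
                       if line['City'] in int_cities]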
Upvotes: 4