user20603914

How can I efficiently handle filtering and processing large datasets in Python with limited memory?

I'm working with a large dataset (around 1 million records) represented as a list of dictionaries in Python. Each dictionary has multiple fields, and I need to filter the data based on several conditions, then process the filtered results. The main challenge is that the dataset is too large to fit into memory all at once, and I need an efficient solution to both filter and process the data in a memory-conscious manner.

Here’s a simplified version of what I’m trying to achieve:

Filter records where age > 25 and status == 'active'. For the filtered records, extract certain fields, such as name and email, and process them (e.g., convert names to lowercase and extract the domain from each email).

# Sample dataset
data = [
    {'name': 'Alice', 'age': 30, 'status': 'active', 'email': '[email protected]'},
    {'name': 'Bob', 'age': 22, 'status': 'inactive', 'email': '[email protected]'},
    {'name': 'Charlie', 'age': 35, 'status': 'active', 'email': '[email protected]'},
    # More records...
]

# Attempted approach
def process_record(record):
    # Process the record, e.g., lowercase name, extract email domain
    record['name'] = record['name'].lower()
    record['email_domain'] = record['email'].split('@')[1]
    return record

filtered_and_processed = []
for record in data:
    if record['age'] > 25 and record['status'] == 'active':
        processed_record = process_record(record)
        filtered_and_processed.append(processed_record)

# Output the results
print(filtered_and_processed)

Upvotes: 1

Views: 108

Answers (1)

C.Nivs

Reputation: 13106

The best way to handle this is to iterate over a stream of records rather than aggregating them all into memory. There are a few ways you can do this.

Approach 1: Record-by-Record Stream

You can accomplish this with the csv module by iterating directly over the file handle:

import csv

with open('yourfile.csv', newline='') as infile, open('outfile.csv', 'w', newline='') as outfile:
    reader = csv.DictReader(infile)
    # DictWriter needs the output fieldnames; include the new email_domain column
    writer = csv.DictWriter(outfile, fieldnames=['name', 'age', 'status', 'email', 'email_domain'])
    writer.writeheader()

    # Iterate directly over the reader
    for row in reader:
        # csv reads every value as a string, so cast age before comparing
        if int(row['age']) <= 25 or row['status'] != 'active':
            continue

        # Write each row, don't aggregate
        writer.writerow(process_record(row))

This will handle all of your records in a stream and is very memory efficient.
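
If your records come from some source other than a CSV file (say, any other iterable of dicts), the same streaming idea can be written as a generator. Here is a minimal sketch that reuses process_record and the sample data from your question; swap data for whatever actually yields your records:

def stream_records(records):
    # Lazily filter and process an iterable of record dicts,
    # yielding one processed record at a time
    for record in records:
        if record['age'] > 25 and record['status'] == 'active':
            yield process_record(record)

# Only the current record is ever held in memory
for record in stream_records(data):
    print(record['name'], record['email_domain'])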

Approach 2: Use pandas to read the CSV in chunks

This uses the pandas library to read a fixed number of records per iteration (one million in this example). It could also leverage vectorization for your processing, but I have not vectorized your approach here.

import pandas as pd

with pd.read_csv('yourfile.csv', chunksize=10**6) as reader:
    # Iterate over the chunks
    for chunk in reader:
        # Do the filtering first, then apply your function to each row (axis=1)
        df = chunk[(chunk['age'] > 25) & (chunk['status'] == 'active')].apply(process_record, axis=1)
        # Append to outfile.csv; header=False avoids repeating the header for every chunk
        df.to_csv('outfile.csv', mode='a', index=False, header=False)
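
If you do want to vectorize the processing rather than call process_record row by row, a sketch along these lines should work for the two transformations in your question (same hypothetical yourfile.csv/outfile.csv names as above; the enumerate check just writes the header once):

import pandas as pd

with pd.read_csv('yourfile.csv', chunksize=10**6) as reader:
    for i, chunk in enumerate(reader):
        # Filter, then copy so the original chunk is never modified
        df = chunk[(chunk['age'] > 25) & (chunk['status'] == 'active')].copy()
        # Vectorized equivalents of process_record
        df['name'] = df['name'].str.lower()
        df['email_domain'] = df['email'].str.split('@').str[1]
        # Write the header only on the first chunk
        df.to_csv('outfile.csv', mode='a', index=False, header=(i == 0))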

Upvotes: 5
