user20603914

How can I efficiently handle filtering and processing large datasets in Python with limited memory?

I'm working with a large dataset (around 1 million records) represented as a list of dictionaries in Python. Each dictionary has multiple fields, and I need to filter the data based on several conditions, then process the filtered results. The main challenge is that the dataset is too large to fit into memory all at once, and I need an efficient solution to both filter and process the data in a memory-conscious manner.

Here’s a simplified version of what I’m trying to achieve:

Filter records where age > 25 and status == 'active'. For the filtered records, extract certain fields, such as name and email, and process them (e.g., convert names to lowercase and extract the domain from each email).

# Sample dataset
data = [
    {'name': 'Alice', 'age': 30, 'status': 'active', 'email': '[email protected]'},
    {'name': 'Bob', 'age': 22, 'status': 'inactive', 'email': '[email protected]'},
    {'name': 'Charlie', 'age': 35, 'status': 'active', 'email': '[email protected]'},
    # More records...
]

# Attempted approach
def process_record(record):
    # Process the record, e.g., lowercase name, extract email domain
    record['name'] = record['name'].lower()
    record['email_domain'] = record['email'].split('@')[1]
    return record

filtered_and_processed = []
for record in data:
    if record['age'] > 25 and record['status'] == 'active':
        processed_record = process_record(record)
        filtered_and_processed.append(processed_record)

# Output the results
print(filtered_and_processed)

Upvotes: 1

Views: 108

Answers (1)

C.Nivs

Reputation: 13106

The best way to handle this is to iterate over a stream of records rather than aggregating them all into memory. There are a few ways you can do this.

Approach 1: Record-by-Record Stream

You can accomplish this with the csv module by iterating directly over the file handle:

import csv

with open('yourfile.csv', newline='') as infile, open('outfile.csv', 'w', newline='') as outfile:
    reader = csv.DictReader(infile)
    # DictWriter needs the output fieldnames; include the new email_domain column
    writer = csv.DictWriter(outfile, fieldnames=['name', 'age', 'status', 'email', 'email_domain'])
    writer.writeheader()

    # Iterate directly over the reader
    for row in reader:
        # csv reads every value as a string, so cast age before comparing
        if int(row['age']) <= 25 or row['status'] != 'active':
            continue

        # Write each row, don't aggregate
        writer.writerow(process_record(row))

This will handle all of your records in a stream and is very memory efficient.
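
If your records come from some source other than a CSV file (say, any other iterable of dicts), the same streaming idea can be written as a generator. Here is a minimal sketch that reuses process_record and the sample data from your question; swap data for whatever actually yields your records:

def stream_records(records):
    # Lazily filter and process an iterable of record dicts,
    # yielding one processed record at a time
    for record in records:
        if record['age'] > 25 and record['status'] == 'active':
            yield process_record(record)

# Only the current record is ever held in memory
for record in stream_records(data):
    print(record['name'], record['email_domain'])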

Approach 2: Use pandas to read the CSV in chunks

This uses the pandas library to read a fixed number of records per iteration (one million in this example). It could also leverage vectorization for your processing, but I have not vectorized your approach here.

import pandas as pd

with pd.read_csv('yourfile.csv', chunksize=10**6) as reader:
    # Iterate over the chunks
    for chunk in reader:
        # Do the filtering first, then apply your function to each row (axis=1)
        df = chunk[(chunk['age'] > 25) & (chunk['status'] == 'active')].apply(process_record, axis=1)
        # Append to outfile.csv; header=False avoids repeating the header for every chunk
        df.to_csv('outfile.csv', mode='a', index=False, header=False)
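
If you do want to vectorize the processing rather than call process_record row by row, a sketch along these lines should work for the two transformations in your question (same hypothetical yourfile.csv/outfile.csv names as above; the enumerate check just writes the header once):

import pandas as pd

with pd.read_csv('yourfile.csv', chunksize=10**6) as reader:
    for i, chunk in enumerate(reader):
        # Filter, then copy so the original chunk is never modified
        df = chunk[(chunk['age'] > 25) & (chunk['status'] == 'active')].copy()
        # Vectorized equivalents of process_record
        df['name'] = df['name'].str.lower()
        df['email_domain'] = df['email'].str.split('@').str[1]
        # Write the header only on the first chunk
        df.to_csv('outfile.csv', mode='a', index=False, header=(i == 0))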

Upvotes: 5
