YorHa2B

Reputation: 35

Memory Usage Continues to Increase When Repeatedly Reading Parquet Files

I'm experiencing an issue where my Python script's memory usage continuously increases during repeated processing of Parquet files using PyArrow—even after explicitly deleting objects and forcing garbage collection. The behavior is similar when I use other libraries like Polars and Pandas.

I have a function that reads a Parquet file, filters rows based on a date range, and then attempts to free memory. I've tried explicit del statements, calling gc.collect(), and even using PyArrow’s memory pool functions like pa.jemalloc_set_decay_ms(0) and pool.release_unused(). Despite these efforts, the resident memory usage of my process keeps growing over successive iterations.

Here’s a simplified version of my code:

import psutil
import time
import gc
import pyarrow.parquet as pq
import pyarrow as pa

def print_memory_usage():
    process = psutil.Process()
    mem_info = process.memory_info()
    print(f"Memory Usage: {mem_info.rss / 1024 / 1024:.2f} MB")

def mem_and_time(func):
    def wrapper(*args, **kwargs):
        start_time = time.time()
        result = func(*args, **kwargs)
        end_time = time.time()
        print(f"Execution Time: {end_time - start_time:.6f} seconds")
        print_memory_usage()
        return result
    return wrapper

@mem_and_time
def test_func_pyarrow():
    # Read Parquet file into a PyArrow Table
    # The same memory growth also occurs when I use
    # pd.read_parquet or pl.read_parquet instead
    table = pq.read_table("/Users/test.parquet")
    # Drop the table and try to hand the memory back
    del table
    gc.collect()
    # Ask jemalloc to return freed (dirty) pages to the OS immediately
    pa.jemalloc_set_decay_ms(0)
    # Ask Arrow's default memory pool to release unused memory
    pool = pa.default_memory_pool()
    pool.release_unused()

    return None

if __name__ == "__main__":
    # Run the function repeatedly
    for _ in range(1000):
        test_func_pyarrow()
    # Keep the process alive so the final memory usage can be observed
    time.sleep(20)
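Since the growth also shows up with other readers, these are roughly the equivalent Pandas and Polars bodies I swap in (a sketch assuming the same decorator and file path; pd.read_parquet and pl.read_parquet are the standard readers):

import pandas as pd
import polars as pl

@mem_and_time
def test_func_pandas():
    # Same pattern: read, drop the reference, collect
    df = pd.read_parquet("/Users/test.parquet")
    del df
    gc.collect()
    return None

@mem_and_time
def test_func_polars():
    df = pl.read_parquet("/Users/test.parquet")
    del df
    gc.collect()
    return None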

When running this script, the memory usage output shows a steady increase:

Execution Time: 0.985749 seconds
Memory Usage: 4542.09 MB
Execution Time: 0.873830 seconds
Memory Usage: 5926.19 MB
...
Execution Time: 0.774829 seconds
Memory Usage: 7985.73 MB
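To narrow down whether the growth sits inside Arrow's allocator or is just memory retained by the process, the pool's own accounting can be printed next to the RSS figure. A sketch of such a helper (print_pool_stats is my own name; bytes_allocated(), max_memory(), backend_name, and pa.total_allocated_bytes() are standard PyArrow APIs):

def print_pool_stats():
    # Arrow's view of its allocations, as opposed to the process RSS above
    pool = pa.default_memory_pool()
    print(f"Pool backend: {pool.backend_name}")
    print(f"Pool bytes allocated: {pool.bytes_allocated() / 1024 / 1024:.2f} MB")
    print(f"Pool peak: {pool.max_memory() / 1024 / 1024:.2f} MB")
    print(f"Total Arrow allocations: {pa.total_allocated_bytes() / 1024 / 1024:.2f} MB")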

How can I stop this memory growth? As noted above, it also happens when I call pd.read_parquet or pl.read_parquet. Any insights or recommendations would be greatly appreciated.
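For reference, one experiment that could separate allocator caching from a genuine leak is forcing Arrow onto the system allocator before any reads happen. A sketch (pa.system_memory_pool() and pa.set_memory_pool() are real PyArrow APIs; this is a diagnostic, not a confirmed fix):

import pyarrow as pa

# Route Arrow allocations through the OS allocator instead of
# jemalloc/mimalloc, whose page caching can keep RSS elevated after frees.
# Equivalently, set ARROW_DEFAULT_MEMORY_POOL=system before importing pyarrow.
pa.set_memory_pool(pa.system_memory_pool())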

Python: 3.13.1

Pandas: 2.2.3

PyArrow: 19.0.0

OS: macOS Sequoia 15.3.1

Upvotes: 3

Views: 89

Answers (0)
