My Python script's memory usage keeps increasing while it repeatedly processes Parquet files with PyArrow, even after I explicitly delete objects and force garbage collection. I see the same behavior with other libraries such as Polars and Pandas.
I have a function that reads a Parquet file, filters rows based on a date range, and then attempts to free memory. I've tried explicit del statements, calling gc.collect(), and even using PyArrow’s memory pool functions like pa.jemalloc_set_decay_ms(0) and pool.release_unused(). Despite these efforts, the resident memory usage of my process keeps growing over successive iterations.
Here’s a simplified version of my code:
import psutil
import time
import gc
from datetime import datetime
import pyarrow.parquet as pq
import pyarrow.compute as pc
import pyarrow as pa


def print_memory_usage():
    process = psutil.Process()
    mem_info = process.memory_info()
    print(f"Memory Usage: {mem_info.rss / 1024 / 1024:.2f} MB")


def mem_and_time(func):
    def wrapper(*args, **kwargs):
        start_time = time.time()
        result = func(*args, **kwargs)
        end_time = time.time()
        print(f"Execution Time: {end_time - start_time:.6f} seconds")
        print_memory_usage()
        return result
    return wrapper


@mem_and_time
def test_func_pyarrow():
    # Read Parquet file into a PyArrow Table
    # Memory usage increasing also happens when I call
    # pd.read_parquet or pl.read_parquet
    table = pq.read_table("/Users/test.parquet")
    del table
    gc.collect()
    pa.jemalloc_set_decay_ms(0)
    pool = pa.default_memory_pool()
    pool.release_unused()
    return None


if __name__ == "__main__":
    # Run the function repeatedly
    for _ in range(1000):
        test_func_pyarrow()
        time.sleep(20)
When running this script, the memory usage output shows a steady increase:
Execution Time: 0.985749 seconds
Memory Usage: 4542.09 MB
Execution Time: 0.873830 seconds
Memory Usage: 5926.19 MB
...
Execution Time: 0.774829 seconds
Memory Usage: 7985.73 MB
How can I handle this issue? It also happens when I call pd.read_parquet or pl.read_parquet. Any insights or recommendations would be greatly appreciated.
Python: 3.13.1
Pandas: 2.2.3
PyArrow: 19.0.0
OS: macOS Sequoia 15.3.1