Reputation: 105
I am using Modin in combination with Ray to read a huge CSV file (56 GB with 1.5 billion rows). I sorted the data beforehand using the Linux sort command.
The following code results in multiple workers being killed due to memory pressure, and I doubt that the computation is efficient or will ever finish.
I am using Ray on a local machine with 48 cores and 126 GB of RAM.
How can I tackle this issue efficiently? Unfortunately, I cannot access the web interface to check on things, since it is hosted on an Ubuntu server with no access through the firewall.
Code:
import modin.pandas as pd
import ray
ray.init()
df = pd.read_csv("./file", index_col=0, header=None, names=["1", "2"])
df.groupby('1').sum()
RayContext:
RayContext(dashboard_url='127.0.0.1:8265', python_version='3.8.10', ray_version='2.3.0', ray_commit='cf7a56b4b0b648c324722df7c99c168e92ff0b45', address_info={'node_ip_address': 'XXXXXXX', 'raylet_ip_address': 'XXXXXXX', 'redis_address': None, 'object_store_address': '/tmp/ray/session_2023-04-18_10-24-54_203554_3133/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2023-04-18_10-24-54_203554_3133/sockets/raylet', 'webui_url': '127.0.0.1:8265', 'session_dir': '/tmp/ray/session_2023-04-18_10-24-54_203554_3133', 'metrics_export_port': XXXXXX, 'gcs_address': 'XXXX', 'address': 'XXXXXXX', 'dashboard_agent_listen_port': XXXXXX, 'node_id': 'XXXXXXXXXXXXXXXXXXXXXXXXXX'})
Upvotes: 0
Views: 1131
Reputation: 11
I'm one of the core developers of Modin.
In the current implementation of the read_csv function in Modin, you can expect up to about 2x memory overhead when reading in some cases. In your case, the peak memory consumption is already close to the maximum amount of your available memory.
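To make the "close to the maximum" point concrete, here is a rough back-of-the-envelope estimate. It assumes the 2x factor applies to the on-disk file size, which is a simplification; the real in-memory size depends on the column dtypes.

```python
# Figures taken from the question.
file_size_gb = 56
ram_gb = 126

# "Up to about 2x" overhead, applied to the file size (an assumption).
overhead_factor = 2

peak_gb = file_size_gb * overhead_factor
headroom_gb = ram_gb - peak_gb

print(f"estimated peak: {peak_gb} GB, headroom: {headroom_gb} GB")
```

Under these assumptions very little memory is left for the groupby itself and for everything else running on the machine.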
What I would like to suggest to you:
1. Try not calling ray.init() yourself before the read_csv function. Have you had problems with this? I believe that in some cases this can help, as Modin tries to find the best initialization parameters for its work.
2. Try reading the file with plain pandas and passing the result to Modin: pd.DataFrame(pandas.read_csv(...)).
From our side, we will try to look for solutions to reduce memory consumption: https://github.com/modin-project/modin/issues/6018
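If reading the whole file with plain pandas in one go also runs out of memory, one possible variation is to read it in chunks and aggregate each chunk right away. This is only a sketch and not something Modin does for you: the function name chunked_group_sum and the chunk size are made up for illustration, and it assumes that a per-key sum of column "2" is all that is needed and that the number of distinct keys fits in memory.

```python
import io

import pandas as pd  # plain pandas here, not modin.pandas


def chunked_group_sum(source, chunksize=1_000_000):
    """Sum column "2" per value of column "1", reading the CSV in chunks."""
    partial_sums = []
    reader = pd.read_csv(source, header=None, names=["1", "2"], chunksize=chunksize)
    for chunk in reader:
        # Reduce each chunk immediately so only the small per-key sums are kept.
        partial_sums.append(chunk.groupby("1")["2"].sum())
    # A key can span two chunks, so combine the partial results once more.
    return pd.concat(partial_sums).groupby(level=0).sum()


# Tiny in-memory stand-in for the real file.
sample = io.StringIO("a,1\na,2\nb,3\nb,4\nc,5\n")
result = chunked_group_sum(sample, chunksize=2)
print(result)
```

For the real file you would pass the path ("./file") instead of the StringIO object. Because the input is already sorted by key, each key appears in at most a few neighbouring chunks, so the partial results stay small.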
Thank you for using Modin!
Upvotes: 1