I work with large amounts of data which I process using TensorFlow Dataset (TFDS) and save to a pandas.DataFrame. My goal is to convert the data from one format to another for further analysis, but when I create a DataFrame with a large number of columns (~8500), my RAM fills up quickly and the process terminates with an out-of-memory error.
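For scale, a rough back-of-the-envelope estimate for the frame built in the code below, assuming float64 cells (the row and column counts are the ones from that code):
rows, cols, bytes_per_cell = 162_078, 8_500, 8  # figures from the code below; float64 assumed
print(f"~{rows * cols * bytes_per_cell / 1e9:.1f} GB")  # roughly 11 GB before pandas makes any copies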
Current code:
import numpy as np
import tensorflow as tf
import pandas as pd
from tqdm import tqdm
datapoint_indices = [x[0] for x in filtered_ranking_table]
# Empty DataFrame to store results
column_names = ["class"]
column_names += [f'datapoint_{i}' for i in datapoint_indices]
# df = pd.DataFrame(columns=column_names)
# max_rows = 114003 # or some other upper limit
# df = pd.DataFrame({name: [None] * 162078 for name in column_names})
# Trying to create a DataFrame with a fixed number of rows
# max_rows = 114003 # Row limit
# df = pd.DataFrame(index=range(max_rows), columns=column_names)
df = pd.DataFrame({name: [np.nan] * 162078 for name in column_names})
for datapoint_n, clusters in tqdm(dataset.take(114003), total=114003):
    if datapoint_n.numpy() in datapoint_indices:
        prev_index = len(df)  # Current length of df
        for i, cluster in enumerate(clusters):
            cluster = cluster.numpy()
            cluster = [x for x in cluster if x != 0]  # drop zero padding
            df.loc[prev_index:prev_index + len(cluster) - 1, 'class'] = i
            df.loc[prev_index:prev_index + len(cluster) - 1, f'datapoint_{datapoint_n.numpy()}'] = pd.Series(cluster, index=range(prev_index, prev_index + len(cluster)))
            prev_index += len(cluster)
df = df.dropna(how='all')
df = df.astype({"class": int})
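For comparison, here is a minimal sketch of the same loop restructured so that nothing wide stays in RAM: each datapoint becomes a small long-format frame (datapoint, class, value) that is flushed to disk in batches, in the spirit of the incremental-writing question below. It reuses dataset and datapoint_indices from above; the batch size, the output filename and the long layout are illustrative assumptions, not part of my original code.
import pandas as pd
from tqdm import tqdm

batch, batch_size, first_write = [], 1000, True
for datapoint_n, clusters in tqdm(dataset.take(114003), total=114003):
    n = int(datapoint_n.numpy())
    if n not in datapoint_indices:
        continue
    for i, cluster in enumerate(clusters):
        values = [x for x in cluster.numpy() if x != 0]  # drop zero padding
        batch.append(pd.DataFrame({"datapoint": n, "class": i, "value": values}))
    if len(batch) >= batch_size:
        pd.concat(batch, ignore_index=True).to_csv(
            "clusters_long.csv", mode="w" if first_write else "a",
            header=first_write, index=False)
        batch, first_write = [], False
if batch:  # flush whatever is left over
    pd.concat(batch, ignore_index=True).to_csv(
        "clusters_long.csv", mode="w" if first_write else "a",
        header=first_write, index=False)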
What I've tried so far:
- A DataFrame with a fixed number of rows (max_rows) and a dynamic number of columns (datapoint_indices).
- A for loop to fill the data column by column in blocks, as in the "Current code" above, which helps for a small number of columns but fails for 8500+ columns due to lack of RAM.
Questions:
- How can I save the data directly to a file (Parquet, CSV or HDF5) instead of loading it into RAM?
Any tips on optimisation or approaches to save the data directly to a file would be appreciated.
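On the Parquet option specifically, one common pattern is to open a pyarrow.parquet.ParquetWriter once and append one row group per batch, so only the current batch is ever in memory. A minimal sketch; the batches iterable and the filename are assumptions (e.g. the small DataFrames produced by the batching sketch above):
import pyarrow as pa
import pyarrow.parquet as pq

writer = None
for batch_df in batches:  # `batches`: any iterable of small DataFrames
    table = pa.Table.from_pandas(batch_df, preserve_index=False)
    if writer is None:
        # create the file once, using the first batch's schema
        writer = pq.ParquetWriter("clusters.parquet", table.schema)
    writer.write_table(table)  # appends one row group; only this batch is held in memory
if writer is not None:
    writer.close()
Every batch has to share the same schema, which is one more reason a long (datapoint, class, value) layout is easier to stream than ~8500 wide columns.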