lurum28

Problem with RAM when creating a DataFrame with a large number of columns from a TensorFlow Dataset

I work with large amounts of data, which I process with a TensorFlow Dataset (TFDS) and then save to a pandas.DataFrame. My goal is to convert the data from one format to another for further analysis. However, when I create a DataFrame with a large number of columns (~8500), RAM fills up quickly and the process is terminated with an out-of-memory error.
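For a sense of scale, here is a back-of-the-envelope estimate of the dense footprint. It is only a sketch and assumes every cell ends up as an 8-byte float64, using the preallocated row count from the code and roughly 8500 datapoint columns plus "class". Building the frame from a dict of Python lists is likely to need a comparable amount again for the intermediate lists.

```python
# Rough size of a fully dense float64 frame (assumption: 8 bytes per cell).
n_rows = 162_078        # preallocated row count used in the code
n_cols = 8_500 + 1      # ~8500 datapoint columns plus the "class" column
bytes_per_cell = 8      # float64

dense_bytes = n_rows * n_cols * bytes_per_cell
dense_gib = dense_bytes / 2**30

print(f"{dense_bytes:,} bytes is about {dense_gib:.1f} GiB")
```

Since each row only has a value in "class" and one datapoint column, almost all of that space would hold NaN.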

Current code:

import numpy as np
import pandas as pd
import tensorflow as tf
from tqdm import tqdm

datapoint_indices = [x[0] for x in filtered_ranking_table]

# Empty DataFrame to store results
column_names = ["class"]
column_names += [f'datapoint_{i}' for i in datapoint_indices]
# df = pd.DataFrame(columns=column_names)
# max_rows = 114003  # or some other upper limit
# df = pd.DataFrame({name: [None] * 162078 for name in column_names})

# Trying to create a DataFrame with a fixed number of rows
# max_rows = 114003  # Row limit
# df = pd.DataFrame(index=range(max_rows), columns=column_names)

df = pd.DataFrame({name: [np.nan] * 162078 for name in column_names})

wanted = set(datapoint_indices)  # set lookup instead of scanning the list each time
next_row = 0  # next free row; len(df) would already point past the preallocated frame

for datapoint_n, clusters in tqdm(dataset.take(114003), total=114003):
    n = int(datapoint_n.numpy())  # plain int so the column name matches column_names (assumes int indices)
    if n not in wanted:
        continue
    for i, cluster in enumerate(clusters):
        cluster = cluster.numpy()
        cluster = cluster[cluster != 0]  # drop zero entries
        if len(cluster) == 0:
            continue
        stop = next_row + len(cluster)
        # .loc slices are label-based and inclusive, hence stop - 1
        df.loc[next_row:stop - 1, 'class'] = i
        df.loc[next_row:stop - 1, f'datapoint_{n}'] = cluster
        next_row = stop

df = df.dropna(how='all')
df = df.astype({"class": int})
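One possible way to avoid the mostly empty wide frame is a "long" layout with one row per value and three columns (datapoint, class, value). The sketch below is only an illustration under assumptions: `to_long_frame` and the toy `records` are hypothetical stand-ins, and with the real dataset each element would first need `.numpy()` as in the loop above.

```python
import numpy as np
import pandas as pd


def _concat(parts, dtype):
    """Concatenate a list of arrays, returning an empty array if the list is empty."""
    return np.concatenate(parts) if parts else np.array([], dtype=dtype)


def to_long_frame(records, wanted):
    """Build a long-format frame from (datapoint_id, clusters) pairs.

    `clusters` is a sequence of 1-D arrays in which 0 marks an unused slot.
    """
    wanted = set(wanted)
    ids, classes, values = [], [], []
    for datapoint_id, clusters in records:
        if datapoint_id not in wanted:
            continue
        for class_idx, cluster in enumerate(clusters):
            cluster = np.asarray(cluster)
            kept = cluster[cluster != 0]  # drop zero entries
            ids.append(np.full(kept.size, datapoint_id, dtype=np.int64))
            classes.append(np.full(kept.size, class_idx, dtype=np.int64))
            values.append(kept)
    return pd.DataFrame({
        "datapoint": _concat(ids, np.int64),
        "class": _concat(classes, np.int64),
        "value": _concat(values, np.float64),
    })


# Toy stand-in for the dataset: datapoint 1 is not in `wanted` and is skipped.
records = [
    (0, [[1, 2, 0], [3, 0, 0]]),
    (1, [[9, 9, 9]]),
    (2, [[4, 0, 0], [5, 6, 0]]),
]
long_df = to_long_frame(records, wanted=[0, 2])
print(long_df)
```

In this layout memory grows with the number of non-zero values rather than with rows times columns. If a wide table is needed later, it could be produced for a subset of datapoints at a time.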

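What I've tried so far: different ways of creating the DataFrame up front (see the commented-out lines in the code above). None of them avoided the memory problem.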

Questions:

  1. How can this process be optimised to reduce memory consumption?
  2. Is there any way to write the data directly to a file (like Parquet, CSV or HDF5) instead of loading it into RAM?
  3. What approaches can help with this amount of data and number of columns?

Any tips on optimisation or approaches to save the data directly to a file would be appreciated.
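To make the second question concrete, here is a minimal sketch of writing to disk chunk by chunk so that only one chunk is in RAM at a time. It uses CSV because that needs nothing beyond pandas. The helper `append_chunks_to_csv`, the toy chunks, and the three-column (datapoint, class, value) layout are assumptions made for illustration, not part of the code above.

```python
import os
import tempfile

import pandas as pd


def append_chunks_to_csv(chunks, path):
    """Write each DataFrame chunk to `path`, emitting the header only once."""
    rows_written = 0
    for i, chunk in enumerate(chunks):
        first = i == 0
        # The first chunk creates the file; later chunks are appended without a header.
        chunk.to_csv(path, mode="w" if first else "a", header=first, index=False)
        rows_written += len(chunk)
    return rows_written


def toy_chunks():
    """Hypothetical stand-in for per-datapoint frames produced while iterating the dataset."""
    yield pd.DataFrame({"datapoint": [0, 0], "class": [0, 1], "value": [1.5, 2.5]})
    yield pd.DataFrame({"datapoint": [2], "class": [0], "value": [4.0]})


out_path = os.path.join(tempfile.mkdtemp(), "clusters_long.csv")
n_rows = append_chunks_to_csv(toy_chunks(), out_path)

round_trip = pd.read_csv(out_path)
print(n_rows, round_trip.shape)
```

The same write-one-chunk-at-a-time pattern should carry over to Parquet, for example with pyarrow's ParquetWriter and its write_table method, provided every chunk shares one schema.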
