I work with large amounts of data which I process using TensorFlow Dataset (TFDS) and save to a pandas.DataFrame. My goal is to convert the data from one format to another for further analysis, but when I create a DataFrame with a large number of columns (~8500), my RAM fills up quickly and the process terminates with an out-of-memory error.
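For scale, a rough back-of-the-envelope estimate for the frame built in the code below, assuming float64 cells (the row and column counts are the ones from that code):
rows, cols, bytes_per_cell = 162_078, 8_500, 8  # figures from the code below; float64 assumed
print(f"~{rows * cols * bytes_per_cell / 1e9:.1f} GB")  # roughly 11 GB before pandas makes any copies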
Current code:
import numpy as np
import tensorflow as tf
import pandas as pd
from tqdm import tqdm
datapoint_indices = [x[0] for x in filtered_ranking_table]
# Empty DataFrame to store results
column_names = ["class"]
column_names += [f'datapoint_{i}' for i in datapoint_indices]
# df = pd.DataFrame(columns=column_names)
# max_rows = 114003 # or some other upper limit
# df = pd.DataFrame({name: [None] * 162078 for name in column_names})
# Trying to create a DataFrame with a fixed number of rows
# max_rows = 114003 # Row limit
# df = pd.DataFrame(index=range(max_rows), columns=column_names)
df = pd.DataFrame({name: [np.nan] * 162078 for name in column_names})
for datapoint_n, clusters in tqdm(dataset.take(114003), total=114003):
    if datapoint_n.numpy() in datapoint_indices:
        prev_index = len(df)  # Current length of df
        for i, cluster in enumerate(clusters):
            cluster = cluster.numpy()
            cluster = [x for x in cluster if x != 0]  # drop zero padding
            df.loc[prev_index:prev_index + len(cluster) - 1, 'class'] = i
            df.loc[prev_index:prev_index + len(cluster) - 1, f'datapoint_{datapoint_n.numpy()}'] = pd.Series(cluster, index=range(prev_index, prev_index + len(cluster)))
            prev_index += len(cluster)
df = df.dropna(how='all')
df = df.astype({"class": int})
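For comparison, here is a minimal sketch of the same loop restructured so that nothing wide stays in RAM: each datapoint becomes a small long-format frame (datapoint, class, value) that is flushed to disk in batches, in the spirit of the incremental-writing question below. It reuses dataset and datapoint_indices from above; the batch size, the output filename and the long layout are illustrative assumptions, not part of my original code.
import pandas as pd
from tqdm import tqdm

batch, batch_size, first_write = [], 1000, True
for datapoint_n, clusters in tqdm(dataset.take(114003), total=114003):
    n = int(datapoint_n.numpy())
    if n not in datapoint_indices:
        continue
    for i, cluster in enumerate(clusters):
        values = [x for x in cluster.numpy() if x != 0]  # drop zero padding
        batch.append(pd.DataFrame({"datapoint": n, "class": i, "value": values}))
    if len(batch) >= batch_size:
        pd.concat(batch, ignore_index=True).to_csv(
            "clusters_long.csv", mode="w" if first_write else "a",
            header=first_write, index=False)
        batch, first_write = [], False
if batch:  # flush whatever is left over
    pd.concat(batch, ignore_index=True).to_csv(
        "clusters_long.csv", mode="w" if first_write else "a",
        header=first_write, index=False)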
What I've tried so far:
- A DataFrame with a fixed number of rows (max_rows) and a dynamic number of columns (datapoint_indices).
- A for loop to fill the data column by column in blocks, as in the "Current code" above, which helps for a small number of columns but fails for 8500+ columns due to lack of RAM.
Questions:
- How can I save the data directly to a file (Parquet, CSV or HDF5) instead of loading it into RAM?
Any tips on optimisation or approaches to save the data directly to a file would be appreciated.
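On the Parquet option specifically, one common pattern is to open a pyarrow.parquet.ParquetWriter once and append one row group per batch, so only the current batch is ever in memory. A minimal sketch; the batches iterable and the filename are assumptions (e.g. the small DataFrames produced by the batching sketch above):
import pyarrow as pa
import pyarrow.parquet as pq

writer = None
for batch_df in batches:  # `batches`: any iterable of small DataFrames
    table = pa.Table.from_pandas(batch_df, preserve_index=False)
    if writer is None:
        # create the file once, using the first batch's schema
        writer = pq.ParquetWriter("clusters.parquet", table.schema)
    writer.write_table(table)  # appends one row group; only this batch is held in memory
if writer is not None:
    writer.close()
Every batch has to share the same schema, which is one more reason a long (datapoint, class, value) layout is easier to stream than ~8500 wide columns.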