Danish Zahid Malik

Reputation: 685

anndata.concat resulting in 4x the size of the individual files causing memory issues

I am new to anndata and would like to know whether an issue I am running into is expected or not.

I have 28 h5ad files from Tabula Sapiens (https://figshare.com/articles/dataset/Tabula_Sapiens_v2/27921984) that I am trying to consolidate into one h5ad file in order to calculate some statistics. I used the code below:

import os
import scanpy as sc
import anndata as ad

output_file = "/full_data/ts_merged.h5ad"  # placeholder output path

# Initialize an empty AnnData object
merged_adata = None
input_files = os.listdir("/full_data/ts_individual_data/")

# Read and concatenate each file
for file in input_files:
    print(f"Processing file: {file}")
    adata = sc.read_h5ad(f"/full_data/ts_individual_data/{file}")
    print(f"Read Done for {file}")

    if merged_adata is None:
        merged_adata = adata
    else:
        merged_adata = ad.concat([merged_adata, adata], axis=0, join='outer', merge='unique')

# Write the merged data to the output file
print(f"Writing merged data to {output_file}")
merged_adata.write_h5ad(output_file)

The total size of the individual tissue h5ads is 53 GB, whereas the size of the combined file is 222 GB.

I have checked for duplicates in the var and obs layers: no duplicates were found. I have also compared expression values for a couple of cell_ids between the combined file and the corresponding individual tissue h5ad: the values match.
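For reference, a minimal spot-check along those lines could look like the sketch below; the file paths and the chosen cell/gene IDs are placeholders, not taken from the actual data:

import scanpy as sc

# Placeholder paths; backed="r" keeps X on disk for the large merged file
merged = sc.read_h5ad("/full_data/ts_merged.h5ad", backed="r")
single = sc.read_h5ad("/full_data/ts_individual_data/some_tissue.h5ad")

# Duplicate checks on cell and gene identifiers in the merged object
print(merged.obs_names.duplicated().any())
print(merged.var_names.duplicated().any())

# Compare expression values for one cell against the individual tissue file
cell_id = single.obs_names[0]
genes = single.var_names[:5]
print(single[cell_id, genes].X)
print(merged[cell_id, genes].X)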

Because of this I'm running into memory issues. What could be done to resolve this? Is my approach to combining the h5ad files correct?
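For comparison, a single ad.concat call over all 28 files at once (rather than growing the result pairwise inside a loop) would look roughly like the sketch below; the paths are placeholders:

import os
import anndata as ad

data_dir = "/full_data/ts_individual_data/"
paths = [os.path.join(data_dir, f) for f in sorted(os.listdir(data_dir))]

# Read every tissue and concatenate once instead of repeatedly merging pairs
adatas = [ad.read_h5ad(p) for p in paths]
merged = ad.concat(adatas, axis=0, join="outer", merge="unique")
merged.write_h5ad("/full_data/ts_merged.h5ad")

Recent anndata versions also provide anndata.experimental.concat_on_disk, which concatenates .h5ad files directly on disk and may keep peak memory lower; whether either variant changes the final file size depends on how the outer join fills in genes that are missing from some tissues.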

Upvotes: 0

Views: 33

Answers (0)
