Danish Zahid Malik

Reputation: 685

anndata.concat resulting in 4x the size of the individual files causing memory issues

I am new to anndata and would like to know whether an issue I am running into is expected or not.

I have 28 h5ad files from Tabula Sapiens (https://figshare.com/articles/dataset/Tabula_Sapiens_v2/27921984) that I am trying to consolidate into one h5ad file in order to calculate some statistics. I used the code below:

import os
import scanpy as sc
import anndata as ad

output_file = "/full_data/ts_merged.h5ad"  # placeholder output path

# Initialize an empty AnnData object
merged_adata = None
input_files = os.listdir("/full_data/ts_individual_data/")

# Read and concatenate each file
for file in input_files:
    print(f"Processing file: {file}")
    adata = sc.read_h5ad(f"/full_data/ts_individual_data/{file}")
    print(f"Read Done for {file}")

    if merged_adata is None:
        merged_adata = adata
    else:
        merged_adata = ad.concat([merged_adata, adata], axis=0, join='outer', merge='unique')

# Write the merged data to the output file
print(f"Writing merged data to {output_file}")
merged_adata.write_h5ad(output_file)

The total size of the individual tissue h5ads is 53 GB, whereas the size of the combined file is 222 GB.

I have checked for duplicates in the var and obs layers: no duplicates were found. I have also compared expression values for a couple of cell_ids between the combined file and the corresponding individual tissue h5ad: the values match.
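For reference, a minimal spot-check along those lines could look like the sketch below; the file paths and the chosen cell/gene IDs are placeholders, not taken from the actual data:

import scanpy as sc

# Placeholder paths; backed="r" keeps X on disk for the large merged file
merged = sc.read_h5ad("/full_data/ts_merged.h5ad", backed="r")
single = sc.read_h5ad("/full_data/ts_individual_data/some_tissue.h5ad")

# Duplicate checks on cell and gene identifiers in the merged object
print(merged.obs_names.duplicated().any())
print(merged.var_names.duplicated().any())

# Compare expression values for one cell against the individual tissue file
cell_id = single.obs_names[0]
genes = single.var_names[:5]
print(single[cell_id, genes].X)
print(merged[cell_id, genes].X)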

Because of this I'm running into memory issues. What could be done to resolve this? Is my approach to combining the h5ad files correct?
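For comparison, a single ad.concat call over all 28 files at once (rather than growing the result pairwise inside a loop) would look roughly like the sketch below; the paths are placeholders:

import os
import anndata as ad

data_dir = "/full_data/ts_individual_data/"
paths = [os.path.join(data_dir, f) for f in sorted(os.listdir(data_dir))]

# Read every tissue and concatenate once instead of repeatedly merging pairs
adatas = [ad.read_h5ad(p) for p in paths]
merged = ad.concat(adatas, axis=0, join="outer", merge="unique")
merged.write_h5ad("/full_data/ts_merged.h5ad")

Recent anndata versions also provide anndata.experimental.concat_on_disk, which concatenates .h5ad files directly on disk and may keep peak memory lower; whether either variant changes the final file size depends on how the outer join fills in genes that are missing from some tissues.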

Upvotes: 0

Views: 33

Answers (0)
