Reputation: 685
I am new to anndata and would like to know whether an issue I am running into is expected or not.
I have 28 h5ad files from Tabula Sapiens (https://figshare.com/articles/dataset/Tabula_Sapiens_v2/27921984) that I am trying to consolidate into one h5ad file in order to calculate some statistics. I used the code below:
import os
import anndata as ad
import scanpy as sc

input_dir = "/full_data/ts_individual_data"
output_file = "/full_data/ts_merged.h5ad"  # placeholder path for the combined output

# Initialize an empty AnnData object
merged_adata = None
input_files = os.listdir(input_dir)

# Read and concatenate each file
for file in input_files:
    print(f"Processing file: {file}")
    adata = sc.read_h5ad(os.path.join(input_dir, file))
    print(f"Read Done for {file}")
    if merged_adata is None:
        merged_adata = adata
    else:
        merged_adata = ad.concat([merged_adata, adata], axis=0, join='outer', merge='unique')

# Write the merged data to the output file
print(f"Writing merged data to {output_file}")
merged_adata.write_h5ad(output_file)
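For reference, I believe the same merge could also be written as a single ad.concat call over all 28 objects rather than pairwise (just a sketch, assuming everything fits in memory at once; source_file is only an illustrative label column and reuses input_dir, input_files and output_file from above):

# Single-call variant: read all files first, then concatenate once.
adatas = {file: sc.read_h5ad(os.path.join(input_dir, file)) for file in input_files}
merged_adata = ad.concat(adatas, axis=0, join='outer', merge='unique', label='source_file')
merged_adata.write_h5ad(output_file)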
The total size of the individual tissue h5ad files is 53 GB, whereas the combined file is 222 GB.
I have checked for duplicates in obs and var: no duplicates were found. I have also compared expression values for a couple of cell_ids between one tissue in the combined file and the corresponding individual tissue h5ad: the values match.
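In case it matters, the checks I describe look roughly like this (a sketch, not the exact code I ran; it assumes obs_names/var_names are the cell and gene identifiers and reuses input_dir, input_files and output_file from above; reading the full merged file back may itself need a lot of memory):

import numpy as np
from scipy import sparse
import anndata as ad

merged = ad.read_h5ad(output_file)                              # combined file written above
single = ad.read_h5ad(os.path.join(input_dir, input_files[0]))  # one individual tissue file

# Duplicate checks on cell and gene identifiers
print("duplicate obs:", merged.obs_names.duplicated().any())
print("duplicate var:", merged.var_names.duplicated().any())

# Spot-check expression values for a few cells/genes shared with one tissue
cells = list(single.obs_names[:3])
genes = list(single.var_names[:3])
a = merged[cells, genes].X
b = single[cells, genes].X
a = a.toarray() if sparse.issparse(a) else np.asarray(a)
b = b.toarray() if sparse.issparse(b) else np.asarray(b)
print("values match:", np.allclose(a, b))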
Because of the size increase I am running into memory issues. What could be done to resolve this? Is this the correct approach to combining the h5ad files?
Upvotes: 0
Views: 33