Olga Botvinnik

Reputation: 1644

Easy way to create an xarray Dataset from metadata + values?

I'm working with single-cell RNA-sequencing data, which these days is 10k-100k samples (cells) x 20k features (genes) of sparse values, plus a lot of metadata, e.g. the tissue of origin ("Brain" vs "Liver"). The metadata is ~10-100 columns, which I store as a pandas.DataFrame. Right now, I'm making an xarray.Dataset by dict-ifying the metadata and adding each column as a coordinate. It seems clunky and error-prone, since I'm copying the snippet between notebooks. Is there an easier way?

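import xarray as xr

# cell_metadata and counts are pandas DataFrames indexed by cell ID
cell_metadata_dict = cell_metadata.to_dict(orient='list')
# attach every metadata column as a coordinate along the 'cell' dimension
coords = {k: ('cell', v) for k, v in cell_metadata_dict.items()}
coords.update(dict(gene=counts.columns, cell=counts.index))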

ds = xr.Dataset(
    {'counts': (['cell', 'gene'], counts),
    },
    coords=coords)

EDIT:

To show some example data, here's the cell_metadata.head().to_csv():

cell,Uniquely mapped reads number,Number of input reads,EXP_ID,TAXON,WELL_MAPPING,Lysis Plate Batch,dNTP.batch,oligodT.order.no,plate.type,preparation.site,date.prepared,date.sorted,tissue,subtissue,mouse.id,FACS.selection,nozzle.size,FACS.instument,Experiment ID ,Columns sorted,Double check,Plate,Location ,Comments,mouse.age,mouse.number,mouse.sex
A1-MAA100140-3_57_F-1-1,428699,502312,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F
A10-MAA100140-3_57_F-1-1,324428,360285,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F
A11-MAA100140-3_57_F-1-1,381310,431800,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F
A12-MAA100140-3_57_F-1-1,393498,446705,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F
A2-MAA100140-3_57_F-1-1,717,918,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F

and counts.iloc[:5, :20].to_csv()

cell,0610005C13Rik,0610007C21Rik,0610007L01Rik,0610007N19Rik,0610007P08Rik,0610007P14Rik,0610007P22Rik,0610008F07Rik,0610009B14Rik,0610009B22Rik,0610009D07Rik,0610009L18Rik,0610009O20Rik,0610010B08Rik,0610010F05Rik,0610010K14Rik,0610010O12Rik,0610011F06Rik,0610011L14Rik,0610012G03Rik
A1-MAA100140-3_57_F-1-1,308,289,81,0,4,88,52,0,0,104,65,0,1,0,9,8,12,283,12,37
A10-MAA100140-3_57_F-1-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
A11-MAA100140-3_57_F-1-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
A12-MAA100140-3_57_F-1-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
A2-MAA100140-3_57_F-1-1,375,325,70,0,2,72,36,13,0,60,105,0,13,0,0,29,15,264,0,65

Re: pandas.DataFrame.to_xarray(): this is incredibly slow, and it seems weird to me to encode so much numeric and categorical data as a 100-level MultiIndex. Also, every time I've tried using a MultiIndex, I've ended up saying "oh, and that's why I don't use MultiIndex" and gone back to keeping separate metadata and counts dataframes.

Upvotes: 2

Views: 1390

Answers (1)

shoyer

Reputation: 9603

By default, xarray uses the pandas index and column labels as coordinate labels. You can convert in a single function call when all your variables share the same dimensions. If different variables have different dimensions, as here (counts are cell x gene, metadata is per cell), you need to convert them from pandas separately and then put them together on the xarray side. For example:

import pandas as pd
import io
import xarray

# read your data
cell_metadata = pd.read_csv(io.StringIO(u"""\
cell,Uniquely mapped reads number,Number of input reads,EXP_ID,TAXON,WELL_MAPPING,Lysis Plate Batch,dNTP.batch,oligodT.order.no,plate.type,preparation.site,date.prepared,date.sorted,tissue,subtissue,mouse.id,FACS.selection,nozzle.size,FACS.instument,Experiment ID ,Columns sorted,Double check,Plate,Location ,Comments,mouse.age,mouse.number,mouse.sex
A1-MAA100140-3_57_F-1-1,428699,502312,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F
A10-MAA100140-3_57_F-1-1,324428,360285,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F
A11-MAA100140-3_57_F-1-1,381310,431800,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F
A12-MAA100140-3_57_F-1-1,393498,446705,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F
A2-MAA100140-3_57_F-1-1,717,918,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F"""))
counts = pd.read_csv(io.StringIO(u"""\
cell,0610005C13Rik,0610007C21Rik,0610007L01Rik,0610007N19Rik,0610007P08Rik,0610007P14Rik,0610007P22Rik,0610008F07Rik,0610009B14Rik,0610009B22Rik,0610009D07Rik,0610009L18Rik,0610009O20Rik,0610010B08Rik,0610010F05Rik,0610010K14Rik,0610010O12Rik,0610011F06Rik,0610011L14Rik,0610012G03Rik
A1-MAA100140-3_57_F-1-1,308,289,81,0,4,88,52,0,0,104,65,0,1,0,9,8,12,283,12,37
A10-MAA100140-3_57_F-1-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
A11-MAA100140-3_57_F-1-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
A12-MAA100140-3_57_F-1-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
A2-MAA100140-3_57_F-1-1,375,325,70,0,2,72,36,13,0,60,105,0,13,0,0,29,15,264,0,65"""))

# build the output
xarray_counts = xarray.DataArray(counts.set_index('cell'), dims=['cell', 'gene'])
xarray_counts.coords.update(cell_metadata.set_index('cell').to_xarray())
print(xarray_counts)

This results in a nice, tidy xarray.DataArray for counts:

<xarray.DataArray (cell: 5, gene: 20)>
array([[308, 289,  81,   0,   4,  88,  52,   0,   0, 104,  65,   0,   1,   0,
          9,   8,  12, 283,  12,  37],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0],
       [375, 325,  70,   0,   2,  72,  36,  13,   0,  60, 105,   0,  13,   0,
          0,  29,  15, 264,   0,  65]])
Coordinates:
  * cell                          (cell) object 'A1-MAA100140-3_57_F-1-1' ...
  * gene                          (gene) object '0610005C13Rik' ...
    Uniquely mapped reads number  (cell) int64 428699 324428 381310 393498 717
    Number of input reads         (cell) int64 502312 360285 431800 446705 918
    EXP_ID                        (cell) object '170928_A00111_0068_AH3YKKDMXX' ...
    TAXON                         (cell) object 'mus' 'mus' 'mus' 'mus' 'mus'
    WELL_MAPPING                  (cell) object 'MAA100140' 'MAA100140' ...
    Lysis Plate Batch             (cell) float64 nan nan nan nan nan
    dNTP.batch                    (cell) float64 nan nan nan nan nan
    oligodT.order.no              (cell) float64 nan nan nan nan nan
    plate.type                    (cell) object 'Biorad 96well' ...
    preparation.site              (cell) object 'Stanford' 'Stanford' ...
    date.prepared                 (cell) float64 nan nan nan nan nan
    date.sorted                   (cell) int64 170720 170720 170720 170720 ...
    tissue                        (cell) object 'Liver' 'Liver' 'Liver' ...
    subtissue                     (cell) object 'Hepatocytes' 'Hepatocytes' ...
    mouse.id                      (cell) object '3_57_F' '3_57_F' '3_57_F' ...
    FACS.selection                (cell) float64 nan nan nan nan nan
    nozzle.size                   (cell) float64 nan nan nan nan nan
    FACS.instument                (cell) float64 nan nan nan nan nan
    Experiment ID                 (cell) float64 nan nan nan nan nan
    Columns sorted                (cell) float64 nan nan nan nan nan
    Double check                  (cell) float64 nan nan nan nan nan
    Plate                         (cell) float64 nan nan nan nan nan
    Location                      (cell) float64 nan nan nan nan nan
    Comments                      (cell) float64 nan nan nan nan nan
    mouse.age                     (cell) int64 3 3 3 3 3
    mouse.number                  (cell) int64 57 57 57 57 57
    mouse.sex                     (cell) object 'F' 'F' 'F' 'F' 'F'
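
Since each metadata column is now a coordinate along `cell`, it can drive selection and grouping directly. Here is a minimal sketch on made-up toy data. The cell names, the second tissue value, and the variable names `toy`, `liver`, and `per_tissue` are illustrative assumptions, not part of the data above.

```python
import numpy as np
import xarray as xr

# Toy stand-in for xarray_counts: 3 hypothetical cells x 2 genes,
# with one metadata column attached as a non-index coordinate.
toy = xr.DataArray(
    np.array([[308, 289], [0, 0], [375, 325]]),
    dims=["cell", "gene"],
    coords={
        "cell": ["cell_a", "cell_b", "cell_c"],
        "gene": ["0610005C13Rik", "0610007C21Rik"],
        "tissue": ("cell", ["Liver", "Brain", "Liver"]),
    },
)

# Keep only the cells whose 'tissue' coordinate is 'Liver'.
is_liver = (toy["tissue"] == "Liver").values
liver = toy.isel(cell=is_liver)

# Sum counts over the cells within each tissue.
per_tissue = toy.groupby("tissue").sum("cell")
```

The same pattern should apply to any other per-cell coordinate, such as `mouse.sex` or `subtissue`.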

If you want a Dataset instead, put the DataArray objects into the Dataset constructor, e.g.,

# shouldn't really need to use .data_vars here, that might be an xarray bug
>>> xarray.Dataset({'counts': xarray.DataArray(counts.set_index('cell'),
...                                            dims=['cell', 'gene'])},
...                coords=cell_metadata.set_index('cell').to_xarray().data_vars)
<xarray.Dataset>
Dimensions:                       (cell: 5, gene: 20)
Coordinates:
  * cell                          (cell) object 'A1-MAA100140-3_57_F-1-1' ...
  * gene                          (gene) object '0610005C13Rik' ...
    Uniquely mapped reads number  (cell) int64 428699 324428 381310 393498 717
    Number of input reads         (cell) int64 502312 360285 431800 446705 918
    EXP_ID                        (cell) object '170928_A00111_0068_AH3YKKDMXX' ...
    TAXON                         (cell) object 'mus' 'mus' 'mus' 'mus' 'mus'
    WELL_MAPPING                  (cell) object 'MAA100140' 'MAA100140' ...
    Lysis Plate Batch             (cell) float64 nan nan nan nan nan
    dNTP.batch                    (cell) float64 nan nan nan nan nan
    oligodT.order.no              (cell) float64 nan nan nan nan nan
    plate.type                    (cell) object 'Biorad 96well' ...
    preparation.site              (cell) object 'Stanford' 'Stanford' ...
    date.prepared                 (cell) float64 nan nan nan nan nan
    date.sorted                   (cell) int64 170720 170720 170720 170720 ...
    tissue                        (cell) object 'Liver' 'Liver' 'Liver' ...
    subtissue                     (cell) object 'Hepatocytes' 'Hepatocytes' ...
    mouse.id                      (cell) object '3_57_F' '3_57_F' '3_57_F' ...
    FACS.selection                (cell) float64 nan nan nan nan nan
    nozzle.size                   (cell) float64 nan nan nan nan nan
    FACS.instument                (cell) float64 nan nan nan nan nan
    Experiment ID                 (cell) float64 nan nan nan nan nan
    Columns sorted                (cell) float64 nan nan nan nan nan
    Double check                  (cell) float64 nan nan nan nan nan
    Plate                         (cell) float64 nan nan nan nan nan
    Location                      (cell) float64 nan nan nan nan nan
    Comments                      (cell) float64 nan nan nan nan nan
    mouse.age                     (cell) int64 3 3 3 3 3
    mouse.number                  (cell) int64 57 57 57 57 57
    mouse.sex                     (cell) object 'F' 'F' 'F' 'F' 'F'
Data variables:
    counts                        (cell, gene) int64 308 289 81 0 4 88 52 0 ...
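
To avoid copying the snippet between notebooks, the conversion above could be wrapped in a small helper. This is only a sketch under stated assumptions: `frames_to_dataset` is a hypothetical name (not an xarray function), both DataFrames are assumed to be indexed by cell ID, and it takes a slightly different route from the constructor call above, promoting the metadata columns with `Dataset.set_coords` and combining with `xarray.merge`.

```python
import pandas as pd
import xarray as xr


def frames_to_dataset(counts, cell_metadata, name="counts"):
    """Combine a cells x genes table with per-cell metadata.

    Hypothetical helper, not part of xarray. Assumes both DataFrames
    are indexed by cell ID.
    """
    data = xr.DataArray(
        counts.rename_axis("cell"), dims=["cell", "gene"], name=name
    )
    meta = cell_metadata.rename_axis("cell").to_xarray()
    # Promote every metadata column from data variable to coordinate.
    meta = meta.set_coords(list(meta.data_vars))
    # join="left" keeps the cells from the counts table.
    return xr.merge([data, meta], join="left")


# Small made-up frames, only to exercise the helper.
toy_counts = pd.DataFrame(
    [[308, 289], [375, 325]],
    index=["cell_a", "cell_b"],
    columns=["0610005C13Rik", "0610007C21Rik"],
)
toy_metadata = pd.DataFrame(
    {"tissue": ["Liver", "Liver"], "mouse.sex": ["F", "F"]},
    index=["cell_a", "cell_b"],
)
ds = frames_to_dataset(toy_counts, toy_metadata)
```

With real data the call would be `frames_to_dataset(counts.set_index('cell'), cell_metadata.set_index('cell'))`, given the frames as read in the example above.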

Upvotes: 2
