frickskit
frickskit

Reputation: 624

Pivot in SQLAlchemy

Imagine a table primary keyed by date and name.

|Date|Name|Value|
|D1  |   A|    2|
|D1  |   B|    3|
|D2  |   A|    3|
|D2  |   C|    1|

How would one write a SQLAlchemy query yielding (date, value) pairs where the values are all of the values for all of the names on that date? The length of the value tuple would be len(distinct(Name)) and missing values would contain some sentinel value (e.g. None).

i.e.
[
(D1, 2, 3, None),
(D2, 3, None, 1),
...
]

I'm not asking for a full answer here in order to get me to approve. If you can tell me what concepts to search for/read in the documentation (or really, point me to anything useful or related), I'll (try to) figure it out myself and post an update.

[No pandas solutions please]

The easy answer is to load the query in long form and do the pivoting in numpy using numpy.searchsorted, but if the submitted answer uses this, I would be interested in some performance diagnostics to compare with the simple solution.

Upvotes: 1

Views: 7467

Answers (1)

van
van

Reputation: 77032

I would recommend not to do it in on the SQL Server side, unless you have a huge amount of data (and even in this case I am not sure it has significant positive performance impact).

Instead, use different tools where they excel:

  • MS-SQL (or other RDBMS) to store data
  • SQLAlchemy to insert/query that data
  • numpy routines or pandas.pivot to pivot the data

The code below is self-contained, running (with sqlite) and shows how to do get to the final result you desire:

from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import sessionmaker
from sqlalchemy.ext.declarative import declarative_base
import pandas as pd

_db_uri = 'sqlite:///:memory:'
engine = create_engine(_db_uri, echo=True)
Session = sessionmaker(bind=engine)

Base = declarative_base()

class MyTable(Base):
    __tablename__ = 'my_table'
    id = Column(Integer, primary_key=True)
    date = Column(String)
    name = Column(String)
    value = Column(Integer)

def _test():
    Base.metadata.drop_all(engine)
    Base.metadata.create_all(engine)
    session = Session()

    def _add_test_data():
        rows = [
            MyTable(date="D1", name="A", value=2),
            MyTable(date="D1", name="B", value=3),
            MyTable(date="D2", name="A", value=3),
            MyTable(date="D2", name="C", value=1),
        ]
        session.add_all(rows)
        session.commit()

    # create test data
    _add_test_data()

    # use `sa` to query data from the database
    # q = session.query(MyTable)  # all columns
    q = session.query(MyTable.date, MyTable.name, MyTable.value)  # explicit

    # read data into pandas directly from the query `q`
    df = pd.read_sql(q.statement, q.session.bind)
    print(df)

    # pivot the results
    df_pivot = df.pivot(index="date", columns="name", values="value")
    print(df_pivot)


if __name__ == '__main__':
    _test()

Upvotes: 9

Related Questions