hopeman

Reputation: 151

dlt forces me to install pyarrow for duckdb

It's a very basic pipeline that loads some data into duckdb with dlt (dlthub).

Error

dlt.pipeline.exceptions.PipelineStepFailed: Pipeline execution failed at stage extract when processing package 1737502505.134826 with exception:

<class 'dlt.common.exceptions.MissingDependencyException'>

You must install additional dependencies to run dlt pyarrow helpers. If you use pip you may do the following:

pip install "dlt[parquet]"

Install pyarrow to be allow to load arrow tables, panda frames and to use parquet files.

I'm not reading data from any local files, so I don't see the point of installing pyarrow. How can I fix this error without installing dlt[parquet] (which I suppose will install pyarrow)?

The code

The requirements file:

duckdb
dlt[duckdb]>=1.5.0
yfinance

Source

import dlt
import yfinance as yf


@dlt.source(name="yahoo")
def source_yahoo(ticker):
    @dlt.resource(primary_key="id", write_disposition="merge")
    def prices_and_dividends():
        # history() returns a pandas DataFrame
        yield yf.Ticker(ticker).history(period="1y")

    yield prices_and_dividends()

Pipeline

def load_prices(source) -> None:
    pipeline = dlt.pipeline(
        pipeline_name="load_prices",
        destination='duckdb',
        dataset_name="test_pipeline",
    )

    load_info = pipeline.run(source)
    print(load_info)  # noqa: T201


load_prices(source_yahoo('BRY'))

secrets.toml

destination.duckdb.credentials="duckdb:///../data/matstock.db" 

Upvotes: -1

Views: 42

Answers (1)

hopeman

Reputation: 151

I finally found the answer. The requirement for dlt[parquet] comes from the source, not from the destination: it is an implicit choice that depends on the type of object the resource yields.

dlt serializes the data to files during processing. How it does so depends on the user settings and on the object returned by the resource. The options are:

Traditional route (Python objects):

  • JSON Lines
  • insert values (SQL INSERT statements)

Arrow route:

  • Arrow objects and Parquet files

In the example above, the resource yields a DataFrame, which dlt handles through the Arrow route, and that route needs pyarrow.

So, the solution is:

pip install "dlt[parquet]"

Or add it to requirements.txt, for example by changing dlt[duckdb]>=1.5.0 to dlt[duckdb,parquet]>=1.5.0.
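
If you would rather not install pyarrow, a possible alternative that follows from the two routes above is to yield plain Python dicts instead of the DataFrame, so the data should take the traditional route. This is only a minimal sketch and has not been tested against this pipeline. frame_to_records is a hypothetical helper name, and it assumes its input behaves like a pandas DataFrame:

```python
from typing import Any, Dict, Iterator


def frame_to_records(frame: Any) -> Iterator[Dict[str, Any]]:
    """Yield one plain dict per row of a pandas-style DataFrame.

    Hypothetical helper: `frame` is assumed to provide reset_index()
    and to_dict(orient="records"), as pandas DataFrames do.
    """
    # reset_index() moves the index (the date index of a yfinance
    # history frame) into a regular column so it is not lost.
    for record in frame.reset_index().to_dict(orient="records"):
        yield record
```

In the resource, that would mean writing yield from frame_to_records(yf.Ticker(ticker).history(period="1y")) instead of yielding the DataFrame directly. Whether dlt infers the same column types this way is something to check before relying on it.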

Upvotes: 0
