Reputation: 151
It's a very basic pipeline that loads some data into DuckDB with dlt (dlthub).
Error
dlt.pipeline.exceptions.PipelineStepFailed: Pipeline execution failed at stage extract when processing package 1737502505.134826 with exception:
<class 'dlt.common.exceptions.MissingDependencyException'>
You must install additional dependencies to run dlt pyarrow helpers. If you use pip you may do the following:
pip install "dlt[parquet]"
Install pyarrow to be allow to load arrow tables, panda frames and to use parquet files.
I'm not reading data from any local files, so I don't see the point of installing pyarrow. How can I fix this error without installing dlt[parquet] (which, I suppose, will pull in pyarrow)?
The code
The requirements file:
duckdb
dlt[duckdb]>=1.5.0
yfinance
Source
import dlt
import yfinance as yf


@dlt.source(name="yahoo")
def source_yahoo(ticker):
    @dlt.resource(primary_key="id", write_disposition="merge")
    def prices_and_dividends():
        yield yf.Ticker(ticker).history(period="1y")

    yield prices_and_dividends()
Pipeline
def load_prices(source) -> None:
    pipeline = dlt.pipeline(
        pipeline_name="load_prices",
        destination="duckdb",
        dataset_name="test_pipeline",
    )
    load_info = pipeline.run(source)
    print(load_info)  # noqa: T201


load_prices(source_yahoo("BRY"))
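For completeness, here is one way to sanity-check what was loaded (this is not part of the original question; the table name follows the resource name above, and the dataset name matches the pipeline config):

import dlt

# Attach to the same pipeline and query the loaded table.
# Assumes load_prices() above has already run.
pipeline = dlt.pipeline(
    pipeline_name="load_prices",
    destination="duckdb",
    dataset_name="test_pipeline",
)
with pipeline.sql_client() as client:
    rows = client.execute_sql("SELECT count(*) FROM prices_and_dividends")
    print(rows)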
secrets.toml
destination.duckdb.credentials="duckdb:///../data/matstock.db"
Upvotes: -1
Views: 42
Reputation: 151
Finally, I found the answer. The requirement for dlt[parquet] comes not from the destination, but from the source: it's an implicit choice that depends on the type of object yielded by the resource.
dlt serializes the data to intermediate files during processing. How this is done depends on the user settings and on the object returned by the resource. It can be:
the traditional route: plain Python objects (dicts, lists), serialized as typed JSONL;
the Arrow route: Arrow tables and pandas DataFrames, serialized as Parquet, which requires pyarrow.
In the example above, the resource yields a pandas DataFrame object, which fits the Arrow route, so pyarrow is required.
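As a side note, if installing pyarrow is really undesirable, one possible workaround (a sketch, not part of the original answer) is to keep the resource on the traditional route by yielding plain Python dicts instead of the DataFrame:

import dlt
import yfinance as yf


@dlt.source(name="yahoo")
def source_yahoo(ticker):
    @dlt.resource(primary_key="id", write_disposition="merge")
    def prices_and_dividends():
        df = yf.Ticker(ticker).history(period="1y")
        # Yield a list of dicts instead of the DataFrame, so dlt
        # serializes the data as typed JSONL (traditional route)
        # and never touches pyarrow.
        yield df.reset_index().to_dict("records")

    yield prices_and_dividends()

The trade-off is losing the efficient Arrow/Parquet path, which matters mostly for large extracts.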
So, the solution is:
pip install "dlt[parquet]"
or add it to requirements.txt.
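In this project, that means updating the requirements file along these lines (pip extras can be combined):

duckdb
dlt[duckdb,parquet]>=1.5.0
yfinance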
Upvotes: 0