Reputation: 80

Langchain UnstructuredDirectoryLoader Timeout error

I'm trying to load a very large, complex PDF that contains tables and figures. It's roughly 600 pages. When I use the fast option with the Unstructured API in LangChain JS with Next.js, it seems to work but misses some necessary data. However, when I use the hi_res option, I get a timeout error. I've tried setting the timeout option to various values, to no avail. I'm perfectly OK with the process taking as much time as it needs. Any help would be very much appreciated.

ERROR:

error TypeError: fetch failed
    at Object.fetch (node:internal/deps/undici/undici:11576:11)
    at UnstructuredLoader._partition (e:/Web-Development/Developing/Nextjs/projects/gpt4-pdf/node_modules/langchain/dist/document_loaders/fs/unstructured.js:139:26)
    at UnstructuredLoader.load (e:/Web-Development/Developing/Nextjs/projects/gpt4-pdf/node_modules/langchain/dist/document_loaders/fs/unstructured.js:154:26)
    at UnstructuredDirectoryLoader.load (e:/Web-Development/Developing/Nextjs/projects/gpt4-pdf/node_modules/langchain/dist/document_loaders/fs/directory.js:80:40)
    at run (e:\Web-Development\Developing\Nextjs\projects\gpt4-pdf\scripts\ingest.ts:48:21)
    at <anonymous> (e:\Web-Development\Developing\Nextjs\projects\gpt4-pdf\scripts\ingest.ts:78:3) {
cause: HeadersTimeoutError: Headers Timeout Error
    at Timeout.onParserTimeout [as callback] (node:internal/deps/undici/undici:9748:32)
    at Timeout.onTimeout [as _onTimeout] (node:internal/deps/undici/undici:8047:17)
    at listOnTimeout (node:internal/timers:573:17)
    at process.processTimers (node:internal/timers:514:7) {
code: 'UND_ERR_HEADERS_TIMEOUT'
 }
}

The code I'm using where the error occurs:

const options = {
    apiKey: process.env.UNSTRUCTURED_API_KEY,
    strategy: "hi_res",
    timeout: 10000, //Tried various from 10000-10000000
};

const unstructuredLoader = new UnstructuredDirectoryLoader(
  filePath,
  options
);

const rawDocs = await unstructuredLoader.load();

Upvotes: 2

Answers (2)

TOMARTISAN

Reputation: 1531

This worked for me when using strategy: "hi_res":

https://github.com/langchain-ai/langchainjs/issues/1856

Upvotes: 0

Yilmaz

Reputation: 49182

This is the type for the options (note that it has no timeout field):

export type UnstructuredLoaderOptions = {
    apiKey?: string;
    apiUrl?: string;
    strategy?: StringWithAutocomplete<UnstructuredLoaderStrategy>;
    encoding?: string;
    ocrLanguages?: Array<string>;
    coordinates?: boolean;
    pdfInferTableStructure?: boolean;
    xmlKeepTags?: boolean;
};
type UnstructuredDirectoryLoaderOptions = UnstructuredLoaderOptions & {
    recursive?: boolean;
    unknown?: UnknownHandling;
};

You should choose a strategy:

 strategy?: StringWithAutocomplete<UnstructuredLoaderStrategy>;

The type of strategy:

 type UnstructuredLoaderStrategy = "hi_res" | "fast" | "ocr_only" | "auto"

Maybe 600 pages is too much for UnstructuredDirectoryLoader with hi_res. Try the fast strategy. From the documentation:

Unstructured document loader allow users to pass in a strategy parameter that lets unstructured know how to partition the document. Currently supported strategies are "hi_res" (the default) and "fast". Hi res partitioning strategies are more accurate, but take longer to process. Fast strategies partition the document more quickly, but trade-off accuracy. Not all document types have separate hi res and fast partitioning strategies. For those document types, the strategy kwarg is ignored. In some cases, the high res strategy will fallback to fast if there is a dependency missing (i.e. a model for document partitioning).
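
If you still want to attempt hi_res first, one option is to fall back to a faster strategy only when the request actually hits the headers timeout. The code UND_ERR_HEADERS_TIMEOUT in the question's stack trace comes from undici, the HTTP client behind Node's built-in fetch, and it sits on the error's cause rather than on the error itself. Below is a minimal sketch under those assumptions; isHeadersTimeout and loadWithFallback are hypothetical helper names, not part of LangChain, and the loader call is passed in as a callback so the sketch does not depend on any particular LangChain version.

```typescript
// Strategy names as listed in the UnstructuredLoaderStrategy type above.
type Strategy = "hi_res" | "fast" | "ocr_only" | "auto";

// Hypothetical helper: walk the `cause` chain looking for undici's
// headers-timeout code (the one shown in the question's stack trace).
function isHeadersTimeout(err: unknown): boolean {
  let current: unknown = err;
  for (let depth = 0; depth < 10 && current != null; depth++) {
    if (typeof current !== "object") return false;
    const { code, cause } = current as { code?: unknown; cause?: unknown };
    if (code === "UND_ERR_HEADERS_TIMEOUT") return true;
    current = cause;
  }
  return false;
}

// Hypothetical helper: try each strategy in order. Only a headers timeout
// triggers the next strategy; any other failure is rethrown immediately.
async function loadWithFallback<T>(
  loadWith: (strategy: Strategy) => Promise<T>,
  strategies: Strategy[] = ["hi_res", "fast"],
): Promise<{ strategy: Strategy; docs: T }> {
  let lastError: unknown = new Error("no strategies given");
  for (const strategy of strategies) {
    try {
      return { strategy, docs: await loadWith(strategy) };
    } catch (err) {
      if (!isHeadersTimeout(err)) throw err;
      lastError = err;
    }
  }
  throw lastError;
}
```

In the question's script, the callback would construct a new UnstructuredDirectoryLoader(filePath, { apiKey, strategy }) and return its load() promise. The returned strategy field tells you which strategy finally produced the documents, so you can log when hi_res was abandoned. This does not make hi_res itself succeed on a 600-page file; it only keeps the ingest script from failing outright.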

Upvotes: 0
