Reputation: 80
I'm trying to load a very large, complex PDF that contains tables and figures. It's roughly 600 pages. When I use the fast option with the Unstructured API in LangChain JS with Next.js, it seems to work but misses some necessary data. However, when I use the hi_res option, I get a timeout error. I've tried setting the timeout option to various values, to no avail. I'm perfectly OK with the process taking as much time as it needs. Any help would be very much appreciated.
ERROR:
error TypeError: fetch failed
at Object.fetch (node:internal/deps/undici/undici:11576:11)
at UnstructuredLoader._partition (e:/Web-Development/Developing/Nextjs/projects/gpt4-pdf/node_modules/langchain/dist/document_loaders/fs/unstructured.js:139:26)
at UnstructuredLoader.load (e:/Web-Development/Developing/Nextjs/projects/gpt4-pdf/node_modules/langchain/dist/document_loaders/fs/unstructured.js:154:26)
at UnstructuredDirectoryLoader.load (e:/Web-Development/Developing/Nextjs/projects/gpt4-pdf/node_modules/langchain/dist/document_loaders/fs/directory.js:80:40)
at run (e:\Web-Development\Developing\Nextjs\projects\gpt4-pdf\scripts\ingest.ts:48:21)
at <anonymous> (e:\Web-Development\Developing\Nextjs\projects\gpt4-pdf\scripts\ingest.ts:78:3) {
cause: HeadersTimeoutError: Headers Timeout Error
at Timeout.onParserTimeout [as callback] (node:internal/deps/undici/undici:9748:32)
at Timeout.onTimeout [as _onTimeout] (node:internal/deps/undici/undici:8047:17)
at listOnTimeout (node:internal/timers:573:17)
at process.processTimers (node:internal/timers:514:7) {
code: 'UND_ERR_HEADERS_TIMEOUT'
}
}
The code I'm using where the error occurs:
const options = {
  apiKey: process.env.UNSTRUCTURED_API_KEY,
  strategy: "hi_res",
  timeout: 10000, // Tried various values from 10000 to 10000000
};
const unstructuredLoader = new UnstructuredDirectoryLoader(
  filePath,
  options
);
const rawDocs = await unstructuredLoader.load();
Upvotes: 2
Views: 780
Reputation: 1531
This worked for me when using strategy: "hi_res":
https://github.com/langchain-ai/langchainjs/issues/1856
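For context, the `UND_ERR_HEADERS_TIMEOUT` in the question's stack trace is raised by undici, the HTTP client behind Node's built-in `fetch`, whose default headers timeout is 300 seconds. One commonly used workaround is to raise or disable that timeout on the global dispatcher before calling the loader. The sketch below assumes the `undici` npm package is installed as a dependency and that the loader goes through Node's global `fetch`, as the trace suggests. It is not confirmed here that this is exactly what the linked issue recommends.

```typescript
// Assumption: `npm install undici` has been run in the project.
import { Agent, setGlobalDispatcher } from "undici";

// A value of 0 disables the timeout. Use a large number of milliseconds
// instead if you would rather keep an upper bound.
setGlobalDispatcher(
  new Agent({
    headersTimeout: 0, // time allowed to wait for response headers
    bodyTimeout: 0, // time allowed between chunks of the response body
  })
);
```

This would go near the top of the ingest script, before `unstructuredLoader.load()` is called, so that the loader's requests pick up the new dispatcher.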
Upvotes: 0
Reputation: 49182
This is the type for options:
export type UnstructuredLoaderOptions = {
  apiKey?: string;
  apiUrl?: string;
  strategy?: StringWithAutocomplete<UnstructuredLoaderStrategy>;
  encoding?: string;
  ocrLanguages?: Array<string>;
  coordinates?: boolean;
  pdfInferTableStructure?: boolean;
  xmlKeepTags?: boolean;
};
type UnstructuredDirectoryLoaderOptions = UnstructuredLoaderOptions & {
  recursive?: boolean;
  unknown?: UnknownHandling;
};
You should choose a strategy:
strategy?: StringWithAutocomplete<UnstructuredLoaderStrategy>;
The strategy type is:
type UnstructuredLoaderStrategy = "hi_res" | "fast" | "ocr_only" | "auto"
Maybe 600 pages is too much for UnstructuredDirectoryLoader. Choose the fast strategy. From the documentation:
Unstructured document loader allow users to pass in a strategy parameter that lets unstructured know how to partition the document. Currently supported strategies are "hi_res" (the default) and "fast". Hi res partitioning strategies are more accurate, but take longer to process. Fast strategies partition the document more quickly, but trade-off accuracy. Not all document types have separate hi res and fast partitioning strategies. For those document types, the strategy kwarg is ignored. In some cases, the high res strategy will fallback to fast if there is a dependency missing (i.e. a model for document partitioning).
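One thing worth noting about the options types quoted in this answer: they declare no `timeout` field, so the `timeout: 10000` in the question is presumably never read by the loader. That would explain why changing it had no effect. A minimal sketch for spotting such keys follows. `findUndeclaredOptionKeys` is a hypothetical helper, not part of LangChain, and the key list is copied from the types as quoted here, so it may differ in other versions.

```typescript
// Field names copied from the UnstructuredLoaderOptions and
// UnstructuredDirectoryLoaderOptions types quoted in this answer.
const KNOWN_OPTION_KEYS: readonly string[] = [
  "apiKey",
  "apiUrl",
  "strategy",
  "encoding",
  "ocrLanguages",
  "coordinates",
  "pdfInferTableStructure",
  "xmlKeepTags",
  "recursive",
  "unknown",
];

// Hypothetical helper: returns the keys of an options object that the
// quoted types do not declare.
function findUndeclaredOptionKeys(options: Record<string, unknown>): string[] {
  return Object.keys(options).filter((key) => !KNOWN_OPTION_KEYS.includes(key));
}

// The options object from the question, with a placeholder API key.
const questionOptions = {
  apiKey: "placeholder-key", // stands in for process.env.UNSTRUCTURED_API_KEY
  strategy: "hi_res",
  timeout: 10000,
};

const undeclared = findUndeclaredOptionKeys(questionOptions);
console.log(undeclared); // → [ 'timeout' ]
```

If `timeout` shows up as undeclared, the request timeout has to be controlled somewhere else, for example at the HTTP client level rather than through the loader options.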
Upvotes: 0