Reputation: 4035
How should I add a field to the metadata of Langchain's Documents?
For example, using the CharacterTextSplitter gives a list of Documents:
const splitter = new CharacterTextSplitter({
  separator: " ",
  chunkSize: 7,
  chunkOverlap: 3,
});
const docs = await splitter.createDocuments([text]);
A document will have the following structure:
{
  "pageContent": "blablabla",
  "metadata": {
    "name": "my-file.pdf",
    "type": "application/pdf",
    "size": 12012,
    "lastModified": 1688375715518,
    "loc": { "lines": { "from": 1, "to": 3 } }
  }
}
I want to add a field to the metadata of each document.
Upvotes: 1
Views: 4568
Reputation: 96
You can use the Document class with the splitDocuments method.
Example:
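// Document comes from LangChain; the exact import path depends on your version.
const docOutput = await splitter.splitDocuments([
  // pageContent and metadata belong in the same options object.
  new Document({ pageContent: text, metadata: { someField: "someValue" } }),
]);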
Upvotes: 0
Reputation: 816
The recommended text splitter documentation doesn't currently show how to do this, but the second argument of createDocuments takes an array of objects, one per input text, whose properties are copied into the metadata of every document produced from that text.
const myMetaData = { url: "https://www.google.com" };
// chunkHeader is a string you define yourself; it is prepended to each chunk.
const documents = await splitter.createDocuments([text], [myMetaData], {
  chunkHeader,
  appendChunkOverlapHeader: true,
});
After this, documents will contain an array, with each element being an object with pageContent and metadata properties. Under metadata, the properties from myMetaData above will also appear. pageContent will also have the text of chunkHeader prepended.
{
  pageContent: <chunkHeader plus the chunk>,
  metadata: <all properties of myMetaData plus loc (text line numbers of chunk)>
}
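To make the pairing concrete, here is a rough sketch of how I understand the metadata array to line up with the input texts. It does not call LangChain at all: createDocumentsSketch, its wordsPerChunk parameter, and the example.com URLs are made up for illustration, and the word-based chunking is only a stand-in for the real splitter.

```javascript
// Simplified stand-in for createDocuments, NOT LangChain's implementation:
// it only illustrates how metadatas[i] is paired with texts[i].
function createDocumentsSketch(texts, metadatas = [], wordsPerChunk = 2) {
  const documents = [];
  texts.forEach((text, i) => {
    const words = text.split(/\s+/).filter(Boolean);
    for (let start = 0; start < words.length; start += wordsPerChunk) {
      documents.push({
        pageContent: words.slice(start, start + wordsPerChunk).join(" "),
        // Each chunk gets its own copy of the metadata for its source text.
        metadata: { ...(metadatas[i] ?? {}) },
      });
    }
  });
  return documents;
}

const sketchDocs = createDocumentsSketch(
  ["alpha beta gamma", "delta epsilon"],
  [{ url: "https://example.com/a" }, { url: "https://example.com/b" }]
);
console.log(sketchDocs.map((d) => d.metadata.url));
```

So with two input texts you would pass two metadata objects, and every chunk cut from the first text carries the first object's properties.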
Upvotes: 1
Reputation: 4035
Ok... just loop over the docs I suppose:
// doc_id is whatever value you want to attach to every document.
for (const doc of docs) {
  doc.metadata.doc_id = doc_id;
}
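If you would rather not mutate the documents in place, a small helper along these lines might work. This is only a sketch: withMetadata is a hypothetical name of my own, not a LangChain function, and the sample objects and the "abc-123" id are placeholders standing in for whatever the splitter returned.

```javascript
// Hypothetical helper (not part of LangChain): returns new objects whose
// metadata has the extra fields merged in; the input documents are untouched.
function withMetadata(docs, extra) {
  return docs.map((doc) => ({
    ...doc,
    metadata: { ...doc.metadata, ...extra },
  }));
}

// Plain objects standing in for the Documents returned by the splitter.
const docs = [
  { pageContent: "bla", metadata: { name: "my-file.pdf" } },
  { pageContent: "blabla", metadata: { name: "my-file.pdf" } },
];

const tagged = withMetadata(docs, { doc_id: "abc-123" });
console.log(tagged[0].metadata);
```

Note that spreading a Document instance gives back a plain object with the same fields, so if something downstream needs real Document instances, the in-place loop is probably the simpler choice.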
Upvotes: 0