firepower1233

Reputation: 1

Is there a specific way to use Langchain to split JSON into chunks?

After implementing RecursiveJsonSplitter, I'm getting IndexError: list index out of range. For context, I'm retrieving the JSON from MongoDB and would rather not keep any local files apart from the database I have to initialize.

This is the JSON retrieval code:

    cursor = collection.find({})
    
    # return json.dumps(cursor, default=json_util.default)

    json_docs = [json.dumps(doc, default=json_util.default) for doc in cursor]
    return json_docs

This is the code I'm working on. It fails at this line: json_chunks = text_splitter.split_json(json_data=documents)

    # Get data from MongoDB
    json_data = import_json_files()

    # Output JSON data to a file
    with open('data.txt', 'w') as f:
        json.dump(json_data, f, ensure_ascii=False, indent=4, sort_keys=True, separators=(',', ': '))

    with open('client_secrets.json') as f:
        secrets = json.load(f)

    os.environ["OPENAI_API_KEY"] = secrets['openai_api_key']

    # Convert and ingest documents into vectorstore
    json_data_str = json.dumps(json_data)
    documents = json.loads(json_data_str)

    if not documents:
        print("No documents found.")
        exit()

    text_splitter = RecursiveJsonSplitter(max_chunk_size=1000)
    json_chunks = text_splitter.split_json(json_data=documents)
    embeddings = OpenAIEmbeddings()
    db = FAISS.from_documents(json_chunks, embeddings)

    retriever = db.as_retriever()

This is the error:

  File "/Users/abcd/Desktop/RAG/simple_example.py", line 38, in <module>
    json_chunks = text_splitter.split_json(json_data=documents)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/langchain_text_splitters/json.py", line 89, in split_json
    chunks = self._json_split(json_data)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/langchain_text_splitters/json.py", line 76, in _json_split
    self._set_nested_dict(chunks[-1], current_path, data)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/langchain_text_splitters/json.py", line 32, in _set_nested_dict
    d[path[-1]] = value
IndexError: list index out of range

I've looked through the LangChain documentation and don't see a way to do this. Is this possible, or am I banging my head against the wall?

Upvotes: 0

Views: 1721

Answers (1)

Heet Shah

Reputation: 1

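The documents variable is a list, whereas RecursiveJsonSplitter.split_json() accepts a single Dict[str, Any]. Note that the retrieval code calls json.dumps on each document, so the list appears to hold JSON strings rather than dicts; parse them back with json.loads before splitting. You can then do either of the following:

  1. Pass convert_lists=True to the split_json method. Lists are then converted into dicts with the indices as keys, and the result is a list of dict chunks.
  2. Use the create_documents method, which takes a list of dicts and returns split LangChain Document objects.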
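
A minimal sketch of the second option, assuming the langchain_text_splitters package is installed and that json_docs is the list of JSON strings returned by the retrieval code in the question. The helper names to_dicts and split_documents are hypothetical, and the sample data is made up. Because create_documents expects dicts, the strings are parsed back with json.loads first. The LangChain import sits inside the function only so that the parsing helper can be used on its own.

```python
import json


def to_dicts(json_docs):
    """Parse JSON strings back into dicts; leave existing dicts untouched."""
    return [json.loads(doc) if isinstance(doc, str) else doc for doc in json_docs]


def split_documents(docs, max_chunk_size=1000):
    """Split a list of dicts into LangChain Document chunks (option 2)."""
    # Imported here so to_dicts() works even without LangChain installed.
    from langchain_text_splitters import RecursiveJsonSplitter

    splitter = RecursiveJsonSplitter(max_chunk_size=max_chunk_size)
    # convert_lists=True also turns nested lists into index-keyed dicts (option 1).
    return splitter.create_documents(texts=docs, convert_lists=True)


# Hypothetical sample standing in for the MongoDB results.
json_docs = ['{"name": "a", "tags": ["x", "y"]}', '{"name": "b", "tags": ["z"]}']
docs = to_dicts(json_docs)
print(docs[0]["name"])
```

The Document objects returned by split_documents(docs) can then be passed to FAISS.from_documents, which expects Documents rather than the plain dicts that split_json returns.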

For more details, see: RecursiveJsonSplitter Documentation

However, watch out for this issue: RecursiveJsonSplitter Retains State Across Invocations Due to Mutable Default Argument
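
The issue title points to a general Python pitfall: a mutable default argument is created once, when the function is defined, and is then shared by every call. A minimal illustration, using hypothetical functions rather than the library's actual code:

```python
def collect_buggy(item, chunks=[{}]):
    """The default list is built once, so every call mutates the same object."""
    chunks[-1][item] = True
    return chunks


def collect_fixed(item, chunks=None):
    """Using None as the default gives each call a fresh list."""
    if chunks is None:
        chunks = [{}]
    chunks[-1][item] = True
    return chunks


first = collect_buggy("a")
second = collect_buggy("b")   # state from the first call leaks into this one
fresh_a = collect_fixed("a")
fresh_b = collect_fixed("b")  # independent of the previous call
print(second)
print(fresh_b)
```

Whether a given installed version of the splitter is affected would need to be checked against the linked issue.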

Upvotes: 0
