Reputation: 361
I'm trying to read a git repo and parse its files. To do that, I load the files with the following code:
import os

from git import Repo
from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers import LanguageParser


def get_git_code_documents(git_url: str, git_name: str):
    # Clone the repo only if it is not already on disk.
    if not os.path.exists(git_name):
        repo = Repo.clone_from(git_url, git_name)
        # branch = repo.head.main
    else:
        print("Repo already exists")
    # Load every matching file under the repo directory and parse it.
    loader = GenericLoader.from_filesystem(
        git_name,
        glob="**/*",
        suffixes=[".py", ".md", ".sh", ".java"],
        parser=LanguageParser(),
    )
    documents = loader.load()
    return documents
But I'm getting the following error:
  File "/../LLMs/codebase_openai/codebase/lib/python3.11/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 600, in _run_script
    exec(code, module.__dict__)
  File "/../LLMs/codebase_openai/app.py", line 56, in <module>
    main()
  File "/../LLMs/codebase_openai/app.py", line 29, in main
    git_documents = get_git_code_documents(git_url, git_name)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/../LLMs/codebase_openai/llmUtils.py", line 28, in get_git_code_documents
    documents = loader.load()
                ^^^^^^^^^^^^^
  File "/../LLMs/codebase_openai/codebase/lib/python3.11/site-packages/langchain_core/document_loaders/base.py", line 29, in load
    return list(self.lazy_load())
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/../LLMs/codebase_openai/codebase/lib/python3.11/site-packages/langchain_community/document_loaders/generic.py", line 116, in lazy_load
    yield from self.blob_parser.lazy_parse(blob)
  File "/../LLMs/codebase_openai/codebase/lib/python3.11/site-packages/langchain_community/document_loaders/parsers/language/language_parser.py", line 214, in lazy_parse
    if not segmenter.is_valid():
           ^^^^^^^^^^^^^^^^^^^^
  File "/../LLMs/codebase_openai/codebase/lib/python3.11/site-packages/langchain_community/document_loaders/parsers/language/tree_sitter_segmenter.py", line 30, in is_valid
    language = self.get_language()
               ^^^^^^^^^^^^^^^^^^^
  File "/../LLMs/codebase_openai/codebase/lib/python3.11/site-packages/langchain_community/document_loaders/parsers/language/java.py", line 26, in get_language
    return get_language("java")
           ^^^^^^^^^^^^^^^^^^^^
  File "tree_sitter_languages/core.pyx", line 14, in tree_sitter_languages.core.get_language
I installed tree-sitter and tree-sitter-languages, but I'm still getting the error.
The interesting thing is that the error seems to occur only when I add .java to the suffixes list. If I don't include .java, the code runs fine.
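To narrow this down, the last call in the traceback can be tried on its own, outside Streamlit and LangChain. The following is a minimal sketch, not a definitive diagnosis: `probe_language` is a hypothetical helper name, and it only assumes that `tree_sitter_languages` exposes `get_language`, as the traceback indicates.

```python
def probe_language(name: str = "java"):
    """Try to load one tree-sitter grammar and return (ok, detail)."""
    try:
        # Same module the traceback ends in (tree_sitter_languages/core.pyx).
        from tree_sitter_languages import get_language
    except ImportError as exc:
        return False, f"tree_sitter_languages not importable: {exc}"
    try:
        get_language(name)
    except Exception as exc:  # report whatever the library raises
        return False, f"{type(exc).__name__}: {exc}"
    return True, f"loaded grammar for {name!r}"


if __name__ == "__main__":
    ok, detail = probe_language("java")
    print(ok, detail)
```

If this fails in the same way, the problem is likely in the tree-sitter packages themselves rather than in the loader code above.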
Any suggestions?
Upvotes: 1
Views: 632
Reputation: 160
I resolved the same issue by downgrading tree-sitter to 0.21.3.
See this bug: https://github.com/grantjenks/py-tree-sitter-languages/issues/64
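As a rough sketch, the installed version can be checked before loading, on the assumption (taken from this answer, not verified here) that releases newer than 0.21.3 are the ones that fail. The names `KNOWN_GOOD`, `parse_version`, `is_newer_than_known_good` and `check_tree_sitter` are hypothetical helpers for illustration.

```python
from importlib import metadata
from itertools import takewhile

KNOWN_GOOD = "0.21.3"  # version reported as working in this answer


def parse_version(text: str) -> tuple:
    """Turn '0.21.3' into (0, 21, 3), keeping only leading digits per part."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = "".join(takewhile(str.isdigit, piece))
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def is_newer_than_known_good(installed: str) -> bool:
    """True if the installed version is above the known-good pin."""
    return parse_version(installed) > parse_version(KNOWN_GOOD)


def check_tree_sitter() -> str:
    """Describe the installed tree-sitter version relative to the pin."""
    try:
        installed = metadata.version("tree-sitter")
    except metadata.PackageNotFoundError:
        return "tree-sitter is not installed"
    if is_newer_than_known_good(installed):
        # Assumed fix from this answer: pip install tree-sitter==0.21.3
        return f"tree-sitter {installed} is newer than {KNOWN_GOOD}; consider pinning"
    return f"tree-sitter {installed} is at or below {KNOWN_GOOD}"


if __name__ == "__main__":
    print(check_tree_sitter())
```

The downgrade itself would typically be done with `pip install tree-sitter==0.21.3` in the same virtual environment the app runs in.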
Upvotes: 1