Ja4H3ad

Reputation: 61

LangChain / OpenAI issue with format of text file from webscraping causing API call to fail for "maximum context length"

I'm attempting to use Retrieval Augmented Generation with LangChain's TextLoader and CharacterTextSplitter. My source data is text that I've scraped from a customer's website. Scraped without preprocessing, the data is dirty: it contains Unicode escape sequences and non-breaking spaces (NBSP), as well as line breaks ('\n'). Here is my webscraping script:

import re
import html
import requests
from bs4 import BeautifulSoup

for procedure_name in procedure_names:
    url = f"{base_url}{procedure_name}/"
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, "html.parser")
        content_div = soup.find("div", class_="col-lg-8")
        if content_div:
            article = content_div.find("article", class_="singlepostinner")
            if article:
                paragraphs = article.find_all("p")
                procedure_content = "\n".join(p.get_text() for p in paragraphs)  # Extract content from paragraphs

                # Cleaning process
                cleaned_content = re.sub(r'[^\w\s]', '', procedure_content)  # Remove punctuation
                cleaned_content = cleaned_content.replace('\n', ' ')  # Remove line breaks
                cleaned_content = re.sub(r'\\u[0-9a-fA-F]+', '', cleaned_content)  # Remove literal \uXXXX escape sequences
                cleaned_content = cleaned_content.replace('\xa0', ' ')  # Replace non-breaking space with regular space
                # Convert remaining HTML entities (e.g. &trade; for the superscript TM, &nbsp;) back to plain characters
                cleaned_content = html.unescape(cleaned_content)

                # Print the cleaned content for debugging
                print("Cleaned Content:")
                print(cleaned_content)

                # Append the cleaned content to the text file
                with open('webscraped_data.txt', 'a', encoding='utf-8') as txt_file:
                    txt_file.write(cleaned_content + '\n')
            else:
                print(f"No article found for {procedure_name}")
        else:
            print(f"No content div found for {procedure_name}")
    else:
        print(f"Failed to fetch {url}")

Now, when I attempt to run the LangChain chain, built from the examples on their website as well as from other tutorials, this is where I see the error. My LangChain function is below.

from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor


def demo(question):
    '''
    Takes in a question about medical treatments and returns the most relevant
    part of the description. 
    
    Follow the steps below to fill out this function:
    '''
    # PART ONE:
    
    loader = TextLoader('/Users/ty/repos/vee/App/webscraped_data_clean.txt')
    documents = loader.load()
     
    
    # PART TWO
    # Split the document into chunks (you choose how and what size)
    text_splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=1000)
    docs = text_splitter.split_documents(documents)
    
    # PART THREE
    # EMBED THE Documents (now in chunks) to a persisted ChromaDB
    embedding_function = OpenAIEmbeddings()
    db = Chroma.from_documents(docs, embedding_function, persist_directory='./Users/ty/repos/vee/App/.solution')
    db.persist()  # call persist() so the vector store is actually written to disk
     

    # PART FOUR
    # Use ChatOpenAI and ContextualCompressionRetriever to return the most
    # relevant part of the documents.
    llm = ChatOpenAI(temperature=0)
    compressor = LLMChainExtractor.from_llm(llm)
    compression_retriever = ContextualCompressionRetriever(base_compressor=compressor,
                                                           base_retriever=db.as_retriever())
    compressed_docs = compression_retriever.get_relevant_documents(question)

     

    print(compressed_docs[0].page_content)




In order to troubleshoot, I used the standard US_Constitution.txt file that is used on the LangChain site, as well as in other tutorials. This, of course, works. I also measured the number of tokens to troubleshoot further:

For the US Constitution doc:

with open('some_data/US_Constitution.txt', 'r', encoding='utf-8') as text_file:
    content = text_file.read()
    estimated_tokens = estimate_tokens(content)
    print(f"Estimated tokens in the content: {estimated_tokens}")

I get 7497 tokens.
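
(estimate_tokens is just a small tiktoken helper, roughly along these lines:)

import tiktoken

def estimate_tokens(text):
    # Rough token count using the cl100k_base encoding
    encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))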

With my webscraped data, using the same approach, I get 9145. So clearly, both documents surpass the 4096 token limit. But my thinking was that since I vectorized the data, my local data would be searched FIRST semantically for the question ("what is rosacea") BEFORE anything is sent to OpenAI. With the Constitution document, this appears to be the case. However, with my webscraped document, it seems the entire corpus is being sent, not just the semantically similar portion of text.
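
As a sanity check on the splitter, I can print the size of each chunk it produces:

# How many chunks did CharacterTextSplitter actually produce, and how big are they?
# (docs is the list returned by text_splitter.split_documents(documents) inside demo())
for idx, chunk in enumerate(docs):
    print(f"Chunk {idx}: {len(chunk.page_content)} characters")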

Has anyone run across this issue with webscraped text, and if so, how did you fix it? Please also note that I clipped the file down to 852 tokens and I still get the same error for the API call (maximum context length of 4096 tokens exceeded).
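
For completeness, one way to see what the plain (uncompressed) retriever hands back for the same question (using the db built inside demo()) is:

# Query the base retriever directly, before any LLM compression,
# to see how much text is actually being retrieved
raw_docs = db.as_retriever().get_relevant_documents("what is rosacea")
for idx, d in enumerate(raw_docs):
    print(f"Retrieved doc {idx}: {len(d.page_content)} characters")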

Upvotes: 1

Views: 883

Answers (1)

ZKS

Reputation: 2866

Below these two lines of your code, you can add additional functionality to check the token count of each chunk:

import tiktoken

def count_tokens(text):
    # Count tokens with tiktoken's cl100k_base encoding
    encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

# Your existing code
    text_splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=1000)
    docs = text_splitter.split_documents(documents)
# Here you can check the token count and make sure it's less than 4096
    for idx, doc in enumerate(docs):
        token_count = count_tokens(doc.page_content)
        print(f"Document {idx} has {token_count} tokens.")

        if token_count > 4096:
            print("Warning: This document exceeds the token limit.")

Upvotes: 0
