Reputation: 61
I'm attempting to do Retrieval Augmented Generation using LangChain's TextLoader and CharacterTextSplitter. My source data is text that I've scraped from a customer's website. Scraped without preprocessing, the data is dirty: it contains Unicode escape sequences and non-breaking spaces (NBSP), as well as line breaks ('\n'). Here is my webscraping script:
import html
import re

import requests
from bs4 import BeautifulSoup

for procedure_name in procedure_names:
    url = f"{base_url}{procedure_name}/"
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, "html.parser")
        content_div = soup.find("div", class_="col-lg-8")
        if content_div:
            article = content_div.find("article", class_="singlepostinner")
            if article:
                paragraphs = article.find_all("p")
                procedure_content = "\n".join(p.get_text() for p in paragraphs)  # Extract content from paragraphs

                # Cleaning process
                cleaned_content = re.sub(r'[^\w\s]', '', procedure_content)  # Remove punctuation
                cleaned_content = cleaned_content.replace('\n', ' ')  # Remove line breaks
                cleaned_content = re.sub(r'\\u[0-9a-fA-F]+', '', cleaned_content)  # Remove literal \uXXXX escape sequences
                cleaned_content = cleaned_content.replace('\xa0', ' ')  # Replace non-breaking space with regular space

                # Preserve original web formatting for certain characters such as the superscript "TM"
                cleaned_content = html.unescape(cleaned_content)

                # Print the cleaned content for debugging
                print("Cleaned Content:")
                print(cleaned_content)

                # Append the cleaned content to the text file
                with open('webscraped_data.txt', 'a', encoding='utf-8') as txt_file:
                    txt_file.write(cleaned_content + '\n')
            else:
                print(f"No article found for {procedure_name}")
        else:
            print(f"No content div found for {procedure_name}")
    else:
        print(f"Failed to fetch {url}")
Now, when I run the LangChain chain, built from the examples on their website as well as from other tutorials, this is where I see the error. My LangChain function is below.
def demo(question):
    '''
    Takes in a question about medical treatments and returns the most relevant
    part of the description.
    Follow the steps below to fill out this function:
    '''
    # PART ONE:
    loader = TextLoader('/Users/ty/repos/vee/App/webscraped_data_clean.txt')
    documents = loader.load()

    # PART TWO
    # Split the document into chunks (you choose how and what size)
    text_splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=1000)
    docs = text_splitter.split_documents(documents)

    # PART THREE
    # EMBED THE Documents (now in chunks) to a persisted ChromaDB
    embedding_function = OpenAIEmbeddings()
    db = Chroma.from_documents(docs, embedding_function, persist_directory='./Users/ty/repos/vee/App/.solution')
    db.persist()

    # PART FOUR
    # Use ChatOpenAI and ContextualCompressionRetriever to return the most
    # relevant part of the documents.
    llm = ChatOpenAI(temperature=0)
    compressor = LLMChainExtractor.from_llm(llm)
    compression_retriever = ContextualCompressionRetriever(base_compressor=compressor,
                                                           base_retriever=db.as_retriever())
    compressed_docs = compression_retriever.get_relevant_documents(question)
    print(compressed_docs[0].page_content)
To troubleshoot, I used the standard US_Constitution.txt file that appears on their site as well as in other tutorials. That, of course, works. I also estimated the number of tokens to troubleshoot further:
For the US Constitution doc:
with open('some_data/US_Constitution.txt', 'r', encoding='utf-8') as text_file:
    content = text_file.read()

estimated_tokens = estimate_tokens(content)
print(f"Estimated tokens in the content: {estimated_tokens}")
I get 7497 tokens.
With my webscraped data, using the same approach, I get 9145 tokens. So clearly both documents surpass the 4096-token limit. But my thinking was that since I vectorized the data, my local data would be searched FIRST, semantically, for the question ("what is rosacea") BEFORE anything is sent to OpenAI. With the Constitution document this appears to be the case. With my webscraped document, however, it seems the entire corpus is being sent, not just the semantically similar portion of text.
Has anyone run across this issue with webscraped text, and if so, how did you fix it? Please also note that I clipped the file down to 852 tokens and I still get the same error from the API call (exceeded 4096 tokens).
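To see whether the retriever even has smaller pieces to work with, one diagnostic I can run after split_documents (a sketch; it assumes the docs variable from the function above and the tiktoken package) is to print how many chunks the splitter actually produced and how large each one is:

import tiktoken

# Diagnostic sketch: how many chunks did the splitter produce, and how big are they?
# Assumes `docs` is the list returned by text_splitter.split_documents(documents).
encoding = tiktoken.get_encoding("cl100k_base")  # encoding used by gpt-3.5-turbo
print(f"Number of chunks: {len(docs)}")
for i, doc in enumerate(docs):
    print(f"Chunk {i}: {len(encoding.encode(doc.page_content))} tokens")

If this reports a single chunk holding the full ~9145 tokens, then the retriever has nothing smaller than the whole corpus to hand to the model.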
Upvotes: 1
Views: 883
Reputation: 2866
Below these two lines of your code you can add a check of the token count of each chunk:
import tiktoken

def count_tokens(text):
    # cl100k_base is the encoding used by gpt-3.5-turbo
    encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

# Your existing code
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=1000)
docs = text_splitter.split_documents(documents)

# Here you can check the token count and make sure it's less than 4096
for idx, doc in enumerate(docs):
    token_count = count_tokens(doc.page_content)
    print(f"Document {idx} has {token_count} tokens.")
    if token_count > 4096:
        print("Warning: This document exceeds the token limit.")
Upvotes: 0