
In this tutorial, we demonstrate how to build a powerful and intelligent question-answering system by combining the strengths of the Tavily Search API, Chroma, Google Gemini LLMs, and the LangChain framework. The pipeline leverages real-time web search through Tavily, semantic document caching with the Chroma vector store, and contextual response generation through the Gemini model. These tools are integrated through LangChain's modular components, such as RunnableLambda, ChatPromptTemplate, ConversationBufferMemory, and GoogleGenerativeAIEmbeddings. The pipeline goes beyond simple Q&A by introducing a hybrid retrieval mechanism that checks for cached embeddings before invoking fresh web searches. Retrieved documents are intelligently formatted, summarized, and passed through a structured LLM prompt, with attention to source attribution, user history, and confidence scoring. Features such as advanced prompt engineering, sentiment and entity analysis, and dynamic vector-store updates make this pipeline suitable for advanced use cases like research assistance, domain-specific summarization, and intelligent agents.
We install and upgrade a comprehensive set of libraries required to build an advanced AI search assistant. It includes tools for retrieval (tavily-python, chromadb), LLM integration (langchain-google-genai, langchain), data handling (pandas, pydantic), visualization (matplotlib, streamlit), and tokenization (tiktoken). These components form the core foundation for constructing a real-time, context-aware QA system.
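A single setup cell along the following lines covers those packages (the exact command in the notebook may differ; langchain-community is included here because the imports below rely on it):
!pip install --upgrade tavily-python chromadb langchain langchain-community langchain-google-genai pandas pydantic matplotlib streamlit tiktoken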
import os
import getpass
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import json
import time
from typing import List, Dict, Any, Optional
from datetime import datetime
We import essential Python libraries used throughout the notebook. It includes standard libraries for environment variables, secure input, time tracking, and data types (os, getpass, time, typing, datetime). Additionally, it brings in core data science tools like pandas, matplotlib, and numpy for data handling, visualization, and numerical computations, as well as json for parsing structured data.
os.environ["TAVILY_API_KEY"] = getpass.getpass("Enter Tavily API key: ")
if "GOOGLE_API_KEY" not in os.environ:
    os.environ["GOOGLE_API_KEY"] = getpass.getpass("Enter Google API key: ")

import logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s")
logger = logging.getLogger(__name__)
We securely initialize the API keys for Tavily and Google Gemini via hidden prompts, checking the environment first for the Google key so repeated runs don't re-prompt unnecessarily, ensuring safe and repeatable access to external services. We also configure a standardized logging setup using Python's logging module, which helps monitor execution flow and capture debug or error messages throughout the notebook.
from langchain_community.retrievers import TavilySearchAPIRetriever
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_core.output_parsers import BaseOutputParser, StrOutputParser, JsonOutputParser
from langchain_core.prompts import ChatPromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate
from langchain_core.runnables import RunnablePassthrough, RunnableLambda
from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains.summarize import load_summarize_chain
from langchain.memory import ConversationBufferMemory
We import key components from the LangChain ecosystem and its integrations. It brings in the TavilySearchAPIRetriever for real-time web search, Chroma for vector storage, and GoogleGenerativeAI modules for chat and embedding models. Core LangChain modules like ChatPromptTemplate, RunnableLambda, ConversationBufferMemory, and output parsers enable flexible prompt construction, memory handling, and pipeline execution.
class SearchQueryError(Exception):
    """Exception raised for errors in the search query."""
    pass
def format_docs(docs):
    formatted_content = []
    for i, doc in enumerate(docs):
        metadata = doc.metadata
        source = metadata.get('source', 'Unknown source')
        title = metadata.get('title', 'Untitled')
        score = metadata.get('score', 0)
        formatted_content.append(
            f"Document {i+1} [Score: {score:.2f}]:\n"
            f"Title: {title}\n"
            f"Source: {source}\n"
            f"Content: {doc.page_content}\n"
        )
    return "\n\n".join(formatted_content)
We define two essential components for search and document handling. The SearchQueryError class creates a custom exception to manage invalid or failed search queries gracefully. The format_docs function processes a list of retrieved documents by extracting metadata such as title, source, and relevance score and formatting them into a clean, readable string.
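As a quick offline check of the formatter, we can feed it a hand-built Document; the metadata values below are made up purely for illustration:
sample_docs = [
    Document(
        page_content="Breath of the Wild launched alongside the Nintendo Switch in March 2017.",
        metadata={"source": "https://example.com/botw", "title": "BotW overview", "score": 0.92},
    )
]
print(format_docs(sample_docs))
# Document 1 [Score: 0.92]:
# Title: BotW overview
# Source: https://example.com/botw
# Content: Breath of the Wild launched alongside the Nintendo Switch in March 2017.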
class SearchResultsParser(BaseOutputParser):
    def parse(self, text):
        try:
            if isinstance(text, str):
                import re
                import json
                json_match = re.search(r'{.*}', text, re.DOTALL)
                if json_match:
                    json_str = json_match.group(0)
                    return json.loads(json_str)
                return {"answer": text, "sources": [], "confidence": 0.5}
            elif hasattr(text, 'content'):
                return {"answer": text.content, "sources": [], "confidence": 0.5}
            else:
                return {"answer": str(text), "sources": [], "confidence": 0.5}
        except Exception as e:
            logger.warning(f"Failed to parse JSON: {e}")
            return {"answer": str(text), "sources": [], "confidence": 0.5}
The SearchResultsParser class provides a robust method for extracting structured information from LLM responses. It attempts to parse a JSON-like string from the model output, falling back to a plain-text response format if parsing fails. It gracefully handles both string outputs and message objects, ensuring consistent downstream processing. In case of errors, it logs a warning and returns a fallback response containing the raw answer, empty sources, and a default confidence score, enhancing the system's fault tolerance.
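A quick sanity check of the parser with a mock model reply (the JSON string below is fabricated for illustration) exercises both the structured and the fallback paths:
parser = SearchResultsParser()
mock_reply = 'Here is the analysis: {"answer": "2017", "sources": ["doc1"], "confidence": 0.9}'
print(parser.parse(mock_reply))         # {'answer': '2017', 'sources': ['doc1'], 'confidence': 0.9}
print(parser.parse("plain text only"))  # fallback dict with empty sources and confidence 0.5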
class EnhancedTavilyRetriever:
    def __init__(self, api_key=None, max_results=5, search_depth="advanced", include_domains=None, exclude_domains=None):
        self.api_key = api_key
        self.max_results = max_results
        self.search_depth = search_depth
        self.include_domains = include_domains or []
        self.exclude_domains = exclude_domains or []
        self.retriever = self._create_retriever()
        self.previous_searches = []

    def _create_retriever(self):
        try:
            return TavilySearchAPIRetriever(
                api_key=self.api_key,
                k=self.max_results,
                search_depth=self.search_depth,
                include_domains=self.include_domains,
                exclude_domains=self.exclude_domains
            )
        except Exception as e:
            logger.error(f"Failed to create Tavily retriever: {e}")
            raise

    def invoke(self, query, **kwargs):
        if not query or not query.strip():
            raise SearchQueryError("Empty search query")
        try:
            start_time = time.time()
            results = self.retriever.invoke(query, **kwargs)
            end_time = time.time()
            search_record = {
                "timestamp": datetime.now().isoformat(),
                "query": query,
                "num_results": len(results),
                "response_time": end_time - start_time
            }
            self.previous_searches.append(search_record)
            return results
        except Exception as e:
            logger.error(f"Search failed: {e}")
            raise SearchQueryError(f"Failed to perform search: {str(e)}")

    def get_search_history(self):
        return self.previous_searches
The EnhancedTavilyRetriever class is a custom wrapper around the TavilySearchAPIRetriever, adding greater flexibility, control, and traceability to search operations. It supports advanced features like limiting search depth, domain inclusion/exclusion filters, and configurable result counts. The invoke method performs web searches and tracks each query's metadata (timestamp, response time, and result count), storing it for later analysis.
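As a minimal usage sketch (it assumes TAVILY_API_KEY is set as in the earlier cell; the query is illustrative, and the exact metadata fields depend on what Tavily returns), the retriever can be exercised and its history inspected like this:
demo_retriever = EnhancedTavilyRetriever(max_results=3, search_depth="basic")
demo_docs = demo_retriever.invoke("latest Nintendo Switch announcements")  # example query
for doc in demo_docs:
    print(doc.metadata.get("title", "Untitled"), "->", doc.metadata.get("source", "unknown"))
print(demo_retriever.get_search_history())  # one record: timestamp, query, num_results, response_time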
class SearchCache:
    def __init__(self):
        self.embedding_function = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
        self.vector_store = None
        self.text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

    def add_documents(self, documents):
        if not documents:
            return
        try:
            if self.vector_store is None:
                self.vector_store = Chroma.from_documents(
                    documents=documents,
                    embedding=self.embedding_function
                )
            else:
                self.vector_store.add_documents(documents)
        except Exception as e:
            logger.error(f"Failed to add documents to cache: {e}")

    def search(self, query, k=3):
        if self.vector_store is None:
            return []
        try:
            return self.vector_store.similarity_search(query, k=k)
        except Exception as e:
            logger.error(f"Vector search failed: {e}")
            return []
The SearchCache class implements a semantic caching layer that stores and retrieves documents using vector embeddings for efficient similarity search. It uses GoogleGenerativeAIEmbeddings to convert documents into dense vectors and stores them in a Chroma vector database. The add_documents method initializes or updates the vector store, while the search method enables fast retrieval of the most relevant cached documents based on semantic similarity. This reduces redundant API calls and improves response times for repeated or related queries, serving as a lightweight hybrid memory layer in the AI assistant pipeline.
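A small sketch of the cache in isolation (it needs GOOGLE_API_KEY, since documents are embedded with Gemini embeddings; the document content is made up for illustration):
demo_cache = SearchCache()
demo_cache.add_documents([
    Document(page_content="Breath of the Wild was released in March 2017.",
             metadata={"source": "https://example.com/botw", "title": "Release date"})
])
hits = demo_cache.search("when did breath of the wild come out", k=1)
print(hits[0].page_content if hits else "cache miss")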
search_cache = SearchCache()
enhanced_retriever = EnhancedTavilyRetriever(max_results=5)
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

system_template = """You are a research assistant that provides accurate answers based on the search results provided.
Follow these guidelines:
1. Only use the context provided to answer the question
2. If the context doesn't contain the answer, say "I don't have sufficient information to answer this question."
3. Cite your sources by referencing the document numbers
4. Don't make up information
5. Keep the answer concise but complete

Context: {context}
Chat History: {chat_history}
"""
system_message = SystemMessagePromptTemplate.from_template(system_template)

human_template = "Question: {question}"
human_message = HumanMessagePromptTemplate.from_template(human_template)

prompt = ChatPromptTemplate.from_messages([system_message, human_message])
We initialize the core components of the AI assistant: a semantic SearchCache, the EnhancedTavilyRetriever for web-based querying, and a ConversationBufferMemory to retain chat history across turns. It also defines a structured prompt using ChatPromptTemplate, guiding the LLM to act as a research assistant. The prompt enforces strict rules for factual accuracy, context usage, source citation, and concise answering, ensuring reliable and grounded responses.
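Since ChatPromptTemplate only performs string formatting, the assembled prompt can be previewed offline before any model call; the context and question below are placeholders:
preview = prompt.invoke({
    "context": "Document 1 [Score: 0.90]:\nTitle: Example\nSource: example.com\nContent: placeholder text",
    "chat_history": [],
    "question": "What does Document 1 say?",
})
for message in preview.to_messages():
    print(f"[{message.type}] {message.content[:80]}")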
def get_llm(model_name="gemini-2.0-flash", temperature=0.2):  # default model name assumed; swap in any available Gemini chat model
    try:
        return ChatGoogleGenerativeAI(
            model=model_name,
            temperature=temperature,
            convert_system_message_to_human=True,
            top_p=0.95,
            top_k=40,
            max_output_tokens=2048
        )
    except Exception as e:
        logger.error(f"Failed to initialize LLM: {e}")
        raise
output_parser = SearchResultsParser()
We define the get_llm function, which initializes a Google Gemini language model with configurable parameters such as model name, temperature, and decoding settings (e.g., top_p, top_k, and max tokens). It ensures robustness with error handling for failed model initialization. An instance of SearchResultsParser is also created to standardize and structure the LLM's raw responses, enabling consistent downstream processing of answers and metadata.
def plot_search_metrics(search_history):
    if not search_history:
        print("No search history available")
        return
    df = pd.DataFrame(search_history)
    plt.figure(figsize=(12, 6))
    plt.subplot(1, 2, 1)
    plt.plot(range(len(df)), df['response_time'], marker="o")
    plt.title('Search Response Times')
    plt.xlabel('Search Index')
    plt.ylabel('Time (seconds)')
    plt.grid(True)
    plt.subplot(1, 2, 2)
    plt.bar(range(len(df)), df['num_results'])
    plt.title('Number of Results per Search')
    plt.xlabel('Search Index')
    plt.ylabel('Number of Results')
    plt.grid(True)
    plt.tight_layout()
    plt.show()
The plot_search_metrics function visualizes performance trends from past queries using Matplotlib. It converts the search history into a DataFrame and plots two subgraphs: one showing response time per search and the other displaying the number of results returned. This aids in analyzing the system's efficiency and search quality over time, helping developers fine-tune the retriever or identify bottlenecks in real-world usage.
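Because the helper only expects a list of dictionaries shaped like the records EnhancedTavilyRetriever stores, it can be tried without any API calls using synthetic history:
fake_history = [
    {"timestamp": datetime.now().isoformat(), "query": "q1", "num_results": 5, "response_time": 1.2},
    {"timestamp": datetime.now().isoformat(), "query": "q2", "num_results": 3, "response_time": 0.8},
]
plot_search_metrics(fake_history)  # renders both subplots from dummy data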
def retrieve_with_fallback(query):
    cached_results = search_cache.search(query)
    if cached_results:
        logger.info(f"Retrieved {len(cached_results)} documents from cache")
        return cached_results
    logger.info("No cache hit, performing web search")
    search_results = enhanced_retriever.invoke(query)
    search_cache.add_documents(search_results)
    return search_results
def summarize_documents(documents, query):
    llm = get_llm(temperature=0)
    summarize_prompt = ChatPromptTemplate.from_template(
        """Create a concise summary of the following documents related to this query: {query}

        {documents}

        Provide a comprehensive summary that addresses the key points relevant to the query.
        """
    )
    chain = (
        {"documents": lambda docs: format_docs(docs), "query": lambda _: query}
        | summarize_prompt
        | llm
        | StrOutputParser()
    )
    return chain.invoke(documents)
These two functions enhance the assistant's intelligence and efficiency. The retrieve_with_fallback function implements a hybrid retrieval mechanism: it first attempts to fetch semantically relevant documents from the local Chroma cache and, if unsuccessful, falls back to a real-time Tavily web search, caching the new results for future use. Meanwhile, summarize_documents leverages a Gemini LLM to generate concise summaries from retrieved documents, guided by a structured prompt that ensures relevance to the query. Together, they enable low-latency, informative, and context-aware responses.
def advanced_chain(query_engine="enhanced", model="gemini-2.0-flash", include_history=True):  # default model name assumed
    llm = get_llm(model_name=model)
    if query_engine == "enhanced":
        retriever = lambda query: retrieve_with_fallback(query)
    else:
        retriever = enhanced_retriever.invoke

    def chain_with_history(input_dict):
        query = input_dict["question"]
        chat_history = memory.load_memory_variables({})["chat_history"] if include_history else []
        docs = retriever(query)
        context = format_docs(docs)
        result = prompt.invoke({
            "context": context,
            "question": query,
            "chat_history": chat_history
        })
        response = llm.invoke(result)
        memory.save_context({"input": query}, {"output": response.content})
        return response

    return RunnableLambda(chain_with_history) | StrOutputParser()
The advanced_chain function defines a modular, end-to-end reasoning workflow for answering user queries using cached or real-time search. It initializes the specified Gemini model, selects the retrieval strategy (cached fallback or direct search), constructs a response pipeline incorporating chat history (if enabled), formats documents into context, and prompts the LLM using a system-guided template. The chain also logs the interaction in memory and returns the final answer, parsed into clean text. This design enables flexible experimentation with models and retrieval strategies while maintaining conversation coherence.
def analyze_query(query):
    llm = get_llm(temperature=0)
    analysis_prompt = ChatPromptTemplate.from_template(
        """Analyze the following query and provide:
        1. Main topic
        2. Sentiment (positive, negative, neutral)
        3. Key entities mentioned
        4. Query type (factual, opinion, how-to, etc.)

        Query: {query}

        Return the analysis in JSON format with the following structure:
        {{
            "topic": "main topic",
            "sentiment": "sentiment",
            "entities": ["entity1", "entity2"],
            "type": "query type"
        }}
        """
    )
    chain = analysis_prompt | llm | output_parser
    return chain.invoke({"query": query})
print("Advanced Tavily-Gemini Implementation")
print("=" * 50)
query = "what year was breath of the wild released and what was its reception?"
print(f"Query: {query}")

qa_chain = advanced_chain()
We initialize the final components of the intelligent assistant. qa_chain is the assembled reasoning pipeline ready to process user queries using retrieval, memory, and Gemini-based response generation. The analyze_query function performs a lightweight semantic analysis on a query, extracting the main topic, sentiment, entities, and query type using the Gemini model and a structured JSON prompt. The example query, about Breath of the Wild's release and reception, showcases how the assistant is triggered and prepared for full-stack inference and semantic interpretation. The printed heading marks the start of interactive execution.
try:
    print("\nSearching for answer...")
    answer = qa_chain.invoke({"question": query})
    print("\nAnswer:")
    print(answer)

    print("\nAnalyzing query...")
    try:
        query_analysis = analyze_query(query)
        print("\nQuery Analysis:")
        print(json.dumps(query_analysis, indent=2))
    except Exception as e:
        print(f"Query analysis error (non-critical): {e}")
except Exception as e:
    print(f"Error in search: {e}")
history = enhanced_retriever.get_search_history()
print("\nSearch History:")
for i, h in enumerate(history):
    print(f"{i+1}. Query: {h['query']} - Results: {h['num_results']} - Time: {h['response_time']:.2f}s")

print("\nAdvanced search with domain filtering:")
specialized_retriever = EnhancedTavilyRetriever(
    max_results=3,
    search_depth="advanced",
    include_domains=["nintendo.com", "zelda.com"],
    exclude_domains=["reddit.com", "twitter.com"]
)
try:
    specialized_results = specialized_retriever.invoke("breath of the wild sales")
    print(f"Found {len(specialized_results)} specialized results")
    summary = summarize_documents(specialized_results, "breath of the wild sales")
    print("\nSummary of specialized results:")
    print(summary)
except Exception as e:
    print(f"Error in specialized search: {e}")

print("\nSearch Metrics:")
plot_search_metrics(history)
We demonstrate the complete pipeline in action. It performs a search using qa_chain, displays the generated answer, and then analyzes the query for sentiment, topic, entities, and type. It also retrieves the search history and prints each query's text, result count, and response time. Finally, it runs a domain-filtered search focused on Nintendo-related sites, summarizes the results, and visualizes search performance using plot_search_metrics, offering a comprehensive view of the assistant's capabilities in real-time use.
In conclusion, following this tutorial gives users a comprehensive blueprint for creating a highly capable, context-aware, and scalable RAG system that bridges real-time web intelligence with conversational AI. The Tavily Search API lets users directly pull fresh and relevant content from the web. The Gemini LLM adds robust reasoning and summarization capabilities, while LangChain's abstraction layer allows seamless orchestration between memory, embeddings, and model outputs. The implementation includes advanced features such as domain-specific filtering, query analysis (sentiment, topic, and entity extraction), and fallback strategies using a semantic vector cache built with Chroma and GoogleGenerativeAIEmbeddings. Also, structured logging, error handling, and analytics dashboards provide transparency and diagnostics for real-world deployment.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.