In the legal world, professionals are often overwhelmed by a sea of documents. From multi-hundred-page contracts to dense regulatory frameworks, the ability to find specific, accurate information quickly is not just a convenience—it’s a critical necessity. AI, particularly Large Language Models (LLMs), promises a life raft, but building a truly effective legal assistant is a unique and complex challenge.
This article will guide you through the creation of a legal chatbot that goes beyond simple document retrieval. We will develop an AI-powered agent capable of understanding the intricate relationships within contracts and the broader principles of law, delivering nuanced, context-aware responses.
We’ll explore why the legal domain demands more than a basic RAG setup and provide a detailed, step-by-step implementation of a solution using Google’s Gemini Pro, LangGraph, and a Neo4j knowledge graph.
Why Legal Chatbots Are a Special Challenge
Legal questions are not monolithic. They exist on a spectrum, ranging from highly specific to broadly conceptual. This duality is what makes a one-size-fits-all AI solution ineffective.
Consider these two distinct types of queries:
- Specific, Relational Questions: These revolve around the concrete facts and relationships within a specific document or set of documents.
- “What is the governing law for the contract between ‘Company A’ and ‘Company B’?”
- “List all parties that have a contract with ‘Company C’ that is governed by the laws of New York.”
- “Which contracts have a termination clause related to ‘change of control’?”
- Broad, Conceptual Questions: These concern general legal principles, definitions, and regulations that are not tied to a single contract.
- “What makes a contract clause unconscionable?”
- “Explain the ‘Statute of Frauds’.”
- “What are the standard remedies for a breach of contract under the UCC?”
A simple vector store, the backbone of most RAG applications, is well suited to the second type of question. It can find text passages that are semantically similar to the query, making it ideal for explaining broad concepts.
However, a vector store fundamentally fails at the first type. It doesn’t understand that “Company A” is a Party or that “Delaware” is a Jurisdiction that GOVERNS a Contract. It cannot perform relational queries across multiple documents because it lacks a structured understanding of the entities and their connections. To answer “show me all parties connected to contracts governed by Delaware law,” the system needs to know what a “party” is, what “Delaware law” is, and how they relate to a “contract.”
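To make the contrast concrete, here is a minimal sketch of why explicit relationships matter. It stores a few made-up facts as (subject, relation, object) triples and answers the Delaware question by following two hops. The data, the `TRIPLES` list, and the `parties_governed_by` helper are illustrative assumptions, not part of the system built later.

```python
# Toy illustration: contract facts stored as explicit (subject, relation, object) triples.
# All names and data here are made up for the example.
TRIPLES = [
    ("Company A", "IS_PARTY_TO", "Contract 1"),
    ("Company B", "IS_PARTY_TO", "Contract 1"),
    ("Company C", "IS_PARTY_TO", "Contract 2"),
    ("Contract 1", "GOVERNED_BY", "Delaware"),
    ("Contract 2", "GOVERNED_BY", "New York"),
]


def parties_governed_by(jurisdiction, triples=TRIPLES):
    """Return sorted party names whose contracts are governed by `jurisdiction`."""
    # Hop 1: find every contract linked to the jurisdiction.
    contracts = {s for s, rel, o in triples if rel == "GOVERNED_BY" and o == jurisdiction}
    # Hop 2: find every party linked to one of those contracts.
    return sorted(s for s, rel, o in triples if rel == "IS_PARTY_TO" and o in contracts)


print(parties_governed_by("Delaware"))
```

A similarity search over raw text has no equivalent of these two hops, which is the gap a graph database is meant to fill.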
This is where our solution comes in.
Purpose and Scope: A Hybrid Intelligence Chatbot
The legal chatbot we will build is designed to master this dual nature of legal inquiry. It achieves this through a hybrid architecture that combines the strengths of two distinct data storage and retrieval methods:
- A Neo4j Knowledge Graph for Contracts: For answering specific, relational questions, we will model our contract data as a rich, interconnected knowledge graph. Entities like Contract, Party, GoverningLaw, and TerminationCondition become nodes, and their relationships (e.g., IS_PARTY_TO, GOVERNED_BY) become edges. This structure enables powerful and precise queries that can navigate intricate connections.
- A FAISS Vector Store for Laws: For answering broad, conceptual questions, we will build a vector store from a comprehensive legal text (in our case, the Uniform Commercial Code). This allows for fast and efficient semantic search to explain legal principles.
The entire system is orchestrated by an intelligent agent built with LangGraph, which can analyze a user’s question, decide which data source (or combination of sources) is needed, and synthesize the retrieved information into a single, comprehensive answer.
Detailed Implementation Walkthrough
Let’s dive into the code. This entire solution is designed to run in a Google Colab notebook.
Project Setup and Google Drive
First, we install the necessary libraries and mount Google Drive. Mounting a drive is a practical step in Colab that allows us to persist data, like our vector store index, across sessions.
# --- 1. Install all required packages ---
!pip install -U --quiet google-cloud-aiplatform langchain langchain_community langchain-google-vertexai langchain-neo4j langgraph neo4j pydantic pandas beautifulsoup4 sentence-transformers faiss-cpu
# --- 2. Mount Google Drive ---
# This connects your Colab notebook to your Google Drive. We will use this to save
# our FAISS vector store index, which contains the legal knowledge base. This makes
# it persistent, so you don't have to rebuild it every time you run the notebook.
from google.colab import drive
try:
    drive.mount('/content/drive')
    print("Google Drive mounted successfully.")
except Exception as e:
    print(f"Error mounting Google Drive: {e}")
For the next step, you will need to:
- Set up a Google Cloud project and enable the Vertex AI API.
- Create an account and a Neo4j AuraDB instance at neo4j.io.
Authentication, Configuration, and Model Initialization
Here, we authenticate our Google account to use Vertex AI services, configure our project details, and initialize the core components of our AI: the Gemini LLM for generation and an embedding model for vectorization.
# --- 3. Authenticate for Google Cloud and Vertex AI ---
# This block handles authentication with your Google Cloud account, which is required
# to use the Vertex AI service and the Gemini models.
import sys
from google.colab import auth
if "google.colab" in sys.modules:
    try:
        # Authenticates the user in the Colab environment.
        auth.authenticate_user()
        print("Colab user authenticated for Google Cloud Platform.")
    except Exception as e:
        print(f"ERROR: Vertex AI Auth: {e}")
# --- 4. Set up GCP, Model, and Database Configurations ---
import os
from google.cloud import aiplatform
from langchain_google_vertexai import ChatVertexAI, VertexAIEmbeddings
# --- FILL IN YOUR DETAILS BELOW ---
# Your Google Cloud Project Details
GOOGLE_PROJECT_ID = "<your google project id>" # @param {type:"string"}
REGION = "<model region (most often us-east1)>" # @param {type:"string"}
GENERATION_MODEL_ID = "gemini-2.5-pro" # @param {type:"string"}
EMBEDDING_MODEL_ID = "text-embedding-005" # @param {type:"string"}
# --- Configure Remote Neo4j AuraDB Connection ---
# Replace with your own AuraDB credentials for security
NEO4J_URI = "your neo4j database link, e.g. neo4j+s://abcdef.databases.neo4j.io" # @param {type:"string"}
NEO4J_USERNAME = "usually neo4j" # @param {type:"string"}
NEO4J_PASSWORD = "your neo4j database password" # @param {type:"string"}
# --- 5. Initialize Vertex AI and Gemini LLM ---
try:
    # Initialize the Vertex AI SDK
    aiplatform.init(project=GOOGLE_PROJECT_ID, location=REGION)
    print("Vertex AI SDK initialized successfully.")
    # Initialize the Generation LLM (Gemini)
    llm = ChatVertexAI(model_name=GENERATION_MODEL_ID, project=GOOGLE_PROJECT_ID, location=REGION, temperature=0.1, max_retries=2)
    print(f"Generation LLM initialized successfully with model '{GENERATION_MODEL_ID}'.")
    # Initialize the embedding model for our vector store
    embedding_model = VertexAIEmbeddings(model_name=EMBEDDING_MODEL_ID, project=GOOGLE_PROJECT_ID)
    print(f"Embedding Model initialized successfully with model '{EMBEDDING_MODEL_ID}'.")
except Exception as e:
    # Exit if model initialization fails, as the rest of the script depends on it.
    print(f"ERROR: Vertex AI LLM init failed: {e}")
    sys.exit(1)
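Since the configuration cell asks you to paste credentials into the notebook, one optional alternative is to read them from environment variables so they never end up in a saved or shared copy. The sketch below is an assumption about how you might do that; the `read_setting` helper and the variable names are hypothetical and not part of the original notebook.

```python
import os


def read_setting(name, default=None):
    """Return the environment variable `name`, or `default`; fail loudly if neither is set."""
    value = os.environ.get(name, default)
    if value is None or value == "":
        raise RuntimeError(f"Missing required setting: {name}")
    return value


# Hypothetical variable name; you would set it in the environment before this cell runs.
os.environ.setdefault("NEO4J_USERNAME", "neo4j")
username = read_setting("NEO4J_USERNAME")
print(username)
```

Failing early with a clear message is usually easier to debug than a connection error several cells later.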
Loading the Contract Dataset
We use the Contract Understanding Atticus Dataset (CUAD), a rich collection of commercial contracts. For our project, we are not interested in pre-annotated clauses; instead, we load the full text of each contract. This raw text will be the input for our intelligent “Agentic Indexer.”
# Copy dataset from drive and unzip
!cp -r "/content/drive/My Drive/1Blog/llm/legal_chatbot/CUAD_v1.zip" "/content/CUAD_v1.zip"
!unzip -o -q CUAD_v1.zip
# --- 6. Load Full Contract Texts from CUAD Dataset ---
import json
import pandas as pd
with open('CUAD_v1/CUAD_v1.json') as f:
    data = json.load(f)
# We will now process full contracts.
# We create a dictionary where each key is the contract title and the value is the
# concatenated text of all its paragraphs.
contracts_dict = {}
for item in data['data']:
    title = item['title']
    full_text = "\n\n".join([p['context'] for p in item['paragraphs']])
    contracts_dict[title] = full_text
# Convert to a DataFrame for easier handling and iteration.
df_contracts = pd.DataFrame(list(contracts_dict.items()), columns=['title', 'text'])
print(f"Loaded {len(df_contracts)} full contract documents.")
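If you want to check the flattening logic without downloading the dataset, the following sketch runs the same idea on a tiny, made-up dictionary that mimics the SQuAD-style layout the loader above relies on (`data` → `title` → `paragraphs` → `context`). The sample content and the `flatten_contracts` name are assumptions for illustration only.

```python
# A tiny, made-up stand-in for the layout that the loading code above expects.
sample = {
    "data": [
        {
            "title": "EXAMPLE_AGREEMENT",
            "paragraphs": [
                {"context": "This Agreement is made between X and Y.", "qas": []},
                {"context": "This Agreement is governed by Delaware law.", "qas": []},
            ],
        }
    ]
}


def flatten_contracts(dataset):
    """Map each contract title to the text of its paragraphs joined by blank lines."""
    return {
        item["title"]: "\n\n".join(p["context"] for p in item["paragraphs"])
        for item in dataset["data"]
    }


contracts = flatten_contracts(sample)
print(contracts["EXAMPLE_AGREEMENT"])
```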
The Agentic Indexer – Building the Smart Graph
This is the heart of our approach. Instead of just dumping text into a database, we use an AI agent to read and understand each contract first, creating a structured knowledge graph from its analysis.
First, we define the desired output structure using Pydantic models. This instructs the LLM on what type of information to extract and in what format.
# --- 7. Define the Data Structure for Graph Extraction ---
from typing import List, Optional
from pydantic import BaseModel, Field
class Party(BaseModel):
    """Identifies a party (e.g., a company, person) involved in the contract."""
    name: str = Field(description="The full, official name of the party.")
    role: str = Field(description="The role of the party, such as 'Buyer', 'Seller', 'Licensor'.")

class GoverningLaw(BaseModel):
    """Identifies the governing law and its jurisdiction."""
    jurisdiction: str = Field(description="The state or country whose laws govern the contract.")

class TerminationCondition(BaseModel):
    """Describes a specific condition under which the contract can be terminated."""
    description: str = Field(description="A brief summary of the termination clause.")

class ContractGraph(BaseModel):
    """A comprehensive, structured representation of a legal contract's key entities."""
    parties: List[Party]
    governing_law: Optional[GoverningLaw]
    effective_date: Optional[str]
    termination_conditions: List[TerminationCondition]
    summary: str = Field(description="A concise, one-paragraph summary of the contract's purpose.")
Next, we create the Indexer Agent itself. We combine our Gemini model with a prompt that instructs it to act as a legal expert and fill out our ContractGraph Pydantic model.
# --- 8. Create the Agentic Indexer Chain ---
from langchain_core.prompts import ChatPromptTemplate
# Create an LLM instance specifically for structured output
structured_llm = llm.with_structured_output(ContractGraph)
# This prompt instructs the LLM on its role
prompt = ChatPromptTemplate.from_messages([
("system", """
You are an expert legal AI agent specializing in commercial contracts.
Your task is to meticulously analyze a contract text and extract key information into a structured format.
"""),
("human", "Please analyze this contract text:\n\n```\n{contract_text}\n```")
])
indexer_agent = prompt | structured_llm
Finally, we loop through our contracts, run the agent on each one, and use the structured output to populate our Neo4j AuraDB with Cypher queries.
# --- 9. Populate the Remote Neo4j AuraDB with the Rich Graph ---
from langchain_neo4j import Neo4jGraph
import time
try:
    # Establish a connection to the Neo4j database.
    graph = Neo4jGraph(url=NEO4J_URI, username=NEO4J_USERNAME, password=NEO4J_PASSWORD)
    # Clear the database for a clean run. This is useful for development.
    graph.query("MATCH (n) DETACH DELETE n;")
    print("Cleared existing graph data from Neo4j AuraDB.")
    # For a quick demonstration, swap in the commented-out loop header below to process
    # only a small subset (e.g., the first 30 contracts).
    # Processing all 510 contracts can take a significant amount of time and may incur API costs.
    # for index, row in df_contracts.head(30).iterrows():
    for index, row in df_contracts.iterrows():
        print(f"Processing contract ({index+1}/{len(df_contracts)}): {row['title']}...")
        try:
            # Invoke the agent to analyze the contract and get the structured data
            structured_data = indexer_agent.invoke({"contract_text": row['text']})
            print("Contract structured_data: ", structured_data)
            # Begin populating the graph with the extracted data
            # First, create the central Contract node with its summary and effective date
            graph.query("""
                MERGE (c:Contract {title: $title})
                SET c.summary = $summary, c.effective_date = $effective_date
            """, params={
                'title': row['title'],
                'summary': structured_data.summary,
                'effective_date': structured_data.effective_date
            })
            # Create and link Party nodes to the Contract.
            # Note: the role is stored on the Party node, so a party that appears in
            # several contracts keeps only the most recently written role.
            for party in structured_data.parties:
                graph.query("""
                    MATCH (c:Contract {title: $title})
                    MERGE (p:Party {name: $party_name})
                    SET p.role_in_contract = $role
                    MERGE (p)-[:IS_PARTY_TO]->(c)
                """, params={'title': row['title'], 'party_name': party.name, 'role': party.role})
            # Create and link the GoverningLaw node
            if structured_data.governing_law:
                graph.query("""
                    MATCH (c:Contract {title: $title})
                    MERGE (gl:GoverningLaw {jurisdiction: $jurisdiction})
                    MERGE (c)-[:GOVERNED_BY]->(gl)
                """, params={'title': row['title'], 'jurisdiction': structured_data.governing_law.jurisdiction})
            # Create and link TerminationCondition nodes
            for term in structured_data.termination_conditions:
                graph.query("""
                    MATCH (c:Contract {title: $title})
                    MERGE (tc:TerminationCondition {description: $description})
                    MERGE (c)-[:HAS_TERMINATION_CONDITION]->(tc)
                """, params={'title': row['title'], 'description': term.description})
            print(f" Successfully indexed and stored '{row['title']}' in the graph.")
            # We add a small delay to avoid hitting any potential LLM API rate limits.
            time.sleep(3)
        except Exception as e:
            print(f" Failed to process contract '{row['title']}'. Error: {e}")
    # After populating, we refresh the graph schema so the QA chain knows about the new structure.
    graph.refresh_schema()
    print("\nSuccessfully populated Neo4j AuraDB with a rich, interconnected graph.")
    print("Refreshed graph schema.")
    print("New Graph Schema:\n", graph.schema)
except Exception as e:
    print(f"ERROR: Failed to connect or populate Neo4j. Please check your credentials and AuraDB status. {e}")
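The population loop above mixes LLM calls, Cypher text, and database writes. As a rough sketch of how the mapping from extracted data to Cypher could be inspected without a database, the hypothetical helper below turns one extraction into a list of (query, parameters) pairs. It uses a plain dict as a stand-in for the ContractGraph object and, for brevity, leaves out party roles and termination conditions; `build_statements` is an illustrative name, not part of the original code.

```python
def build_statements(title, extraction):
    """Turn one extraction (a plain dict here) into (cypher, params) pairs."""
    statements = [(
        "MERGE (c:Contract {title: $title}) "
        "SET c.summary = $summary, c.effective_date = $effective_date",
        {"title": title, "summary": extraction["summary"],
         "effective_date": extraction.get("effective_date")},
    )]
    # One statement per party: MERGE avoids duplicate Party nodes across contracts.
    for party in extraction.get("parties", []):
        statements.append((
            "MATCH (c:Contract {title: $title}) "
            "MERGE (p:Party {name: $party_name}) "
            "MERGE (p)-[:IS_PARTY_TO]->(c)",
            {"title": title, "party_name": party["name"]},
        ))
    # The governing law is optional, so only emit a statement when it was extracted.
    law = extraction.get("governing_law")
    if law:
        statements.append((
            "MATCH (c:Contract {title: $title}) "
            "MERGE (gl:GoverningLaw {jurisdiction: $jurisdiction}) "
            "MERGE (c)-[:GOVERNED_BY]->(gl)",
            {"title": title, "jurisdiction": law["jurisdiction"]},
        ))
    return statements


# Made-up extraction result for demonstration.
example = {
    "summary": "Made-up supply agreement.",
    "parties": [{"name": "Company A", "role": "Buyer"},
                {"name": "Company B", "role": "Seller"}],
    "governing_law": {"jurisdiction": "Delaware"},
}
for cypher, params in build_statements("EXAMPLE_AGREEMENT", example):
    print(params)
```

Each pair could then be passed to `graph.query(cypher, params=params)`, which keeps the values parameterized instead of interpolating them into the query text.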
Building the Legal Knowledge Base
With our contract graph built, we turn to the second pillar of our system: the legal knowledge base. We utilize a web scraper to extract the full text of UCC Article 2 from the Cornell Law website. The scraper identifies all sub-page links on the main index and visits each one to create a comprehensive text corpus. This text is then chunked, vectorized using our embedding model, and stored in a FAISS vector store that we persist to Google Drive.
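Before the full scraping and indexing code, it may help to see what chunk size and chunk overlap mean in practice. The function below is a deliberately simplified, character-window stand-in written for illustration; LangChain's RecursiveCharacterTextSplitter, which the notebook actually uses, additionally tries to split on natural separators such as paragraph breaks. The `chunk_text` name and the sample string are assumptions.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split `text` into fixed-size character windows that overlap by `chunk_overlap`."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    # Each new window starts `step` characters after the previous one.
    step = chunk_size - chunk_overlap
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.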
# --- 10. Scrape Legal Data, Create and Persist Vector Store ---
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import time
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
def _parse_individual_law_page(url: str) -> List[dict]:
    """
    Helper function to scrape the content from a single law section page.
    Args:
        url: The URL of the individual law section page.
    Returns:
        A list of dictionaries, each containing the source and content of a text block.
    """
    try:
        response = requests.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')
        texts = []
        # The main content of these sub-pages is typically within this div
        content_div = soup.find('div', id='extracted-content')
        if not content_div:
            content_div = soup.find('div', id='main-content') # Fallback selector
        if content_div:
            # Extract the main title of the law section
            page_title_tag = content_div.find('h1', class_='title')
            page_title = page_title_tag.get_text(strip=True) if page_title_tag else "UCC Section"
            # Find all relevant tags that contain the legal text. We look for paragraphs
            # and divs with a 'content' id.
            sections = content_div.find_all(['p', 'div'], id=['content'])
            for tag in sections:
                text = tag.get_text(strip=True)
                if text: # Ensure we don't add empty strings
                    texts.append({'source': f"{page_title} ({url})", 'content': text})
        return texts
    except requests.exceptions.RequestException as e:
        print(f" Error fetching sub-page {url}: {e}")
        return []

def scrape_ucc_article_2(base_url: str) -> List[dict]:
    """
    Scrapes the entire UCC Article 2 from the Cornell Law website by finding all
    section links on the main page and then visiting each one.
    Args:
        base_url: The URL of the main index page for UCC Article 2.
    Returns:
        A list of dictionaries containing all scraped legal text.
    """
    print(f"Starting scrape of main index page: {base_url}")
    try:
        response = requests.get(base_url)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')
        all_legal_texts = []
        links_to_scrape = []
        # Find the main content area and then all `<a>` tags within list items.
        # This is a good way to find the links to individual sections.
        main_content = soup.find('div', id='main-content')
        if main_content:
            for a_tag in main_content.select('ul li a'):
                href = a_tag.get('href')
                if href and href.startswith('/ucc/2/'):
                    # Construct the full, absolute URL and add to our list
                    full_url = urljoin(base_url, href)
                    if full_url not in links_to_scrape:
                        links_to_scrape.append(full_url)
        print(f"Found {len(links_to_scrape)} individual law section links to scrape.")
        # Now, iterate through each link and scrape its content
        for i, link in enumerate(links_to_scrape):
            print(f" Scraping page {i+1}/{len(links_to_scrape)}: {link}")
            # Call the helper function to get the text from the sub-page
            page_texts = _parse_individual_law_page(link)
            if page_texts:
                all_legal_texts.extend(page_texts)
            # Be a good person: wait a moment between requests to avoid overwhelming the server.
            time.sleep(1)
        return all_legal_texts
    except requests.exceptions.RequestException as e:
        print(f"FATAL: Could not fetch the main index page {base_url}. Error: {e}")
        return []
# --- Execute the scraper ---
ucc_base_url = "https://www.law.cornell.edu/ucc/2"
legal_data = scrape_ucc_article_2(ucc_base_url)
print(f"\nScraping complete. Total legal text blocks gathered: {len(legal_data)}")
# Define the path in your Google Drive where the FAISS index will be saved.
DRIVE_FAISS_PATH = "/content/drive/<path to your folder>"
# Convert our scraped dictionary into LangChain Document objects for processing.
docs = [Document(page_content=item['content'], metadata={'source': item['source']}) for item in legal_data]
# Split documents into smaller, semantically meaningful chunks for better retrieval.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=200)
chunked_docs = text_splitter.split_documents(docs)
# Create the vector store using our Vertex AI embedding model and the chunked documents.
vectorstore = FAISS.from_documents(chunked_docs, embedding_model)
# Create the directory in Google Drive if it doesn't exist.
os.makedirs(os.path.dirname(DRIVE_FAISS_PATH), exist_ok=True)
# Save the vector store to the specified path.
vectorstore.save_local(DRIVE_FAISS_PATH)
print(f"Legal vector store has been saved to Google Drive at: '{DRIVE_FAISS_PATH}'")
# Load the retriever from the persisted index on Google Drive.
loaded_vectorstore = FAISS.load_local(DRIVE_FAISS_PATH, embedding_model, allow_dangerous_deserialization=True)
legal_retriever = loaded_vectorstore.as_retriever(search_kwargs={'k': 20}) # Retrieve the top 20 relevant chunks
print("Successfully loaded legal retriever from Google Drive.")
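The cell above always scrapes, embeds, saves, and then reloads the index. Because the stated goal of saving to Google Drive is to avoid rebuilding on every run, one possible refinement is a "load if present, otherwise build" wrapper. The sketch below shows the pattern with a JSON file standing in for the FAISS index folder; `load_or_build` and its callback parameters are hypothetical names introduced here.

```python
import json
import tempfile
from pathlib import Path


def load_or_build(path, build, save, load):
    """Load a cached artifact from `path` if it exists; otherwise build it and save it."""
    path = Path(path)
    if path.exists():
        return load(path), "loaded"
    artifact = build()
    path.parent.mkdir(parents=True, exist_ok=True)
    save(artifact, path)
    return artifact, "built"


# Demo with a JSON file standing in for the persisted index.
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "index" / "demo.json"
    build = lambda: {"chunks": 3}
    save = lambda obj, p: p.write_text(json.dumps(obj))
    load = lambda p: json.loads(p.read_text())
    first = load_or_build(cache, build, save, load)
    second = load_or_build(cache, build, save, load)
    print(first[1], second[1])
```

In the notebook, `build` would wrap the scraping and `FAISS.from_documents` steps, `save` would call `save_local`, and `load` would call `FAISS.load_local` with the same embedding model.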
The Querying Agent with LangGraph
Now we build the user-facing agent that orchestrates everything. We use LangGraph to define a stateful workflow.
First, we define the AgentState, the shared state (the workflow’s memory) that is passed between nodes. It includes a route_decision field that records the router’s initial choice, so that later conditional edges can branch on it.
# --- 11. Define the Querying Agent Graph ---
from typing import List, TypedDict
from langchain_core.messages import BaseMessage, HumanMessage
from langgraph.graph import StateGraph, END
from langchain_neo4j import GraphCypherQAChain
class AgentState(TypedDict):
    messages: List[BaseMessage]
    contract_info: str
    legal_info: str
    generation: str
    route_decision: str
Next, we define our nodes, the functions that represent the agent’s possible actions: the route_question node, nodes for retrieving contract and legal information, and a node that generates the final response.
# --- Agent Nodes ---
# Node 1: Router
# This node uses the LLM to decide the best path and saves that decision to the state.
def route_question(state: AgentState) -> dict:
    """
    Routes the user's question to the appropriate tool or combination of tools
    and updates the 'route_decision' in the agent's state.
    """
    last_message = state['messages'][-1].content
    prompt = ChatPromptTemplate.from_template("""
    You are an expert legal question router. Analyze the user's query and determine if it requires:
    1. Retrieving specific data from a contract graph (e.g., "Who are the parties?", "What is the governing law?"). Respond with 'contract_only'.
    2. Explaining a general legal concept or law from a text database (e.g., "What is a 'statute of frauds'?"). Respond with 'legal_only'.
    3. A combination of both, where a contract detail needs a legal explanation. Respond with 'both'.
    Respond with only one word: 'contract_only', 'legal_only', or 'both'.
    User query: "{question}"
    """)
    router_chain = prompt | llm
    decision = router_chain.invoke({"question": last_message}).content.strip().lower()
    # The router returns a dictionary of state updates, which is the standard
    # pattern for LangGraph nodes.
    if "contract_only" in decision:
        print("Router Decision: 'contract_only'.")
        return {"route_decision": "contract_only"}
    elif "legal_only" in decision:
        print("Router Decision: 'legal_only'.")
        return {"route_decision": "legal_only"}
    else: # Default to 'both' if the model is ambiguous.
        print("Router Decision: 'both'.")
        return {"route_decision": "both"}
# Node 2a: Contract Retriever
contract_qa_chain = GraphCypherQAChain.from_llm(
    graph=graph,
    cypher_llm=llm,
    qa_llm=llm,
    verbose=True,
    validate_cypher=True, # Validates and corrects relationship directions in the generated Cypher.
    allow_dangerous_requests=True # Explicit opt-in: the chain runs LLM-generated Cypher against the database.
)
def get_contract_info(state: AgentState) -> dict:
    """Queries the Neo4j graph for contract-related information."""
    question = state['messages'][-1].content
    print("\n--- EXECUTING CONTRACT QUERY ---")
    response = contract_qa_chain.invoke({"query": question})
    return {"contract_info": response['result']}

# Node 2b: Legal Retriever
def get_legal_info(state: AgentState) -> dict:
    """Performs a similarity search on the legal vector store."""
    question = state['messages'][-1].content
    print("\n--- EXECUTING LEGAL KNOWLEDGE QUERY ---")
    docs = legal_retriever.invoke(question)
    legal_context = "\n\n".join([f"Source: {doc.metadata.get('source', 'N/A')}\nContent: {doc.page_content}" for doc in docs])
    return {"legal_info": legal_context}
# Node 3: Final Response Generator
# This node synthesizes all the retrieved data into a single, cohesive answer.
def generate_response(state: AgentState) -> dict:
    """Generates the final answer by synthesizing retrieved contract and legal info."""
    print("\n--- GENERATING FINAL RESPONSE ---")
    question = state['messages'][-1].content
    contract_info = state.get('contract_info', "No specific information was retrieved from the contract database.")
    legal_info = state.get('legal_info', "No specific information was retrieved from the legal database.")
    prompt = ChatPromptTemplate.from_template("""
    You are a world-class legal assistant AI. Your task is to provide a clear, accurate, and comprehensive answer to the user's question by synthesizing the context provided below.
    **User's Question:**
    {question}
    **Follow these instructions for your response:**
    1. Begin by directly addressing the user's question.
    2. If information from the contract database is available and relevant, present those specific findings first.
    3. Next, if legal context from the knowledge base is available and relevant, use it to explain the broader legal principles that relate to the user's question or the contract findings.
    4. Structure your response for maximum clarity. Use paragraphs, bullet points, and bold text to organize the information.
    5. Cite your sources for sentences that use retrieved information by adding a number at the end, like this. [1] for contract info or [2] for legal info.
    ---
    **CONTEXT FROM CONTRACT DATABASE:**
    {contract_info} [1]
    **CONTEXT FROM LEGAL KNOWLEDGE BASE:**
    {legal_info} [2]
    ---
    **Your Comprehensive Answer:**
    """)
    generation_chain = prompt | llm
    response_content = generation_chain.invoke({
        "question": question,
        "contract_info": contract_info,
        "legal_info": legal_info
    }).content
    return {"generation": response_content}
# --- Flow Edges ---
# After the router runs, this conditional edge decides the first step based on
# the 'route_decision' that was saved in the state.
def decide_first_step(state: AgentState):
    return state["route_decision"]

# Decides whether to get legal info next or to finish and generate a response.
# It checks the decision that was saved in the state by the initial router.
def decide_after_contract(state: AgentState):
    if state["route_decision"] == "both":
        # If the original plan was to get both, proceed to the legal info node.
        return "get_legal_info"
    else:
        # If the original plan was for contract info only, jump straight to the response generator.
        return "generate_response"
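The router above accepts a label whenever it appears as a substring of the model's reply, checking 'contract_only' first. As a sketch of a slightly stricter alternative, the hypothetical `parse_route` helper below prefers an exact match and falls back to 'both' whenever the reply mentions more than one label or none at all. It is an assumption about how one might harden the parsing, not the notebook's implementation.

```python
VALID_ROUTES = ("contract_only", "legal_only", "both")


def parse_route(raw, default="both"):
    """Normalize raw router text to one of VALID_ROUTES, falling back to `default`."""
    # Trim whitespace, surrounding quotes/backticks, and a trailing period.
    cleaned = raw.strip().strip("'\"`.").lower()
    if cleaned in VALID_ROUTES:
        return cleaned
    # Fall back to a substring search, but only accept an unambiguous match.
    hits = [route for route in VALID_ROUTES if route in cleaned]
    return hits[0] if len(hits) == 1 else default


print(parse_route(" 'Legal_Only'. "))
```

Keeping this logic in a small pure function also makes it easy to unit test without calling the LLM.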
With the nodes defined, we assemble them into a graph. This is where the power of LangGraph shines. We define a clear, logical flow with conditional branching, ensuring the agent takes the most efficient path to answer the user’s query.
# --- 12. Assemble and Run the LangGraph Application ---
workflow = StateGraph(AgentState)
# 1. Add all nodes to the graph. Each function is a node.
workflow.add_node("route_question", route_question)
workflow.add_node("get_contract_info", get_contract_info)
workflow.add_node("get_legal_info", get_legal_info)
workflow.add_node("generate_response", generate_response)
# 2. Set the entry point of the graph. The process starts at the router.
workflow.set_entry_point("route_question")
# 3. Define the conditional routing.
workflow.add_conditional_edges(
    "route_question",
    decide_first_step,
    {
        "contract_only": "get_contract_info",
        "legal_only": "get_legal_info",
        "both": "get_contract_info"
    }
)
# 4. Define the paths for "legal_only" and the end of all paths.
# If we only need legal info, we get it and then immediately generate the response.
workflow.add_edge("get_legal_info", "generate_response")
# After the response is generated, the process ends.
workflow.add_edge("generate_response", END)
# 5. Define the conditional path AFTER contract info is retrieved.
workflow.add_conditional_edges(
    "get_contract_info",
    decide_after_contract,
    {
        "get_legal_info": "get_legal_info",
        "generate_response": "generate_response"
    }
)
# --- 13. Compile the graph into a runnable application ---
app = workflow.compile()
print("\nLangGraph application compiled successfully.")
# --- 14. Interact with the Chatbot ---
# This helper function simplifies running queries against our compiled agent.
def run_chatbot(question: str):
    """Helper function to run a query and print the final response."""
    inputs = {"messages": [HumanMessage(content=question)]}
    print(f"\n --- Running Query: '{question}' --- ")
    # The invoke method runs the graph from start to finish with the given input.
    final_state = app.invoke(inputs)
    print("\n--- FINAL RESPONSE --- ")
    print(final_state['generation'])
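To double-check the wiring without calling any model or database, the control flow defined by the edges above can be mirrored in a few lines of plain Python. The `planned_path` function is a hypothetical helper that only reproduces the routing table; it does not use LangGraph.

```python
def planned_path(route_decision):
    """Return the ordered node names visited for a given router decision."""
    path = ["route_question"]
    # 'contract_only' and 'both' start at the contract retriever.
    if route_decision in ("contract_only", "both"):
        path.append("get_contract_info")
    # 'legal_only' goes straight to the legal retriever; 'both' reaches it second.
    if route_decision in ("legal_only", "both"):
        path.append("get_legal_info")
    path.append("generate_response")
    return path


for decision in ("contract_only", "legal_only", "both"):
    print(decision, "->", " -> ".join(planned_path(decision)))
```

Comparing this expected path with the node banners printed during a real run is a quick way to confirm the agent took the intended branch.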
Putting the Legal Agent to the Test
To assess the capabilities of our intelligent legal chatbot, I’ve created a series of questions that span the full spectrum of legal inquiries, from simple fact retrieval and broad conceptual questions to complex, multi-hop relational queries that combine data from multiple sources.
The goal was to verify that the agent could not only answer correctly but also demonstrate why its hybrid architecture, combining a knowledge graph and a vector store, is superior to a monolithic approach.
Simple Fact Retrieval from a Contract
First, we tested the agent’s ability to perform a basic lookup on a single document. This establishes a baseline for retrieval accuracy.
The Question:
“Provide a summary for the ‘INTRICONCORP_03_10_2009-EX-10.22-Strategic Alliance Agreement’ contract.”
Analysis of the Result:
--- Running Query: 'Provide a summary for the 'INTRICONCORP_03_10_2009-EX-10.22-Strategic Alliance Agreement' contract.' ---
Router Decision: 'contract_only'.
--- EXECUTING CONTRACT QUERY ---
> Entering new GraphCypherQAChain chain...
Generated Cypher:
MATCH (c:Contract {title: 'INTRICONCORP_03_10_2009-EX-10.22-Strategic Alliance Agreement'}) RETURN c.summary
Full Context:
[{'c.summary': 'This is a strategic alliance agreement between IntriCon Corporation and Dynamic Hearing Pty Ltd, effective October 1, 2008. Dynamic Hearing grants IntriCon a license to use its technology, software, and documentation to manufacture, sell, and import products in the Hearing Health, Assisted Listening Device, and Professional Communications markets worldwide. The agreement outlines the terms of the license, including exclusivity conditions, payment structures involving access fees and royalties, and mutual obligations for technology support and compliance.'}]
> Finished chain.
--- GENERATING FINAL RESPONSE ---
--- FINAL RESPONSE ---
Here is a summary of the Strategic Alliance Agreement between IntriCon Corporation and Dynamic Hearing Pty Ltd.
This Strategic Alliance Agreement, effective October 1, 2008, establishes a partnership between IntriCon Corporation ("IntriCon") and Dynamic Hearing Pty Ltd ("Dynamic Hearing"). [1] The central purpose of the agreement is for Dynamic Hearing to grant a license to IntriCon for the use of its proprietary technology, software, and documentation. [1]
Key provisions of the agreement include:
* **License Grant:** IntriCon receives a license to use Dynamic Hearing's technology to manufacture, sell, and import products. [1]
* **Scope and Territory:** The license applies to the Hearing Health, Assisted Listening Device, and Professional Communications markets on a worldwide basis. [1]
* **Financial Terms:** The payment structure requires IntriCon to pay Dynamic Hearing both access fees and ongoing royalties. [1]
* **Exclusivity:** The agreement contains specific conditions regarding exclusivity for the licensed technology. [1]
* **Mutual Obligations:** Both parties are bound by mutual obligations, including provisions for technology support and compliance with the terms of the agreement. [1]
This first test was a clear success. The agent correctly performed the following steps:
- Accurate Routing: The Router Decision: 'contract_only' line shows that the agent correctly identified the query as being specific to the contract database, bypassing the legal vector store entirely.
- Efficient Query: The generated Cypher query is direct and highly efficient. It did not need to perform a full-text search; instead, it directly accessed the summary property of the Contract node. This is a direct benefit of our Agentic Indexing process, where the summary was created and stored during data ingestion.
- Synthesized Response: The final answer isn’t just a raw data dump. The LLM took the structured summary and presented it in a clean, human-readable format, complete with a clear topic sentence and bullet points.
Relational Query Across Multiple Contracts
This test was designed to probe the core strength of our knowledge graph—its ability to answer questions that require understanding relationships between different entities.
The Question:
“Which parties are involved in contracts that are governed by the laws of New York?”
Analysis of the Result:
--- Running Query: 'Which parties are involved in contracts that are governed by the laws of New York?' ---
Router Decision: 'contract_only'.
--- EXECUTING CONTRACT QUERY ---
> Entering new GraphCypherQAChain chain...
Generated Cypher:
MATCH (p:Party)-[:IS_PARTY_TO]->(:Contract)-[:GOVERNED_BY]->(gl:GoverningLaw)
WHERE gl.jurisdiction = 'New York'
RETURN p.name
Full Context:
[{'p.name': 'PACIRA PHARMACEUTICALS, INC.'}, {'p.name': 'EKR THERAPEUTICS, INC.'}, {'p.name': 'American International Group, Inc.'}, {'p.name': 'American General Life Insurance Company of Delaware'}, {'p.name': 'Synchron'}, {'p.name': 'Rare Element Resources Ltd.'}, {'p.name': 'American International Group, Inc.'}, {'p.name': 'American General Life Insurance Company'}, {'p.name': 'YourNetPlus.com, Inc.'}, {'p.name': 'Kingdom Connect, Inc.'}]
> Finished chain.
--- GENERATING FINAL RESPONSE ---
--- FINAL RESPONSE ---
Based on the provided contract data, the following parties are involved in contracts governed by the laws of New York [1]:
* PACIRA PHARMACEUTICALS, INC.
* EKR THERAPEUTICS, INC.
* American International Group, Inc.
* American General Life Insurance Company of Delaware
* Synchron
* Rare Element Resources Ltd.
* American General Life Insurance Company
* YourNetPlus.com, Inc.
* Kingdom Connect, Inc.
This result illustrates the importance of a knowledge graph.
- Complex Query Generation: The agent understood the natural language question and translated it into a graph traversal path: find a Party node (p) that IS_PARTY_TO a Contract, which in turn is GOVERNED_BY a GoverningLaw node (gl) where the jurisdiction is 'New York'.
- Deep Relational Insight: This type of relational query is very difficult for a standard vector-search RAG system. A vector store lacks the structured connections to understand that a “party” and a “governing law” are linked through a specific contract.
- Clear Presentation: The agent took the names returned from the database, dropped the duplicate entry for American International Group, Inc., and formatted them into a clean, easy-to-read list for the user.
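To make the traversal concrete, here is a minimal in-memory sketch of the same Party to Contract to GoverningLaw path in plain Python. It is an illustration only: the party and contract names are invented toy data, not the article's dataset, and `parties_under` is a hypothetical helper rather than anything Neo4j provides.

```python
from typing import Dict, List, Set, Tuple

# Toy data, invented for illustration (not the article's dataset).
# Each structure mirrors a relationship type from the generated Cypher.
IS_PARTY_TO: List[Tuple[str, str]] = [
    ("Alpha Corp", "contract-1"),
    ("Beta LLC", "contract-1"),
    ("Alpha Corp", "contract-2"),
    ("Gamma Inc", "contract-2"),
    ("Delta Ltd", "contract-3"),
]
GOVERNED_BY: Dict[str, str] = {
    "contract-1": "New York",
    "contract-2": "New York",
    "contract-3": "Delaware",
}

def parties_under(jurisdiction: str) -> List[str]:
    """Follow Party -> Contract -> GoverningLaw and return distinct party names."""
    seen: Set[str] = set()
    result: List[str] = []
    for party, contract in IS_PARTY_TO:
        # Keep a party only if its contract is governed by the requested law,
        # and only the first time we encounter it.
        if GOVERNED_BY.get(contract) == jurisdiction and party not in seen:
            seen.add(party)
            result.append(party)
    return result

if __name__ == "__main__":
    print(parties_under("New York"))
```

In this toy data, Alpha Corp is party to two New York contracts, so a naive traversal would return it twice, just as American International Group, Inc. appears twice in the raw context above. The sketch removes duplicates in Python; in Cypher, `RETURN DISTINCT p.name` would likely achieve the same at the database level.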
Broad Legal Concept Explanation
Here, we pivot to test the other side of our hybrid system: the legal knowledge base powered by our FAISS vector store.
The Question:
“What does the Uniform Commercial Code say about a ‘Substituted Performance’?”
Analysis of the Result:
--- Running Query: 'What does the Uniform Commercial Code say about a 'Substituted Performance'?' ---
Router Decision: 'legal_only'.
--- EXECUTING LEGAL KNOWLEDGE QUERY ---
--- GENERATING FINAL RESPONSE ---
--- FINAL RESPONSE ---
Based on the provided legal context, here is what the Uniform Commercial Code (UCC) says about 'Substituted Performance'.
Under the Uniform Commercial Code, **Substituted Performance** is a rule that applies when an agreed-upon method of delivery or payment becomes unavailable or commercially impracticable for reasons outside the control of either party. [2] In such cases, if a reasonable alternative exists, the parties are required to use that alternative to fulfill the contract rather than terminating it. [2]
### Contract Database Findings
* No specific information regarding 'Substituted Performance' was retrieved from the contract database. [1]
### Legal Principles from the Uniform Commercial Code
The primary section governing this topic is **UCC § 2-614. Substituted Performance**. This section outlines the requirements for both delivery and payment. [2]
#### **Substituted Delivery**
When the agreed-upon method of delivery becomes unavailable through no fault of either party, a substitute must be used if available. [2] Specifically:
* If the agreed berthing, loading, or unloading facilities fail, or an agreed type of carrier becomes unavailable, or the delivery method is otherwise commercially impracticable, the seller **must tender** a "commercially reasonable substitute." [2]
* The buyer **must accept** this commercially reasonable substitute performance. [2]
The key principle is that a contract should not fail simply because the exact logistical arrangements become impossible, so long as a reasonable commercial alternative exists. [2]
#### **Substituted Payment**
The UCC also addresses situations where the agreed-upon method of payment fails due to government regulation:
* If a payment method fails because of a new domestic or foreign regulation, the seller may withhold or stop delivery until the buyer provides a means of payment that is a "commercially a substantial equivalent." [2]
* If the buyer has already received the goods, their obligation to pay is discharged by paying in the manner specified by the new regulation, unless that regulation is "discriminatory, oppressive or predatory." [2]
#### **Relationship to Excuse by Failure of Presupposed Conditions**
The concept of Substituted Performance is directly linked to, and takes precedence over, the doctrine of commercial impracticability.
* **UCC § 2-615 (Excuse by Failure of Presupposed Conditions)** allows a seller to be excused from performance if it has been made impracticable by an unforeseen contingency. [2]
* However, this rule is explicitly "subject to the preceding section on substituted performance." [2]
* This means a party cannot claim their performance is excused under § 2-615 if a commercially reasonable substitute for the failed delivery or payment method is available under § 2-614. The party must first attempt to use the substitute performance. [2]
This demonstrates the effectiveness of our legal RAG component.
- Correct Routing to Legal: The agent correctly determined this was a conceptual question and routed it to the legal_only path, querying the vector store built from the Cornell Law website.
- Synthesis from Multiple Sources: The final answer is not just a copy-paste of a single retrieved document chunk. It’s a well-structured explanation synthesized from multiple relevant sections of the UCC scraped by our tool. It correctly defines the concept, identifies the governing section (§ 2-614), and breaks down the topic into logical sub-sections for delivery and payment. This shows a high level of comprehension and synthesis.
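The retrieval step behind the legal_only path ranks text chunks by their similarity to the query. The sketch below shows that idea with cosine similarity in plain Python. It is not the FAISS code from this article: the three-dimensional vectors are invented toy values standing in for real embeddings, and `cosine` and `top_k` are hypothetical helper names.

```python
import math
from typing import Dict, List, Tuple

# Toy 3-dimensional "embeddings", invented for illustration. A real system
# would get vectors from an embedding model and index them in FAISS.
CHUNKS: Dict[str, List[float]] = {
    "UCC 2-614 Substituted Performance": [0.9, 0.1, 0.0],
    "UCC 2-615 Excuse by Failure of Presupposed Conditions": [0.7, 0.3, 0.1],
    "UCC 2-201 Statute of Frauds": [0.0, 0.2, 0.9],
}

def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec: List[float], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k chunk titles most similar to the query vector."""
    scored = [(title, cosine(query_vec, vec)) for title, vec in CHUNKS.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

if __name__ == "__main__":
    # A query vector assumed to lie close to the "substituted performance" chunk.
    for title, score in top_k([1.0, 0.1, 0.0]):
        print(f"{score:.3f}  {title}")
```

Returning more than one chunk is what lets the final answer draw on both § 2-614 and the related § 2-615, rather than quoting a single passage.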
Combined Queries and Complex Connection Analysis
These final tests push the agent to its limits, requiring it to use both knowledge sources and handle complex, multi-hop questions.
The “Both” Question:
“What is the governing law for the ‘HERTZGLOBALHOLDINGS,INC_07_07_2016-EX-10.4-INTELLECTUAL PROPERTY AGREEMENT’ and what does the law say about INTELLECTUAL PROPERTY contracts?”
The Complex Connection Question:
“What parties company DEPOMED, INC. is connected to? Is that party connected to other parties apart from DEPOMED, INC.? Which law is governing their connection? And what is their contract about?”
Analysis of the Results:
For the first “both” query, the agent correctly identified that it needed both tools. It first queried the graph to find the governing law (“State of New York”) and then queried the vector store for “intellectual property.” It intelligently noted that while the UCC was available, it primarily covers “goods” and is not the main body of law for intangible IP, demonstrating a nuanced understanding.
--- Running Query: 'What is the governing law for the 'HERTZGLOBALHOLDINGS,INC_07_07_2016-EX-10.4-INTELLECTUAL PROPERTY AGREEMENT' and what does the law say about INTELLECTUAL PROPERTY contracts?' ---
Router Decision: 'both'.
--- EXECUTING CONTRACT QUERY ---
> Entering new GraphCypherQAChain chain...
Generated Cypher:
cypher
MATCH (c:Contract {title: 'HERTZGLOBALHOLDINGS,INC_07_07_2016-EX-10.4-INTELLECTUAL PROPERTY AGREEMENT'})-[:GOVERNED_BY]->(gl:GoverningLaw)
RETURN gl.jurisdiction
Full Context:
[{'gl.jurisdiction': 'State of New York'}]
> Finished chain.
--- EXECUTING LEGAL KNOWLEDGE QUERY ---
--- GENERATING FINAL RESPONSE ---
--- FINAL RESPONSE ---
Based on the provided information, here is a summary of the governing law for the specified agreement and the relevant legal context.
The governing law for the 'HERTZGLOBALHOLDINGS,INC_07_07_2016-EX-10.4-INTELLECTUAL PROPERTY AGREEMENT' is the **State of New York**. [1]
While the provided legal information does not specifically address New York's laws on intellectual property contracts, it does offer context on a body of law called the Uniform Commercial Code (UCC), which has been adopted by New York. However, the UCC's applicability to intellectual property agreements is limited.
* **Scope of the Uniform Commercial Code (UCC):** The provided legal texts are from Article 2 of the UCC, which applies to transactions in "goods". [2]
* **Definition of "Goods":** The UCC defines "goods" as all things that are movable at the time of identification to the contract for sale. [2]
* **Applicability to Intellectual Property:** Intellectual property (such as patents, trademarks, and copyrights) is generally considered intangible property, not movable "goods." Therefore, contracts that are purely for the sale or license of intellectual property are typically governed by state common law of contracts and federal intellectual property law, rather than by UCC Article 2.
In summary, while the contract is governed by New York law, the provided sections of the Uniform Commercial Code concerning the sale of goods would not be the primary legal framework for interpreting this intellectual property agreement. [1, 2] The agreement would instead be interpreted under New York's common law principles for contracts.
The second, more complex query is where the agent truly shines.
--- Running Query: 'What parties company DEPOMED, INC. is connected to? Is that party connected to other parties apart from DEPOMED, INC.?
Which law is governing their connection? And what is their contract about?' ---
Router Decision: 'contract_only'.
--- EXECUTING CONTRACT QUERY ---
> Entering new GraphCypherQAChain chain...
Generated Cypher:
cypher
MATCH (p1:Party {name: 'DEPOMED, INC.'})-[:IS_PARTY_TO]->(c:Contract)<-[:IS_PARTY_TO]-(p2:Party),
(c)-[:GOVERNED_BY]->(gl:GoverningLaw)
WHERE p1 <> p2
WITH p2, c, gl
OPTIONAL MATCH (p2)-[:IS_PARTY_TO]->(other_c:Contract)<-[:IS_PARTY_TO]-(other_p:Party)
WHERE other_p.name <> 'DEPOMED, INC.' AND p2 <> other_p
RETURN p2.name AS ConnectedParty, COLLECT(DISTINCT other_p.name) AS OtherConnectionsOfConnectedParty, gl.jurisdiction AS GoverningLaw, c.title AS ContractTitle, c.summary AS ContractSummary
Full Context:
[{'ConnectedParty': 'KING PHARMACEUTICALS, INC.', 'OtherConnectionsOfConnectedParty': [], 'GoverningLaw': 'State of New York', 'ContractTitle': 'KINGPHARMACEUTICALSINC_08_09_2006-EX-10.1-PROMOTION AGREEMENT', 'ContractSummary': 'This is a Promotion Agreement between Depomed, Inc. and King Pharmaceuticals, Inc. for the co-exclusive promotion and marketing of the Product, a once-daily oral tablet formulation containing metformin, within the United States. The agreement outlines the responsibilities of both parties concerning promotional activities, regulatory compliance, manufacturing, and financial compensation.'}]
> Finished chain.
--- GENERATING FINAL RESPONSE ---
--- FINAL RESPONSE ---
Based on the information provided, here is a summary of the connections for DEPOMED, INC.
**Party Connections and Contract Details**
According to the contract database, DEPOMED, INC. is connected to **KING PHARMACEUTICALS, INC.** [1] There is no information available regarding any other parties connected to KING PHARMACEUTICALS, INC. [1]
* **Governing Law:** The connection between DEPOMED, INC. and KING PHARMACEUTICALS, INC. is governed by the law of the **State of New York**. [1]
* **Contract Subject:** The contract between the two companies is a **Promotion Agreement**. [1] This agreement establishes a co-exclusive arrangement for the promotion and marketing of a specific pharmaceutical product—a once-daily oral tablet containing metformin—within the United States. [1] The contract details the duties of each party regarding promotional activities, regulatory compliance, manufacturing, and financial terms. [1]
The Cypher query generated here is sophisticated. The agent correctly interpreted the multi-part natural language question and translated it into a multi-step graph traversal:
- It finds DEPOMED, INC. and the other party (p2) in a contract.
- It uses OPTIONAL MATCH to look for other contracts that p2 is party to, which would reveal “second-hop” connections.
- It uses COLLECT(DISTINCT ...) to aggregate these second-hop connections cleanly.
- Finally, it brings all the requested information (the connected party, their other connections, the governing law, and the contract summary) into a single, coherent response.
This showcases the ultimate benefit of our architecture: an AI that not only finds text but also understands and navigates the complex web of relationships within legal documents.
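The two-hop logic of the DEPOMED query can be mirrored in a few lines of plain Python. This is a minimal sketch under stated assumptions: the contract data is invented for illustration, `connections` is a hypothetical helper, and the code only imitates what the MATCH and OPTIONAL MATCH clauses do rather than reproducing Neo4j's execution.

```python
from typing import Dict, List, Set

# Toy contract -> parties mapping, invented for illustration.
CONTRACT_PARTIES: Dict[str, Set[str]] = {
    "contract-1": {"Alpha Corp", "Beta LLC"},
    "contract-2": {"Beta LLC", "Gamma Inc"},
    "contract-3": {"Delta Ltd", "Epsilon SA"},
}

def connections(start: str) -> Dict[str, List[str]]:
    """For each co-party of `start`, list that co-party's other counterparties."""
    # First hop: parties sharing a contract with `start` (the MATCH clause).
    first_hop: Set[str] = set()
    for parties in CONTRACT_PARTIES.values():
        if start in parties:
            first_hop |= parties - {start}

    # Second hop: counterparties of each first-hop party, excluding `start`
    # and the party itself (the OPTIONAL MATCH clause). A party with no
    # further connections keeps an empty list, as in the log output.
    result: Dict[str, List[str]] = {}
    for p2 in sorted(first_hop):
        others: Set[str] = set()
        for parties in CONTRACT_PARTIES.values():
            if p2 in parties:
                others |= parties - {p2, start}
        result[p2] = sorted(others)
    return result

if __name__ == "__main__":
    print(connections("Alpha Corp"))
    print(connections("Delta Ltd"))
```

The second call corresponds to the DEPOMED result above: the connected party is found, but its list of other connections is empty, which is why the agent reports that no further parties are known.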
Summary and Conclusion
Traditional RAG systems, while helpful, often fall short in specialized domains like law, where queries require both conceptual understanding and precise, relational data retrieval. By recognizing this dual need, we have built a far more capable and intelligent legal assistant.
Our solution successfully addresses this challenge by implementing a hybrid architecture:
- We employed an “Agentic Indexing” approach, where a Gemini-powered agent intelligently parses contracts to construct a rich Neo4j knowledge graph. This solves the problem of answering specific, relational questions.
- We built a comprehensive vector store from scraped legal code to handle broad, conceptual questions about the law.
- We used LangGraph to create a sophisticated, stateful agent that can expertly route user queries to the appropriate knowledge source—or combine them—to generate a single, cohesive, and highly accurate answer.
This approach transforms the AI from a simple search tool into a genuine analytical partner, empowering legal professionals to find the information they need faster and more intuitively than ever before.
Future Enhancements
This powerful foundation can be extended in many exciting ways:
- Proactive Clause Analysis: The agent could be enhanced to proactively flag non-standard, risky, or ambiguous clauses during the indexing phase.
- Multi-Contract Comparison: A new agent ability could be added to compare terms across multiple contracts, such as “Compare the liability clauses in contract A and contract B.”
- Expanded Legal Knowledge: The vector store could be enriched with more regulations, case law summaries, and legal precedents to broaden its expertise.
- Historical Versioning: The knowledge graph could be updated to include timestamps on relationships, allowing the agent to analyze contract amendments and track changes over time.
- And much more …