AI RAG Engine

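Features

Multimodal document processing: extracts and processes text, images, and tables from PDF documents
Intelligent chunking: splits documents into overlapping chunks to better preserve context
Hybrid retrieval: combines semantic and keyword search with configurable weights
LLM reranking: uses a Gemini model to rerank results for relevance
Grounding verification: automatically checks claims in a response against the source material
Context management: supports multi-turn conversations with context tracking
Google ADK integration: a modern agent framework built on the Google Agent Development Kit
Cloud Storage integration: integrates with GCS for document management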

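Architecture

[Architecture diagram; connectors lost in extraction. Components shown, top to bottom: PDF Documents, RAG Ingestion Pipeline, and Vertex AI RAG Corpus in the first row; Google ADK Agent; Gemini Model (Flash 2.5)]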

Requirements

Python 3.8+
Google Cloud Platform account
Vertex AI API enabled
Google Cloud Storage bucket

Installation

Clone the repository:

git clone <repository-url>
cd ai-rag-engine

Create and activate a virtual environment:

python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

Install dependencies:
```
pip install -r requirements.txt
```

Set environment variables:
Create a .env file with the following contents:

GOOGLE_CLOUD_PROJECT=your-project-id
GOOGLE_CLOUD_LOCATION=us-central1
GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
GCS_BUCKET_NAME=your-bucket-name
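
The four variables can be checked at startup, before any Google Cloud call is made. The following is a minimal sketch and not part of the repository: the helper name `missing_env_vars` is hypothetical, and it assumes the values have already been loaded into the process environment (for example by a .env loader such as python-dotenv).

```python
import os
from typing import Iterable, List, Mapping, Optional

# Variable names taken from the .env example in this README.
REQUIRED_VARS = (
    "GOOGLE_CLOUD_PROJECT",
    "GOOGLE_CLOUD_LOCATION",
    "GOOGLE_APPLICATION_CREDENTIALS",
    "GCS_BUCKET_NAME",
)


def missing_env_vars(
    env: Optional[Mapping[str, str]] = None,
    required: Iterable[str] = REQUIRED_VARS,
) -> List[str]:
    """Return the required names that are unset or blank (hypothetical helper)."""
    source = os.environ if env is None else env
    return [name for name in required if not source.get(name, "").strip()]
```

Calling `missing_env_vars()` with no arguments inspects `os.environ`; an empty list means all four variables are set.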

Usage

Basic usage

Place your PDF documents in the docs/ directory
Run the RAG agent:
```
python rag_agent.py
```

Programmatic usage

import os

from rag_agent import (
    ADKRAGAgent,
    create_rag_corpus,
    import_documents_to_corpus,
    upload_file_to_gcs,
)

# Initialize the system
corpus_id = create_rag_corpus(
    corpus_name="my-knowledge-base",
    description="Enterprise documentation"
)

# Local PDFs to upload from the docs folder
document_paths = [
    "docs/technical_manual.pdf",
    "docs/product_specs.pdf",
    "docs/user_guide.pdf"
]

# Upload each file that exists to GCS and collect its URI
gcs_uris = []
for doc_path in document_paths:
    if os.path.exists(doc_path):
        gcs_uri = upload_file_to_gcs(doc_path, os.getenv('GCS_BUCKET_NAME'))
        gcs_uris.append(gcs_uri)

# Import uploaded documents to RAG corpus
import_documents_to_corpus(corpus_id, gcs_uris)

# Create and use the agent
adk_agent = ADKRAGAgent(
    corpus_id=corpus_id,
    project_id="your-project-id",
    location="us-central1"
)

agent = adk_agent.create_agent()
response = adk_agent.chat(agent, "What are the system requirements?")
print(response)

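Key components

RAGAgent

Core RAG functionality:

Hybrid retrieval combining semantic and keyword matching
Context management for multi-turn conversations
Result reranking using LLM scoring
Grounding verification to reduce hallucination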
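
The reranking step listed above can be pictured as "score every retrieved passage against the query, then keep the best few". The sketch below is an illustration only and not the repository's code: `rerank` and `word_overlap_score` are hypothetical names, and the word-overlap scorer is a stand-in for a Gemini scoring call so that the example runs offline.

```python
from typing import Callable, List, Sequence, Tuple


def rerank(
    query: str,
    passages: Sequence[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 3,
) -> List[Tuple[str, float]]:
    """Score each passage against the query and return the best top_n, highest first."""
    scored = [(passage, float(score_fn(query, passage))) for passage in passages]
    # list.sort is stable, so passages with equal scores keep their retrieval order.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


def word_overlap_score(query: str, passage: str) -> float:
    """Stand-in scorer: fraction of query words that also appear in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & passage_words) / len(query_words)
```

In a real setup, `score_fn` would wrap a model call that returns a relevance score; passing it in as a parameter keeps the ranking logic testable without network access.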

ADKRAGAgent

Google ADK wrapper that provides:

Tool-based RAG search
Native integration with Gemini models
Automatic tool calling and answer synthesis
Session-based conversation tracking

Document processing

PDF text extraction with PyPDF2
Image extraction with PyMuPDF (fitz)
Table content parsing and structuring
Multimodal embedding generation

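Configuration

Retrieval parameters

Adjust in configure_retrieval_parameters():

similarity_top_k: number of results to retrieve (default: 10)
vector_distance_threshold: similarity threshold (default: 0.5)
alpha: hybrid retrieval weight (default: 0.5)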
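
To make the effect of these parameters concrete, here is a small sketch of how a top-k cap and a distance threshold might be applied to candidate results. It is not the body of configure_retrieval_parameters(): `RetrievalParams` and `select_results` are hypothetical names, and the sketch assumes the threshold is a maximum vector distance, so a smaller value means a closer match.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class RetrievalParams:
    """Hypothetical container; defaults mirror the values listed in this README."""

    similarity_top_k: int = 10
    vector_distance_threshold: float = 0.5
    alpha: float = 0.5

    def __post_init__(self) -> None:
        if self.similarity_top_k < 1:
            raise ValueError("similarity_top_k must be at least 1")
        if not 0.0 <= self.alpha <= 1.0:
            raise ValueError("alpha must be between 0 and 1")


def select_results(
    candidates: Sequence[Tuple[str, float]],
    params: RetrievalParams,
) -> List[Tuple[str, float]]:
    """Keep (text, distance) pairs within the threshold, closest first, capped at top_k.

    Assumption: a smaller distance means a closer match.
    """
    kept = [c for c in candidates if c[1] <= params.vector_distance_threshold]
    kept.sort(key=lambda c: c[1])
    return kept[: params.similarity_top_k]
```

Lowering vector_distance_threshold trades recall for precision under this assumption; raising similarity_top_k gives the reranker more candidates to work with.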

Chunking settings

Modify in chunk_document():

chunk_size: characters per chunk (default: 1000)
overlap: overlap between chunks (default: 200)
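
The interaction between chunk_size and overlap is easiest to see in code. The following is a minimal sketch of fixed-size chunking with overlap, assuming that overlap is measured in characters like chunk_size. The function name `chunk_text` is hypothetical and this is not the repository's chunk_document() implementation.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Split text into chunks of at most chunk_size characters.

    Each chunk starts (chunk_size - overlap) characters after the previous one,
    so consecutive chunks share `overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks
```

With the defaults, a 2,500-character document yields three chunks starting at offsets 0, 800, and 1600. A larger overlap preserves more context across chunk boundaries at the cost of more chunks to embed and store.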

Model configuration

Change models at initialization:

RAG Agent: gemini-2.5-flash
ADK Agent: gemini-2.0-flash-001
Embedding: text-embedding-004

Project structure

ai-rag-engine/
    rag_agent.py          # Main RAG system implementation
    requirements.txt      # Python dependencies
    .env                  # Environment variables (not in git)
    docs/                 # Source PDF documents
    extracted_images/     # Extracted images from PDFs
    README.md             # This file

Feature details

Hybrid retrieval

Combines semantic vectors with keyword matching to improve retrieval quality:

results = agent.hybrid_search(
    corpus_id=corpus_id,
    query="your query",
    semantic_weight=0.7,  # 70% semantic, 30% keyword
    top_k=10
)
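
One common way to realise a 70/30 split like the one above is a weighted sum of per-document scores from the two retrievers. The sketch below illustrates that idea under stated assumptions: both score maps are already normalised to [0, 1], and `fuse_scores` is a hypothetical helper, not the internals of hybrid_search().

```python
from typing import Dict, List, Tuple


def fuse_scores(
    semantic: Dict[str, float],
    keyword: Dict[str, float],
    semantic_weight: float = 0.7,
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    """Blend two score maps (document id -> score in [0, 1]) into one ranking."""
    if not 0.0 <= semantic_weight <= 1.0:
        raise ValueError("semantic_weight must be between 0 and 1")
    keyword_weight = 1.0 - semantic_weight
    combined = {}
    for doc_id in set(semantic) | set(keyword):
        # A document missing from one retriever contributes 0 for that side.
        combined[doc_id] = (
            semantic_weight * semantic.get(doc_id, 0.0)
            + keyword_weight * keyword.get(doc_id, 0.0)
        )
    # Sort by score descending, then by id so ties are deterministic.
    ranked = sorted(combined.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]
```

A higher semantic_weight favours paraphrased matches, while a lower one favours exact terms such as product names or error codes.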

Grounding verification

Checks that the content of a response is factually grounded in the source documents:

verification = agent.verify_grounding(
    response=answer,
    sources=retrieved_docs
)
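
The general shape of such a check is "split the answer into claims, then look for support for each claim in the sources". The sketch below uses plain word overlap as a crude, offline stand-in for whatever verify_grounding() actually does. The function name `check_grounding`, the `min_overlap` parameter, and the report format are all assumptions made for illustration.

```python
import re
from typing import Dict, List, Sequence


def _words(text: str) -> set:
    """Lower-cased alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def check_grounding(
    response: str,
    sources: Sequence[str],
    min_overlap: float = 0.6,
) -> List[Dict[str, object]]:
    """Mark each sentence of the response as supported or not.

    A sentence counts as supported when at least min_overlap of its words
    appear in a single source passage. This is a lexical proxy only.
    """
    source_words = [_words(source) for source in sources]
    report = []
    for sentence in re.split(r"(?<=[.!?])\s+", response.strip()):
        words = _words(sentence)
        if not words:
            continue
        best = max((len(words & sw) / len(words) for sw in source_words), default=0.0)
        report.append({"claim": sentence, "overlap": best, "supported": best >= min_overlap})
    return report
```

Lexical overlap misses paraphrases and cannot detect contradictions, which is why an LLM-based or entailment-based check is usually preferred in practice.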

Multimodal processing

Handles documents that contain text, images, and tables:

images = agent.extract_images_from_pdf(pdf_path, output_dir)
table_data = agent.process_table_content(table_text)
embedding = agent.create_multimodal_embedding(text, image_path, table_data)

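Limitations

Rate limit: RAG retrieval operations are limited to 5 requests per minute
After importing documents into the corpus, allow 3 minutes for indexing to complete
Only PDF processing is supported (can be extended to other formats if needed)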
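
Client code that issues many queries can stay under the retrieval rate limit by throttling itself. The following is a minimal sliding-window sketch, assuming the limit is 5 calls in any 60-second window. The class name `SlidingWindowLimiter` is hypothetical and the repository may handle throttling differently.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowLimiter:
    """Allow at most max_calls within any window of `period` seconds."""

    def __init__(
        self,
        max_calls: int = 5,
        period: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.max_calls = max_calls
        self.period = period
        self._clock = clock
        self._sleep = sleep
        self._calls: Deque[float] = deque()

    def acquire(self) -> float:
        """Block until a call is allowed; return the seconds spent waiting."""
        waited = 0.0
        while True:
            now = self._clock()
            # Drop timestamps that have left the window.
            while self._calls and now - self._calls[0] >= self.period:
                self._calls.popleft()
            if len(self._calls) < self.max_calls:
                self._calls.append(now)
                return waited
            # Wait until the oldest recorded call leaves the window.
            delay = self.period - (now - self._calls[0])
            self._sleep(delay)
            waited += delay
```

Call `limiter.acquire()` immediately before each retrieval request. The `clock` and `sleep` parameters exist so the limiter can be tested without real delays.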

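Contributing

Contributions are welcome! Please follow these guidelines:

Fork the repository
Create a feature branch
Make your changes with clear commit messages
Submit a pull request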

License

[Add your license information here]

Support

For issues or questions:

Create an issue on GitHub
See the Google Vertex AI documentation: https://cloud.google.com/vertex-ai/docs

Acknowledgements

Built with Google Vertex AI RAG Engine
Powered by Google ADK and Gemini models
Document processing with PyPDF2 and PyMuPDF