RAGFlow框架使用指南

RAGFlow框架：构建智能知识问答系统的完整解决方案

什么是RAGFlow？

RAGFlow是一个开源的RAG（Retrieval-Augmented Generation，检索增强生成）引擎，专注于基于深度文档理解构建高质量的问答系统。它提供了从文档解析到智能对话的完整工作流，是企业构建知识库问答系统的理想选择。

核心特性

1. 智能文档解析

多格式支持：PDF、Word、PowerPoint、Excel、图片等
版面分析：自动识别文档结构和版面布局
表格识别：准确提取表格数据
OCR集成：图片中文字的识别和提取

2. 先进的文本分割

语义分割：基于文档语义的智能分块
结构化处理：保持文档的逻辑结构
多策略组合：支持多种分割策略的灵活配置

3. 高效检索系统

混合检索：结合向量检索和关键词检索
重排序优化：使用深度学习模型优化检索结果
多模态支持：支持文本、图像等多种模态的检索

4. 企业级特性

可视化界面：友好的Web管理界面
API接口：完整的RESTful API
权限管理：细粒度的用户权限控制
多租户支持：支持多个独立的知识库

系统架构

RAGFlow采用模块化架构设计，主要包含以下核心组件：

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   前端界面      │    │   API网关       │    │   管理后台      │
│   Web UI        │    │   API Gateway   │    │   Admin Panel   │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                       │                       │
         └───────────────────────┼───────────────────────┘
                                 │
                                 ▼
         ┌─────────────────────────────────────────────────────┐
         │                RAGFlow 核心引擎                     │
         │  ┌─────────────┬─────────────┬─────────────────────┐ │
         │  │ 文档解析器  │ 文本分割器  │    向量化模块       │ │
         │  │ Parser      │ Splitter    │    Embedding        │ │
         │  └─────────────┴─────────────┴─────────────────────┘ │
         │  ┌─────────────┬─────────────┬─────────────────────┐ │
         │  │ 检索引擎    │ 对话管理    │    知识库管理       │ │
         │  │ Retriever   │ Chat Mgr    │    KB Manager       │ │
         │  └─────────────┴─────────────┴─────────────────────┘ │
         └─────────────────────────────────────────────────────┘
                                 │
         ┌───────────────────────┼───────────────────────┐
         │                       │                       │
         ▼                       ▼                       ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   向量数据库    │    │   关系数据库    │    │   大语言模型    │
│   Vector Store  │    │   SQL Database  │    │   LLM Service   │
└─────────────────┘    └─────────────────┘    └─────────────────┘

快速开始

1. 环境准备

# 系统要求
- Python 3.8+
- Docker 20.10+
- 内存：8GB以上
- 磁盘空间：20GB以上

2. Docker部署（推荐）

# 克隆项目
git clone https://github.com/infiniflow/ragflow.git
cd ragflow

# 编辑配置文件
cp docker/.env.example docker/.env
# 根据需要修改配置

# 启动所有服务
docker-compose -f docker/docker-compose.yml up -d

# 查看服务状态
docker-compose -f docker/docker-compose.yml ps

3. 配置说明

# docker/.env文件配置示例

# 基础配置
RAGFLOW_VERSION=latest
HTTP_PORT=80
HTTPS_PORT=443

# 数据库配置
MYSQL_PASSWORD=infiniflow123
REDIS_PASSWORD=infiniflow123

# 模型配置
DEFAULT_LLM_FACTORY=OpenAI
OPENAI_API_KEY=your_openai_api_key
OPENAI_BASE_URL=https://api.openai.com/v1

# Embedding模型配置
DEFAULT_EMBEDDING_MODEL=BAAI/bge-large-zh-v1.5
HF_ENDPOINT=https://huggingface.co

4. 访问系统

# 服务启动后访问
http://localhost:80

# 默认账号（首次登录后请修改密码）
用户名: admin@infiniflow.org
密码: admin123

详细使用指南

1. 创建知识库

# 通过Web界面创建知识库
1. 登录RAGFlow管理界面
2. 点击"创建知识库"
3. 填写知识库信息：
   - 名称：技术文档库
   - 描述：公司技术文档知识库
   - 语言：中文
   - Embedding模型：bge-large-zh-v1.5
4. 配置检索参数：
   - 检索方式：混合检索
   - 相似度阈值：0.2
   - 最大检索数量：6

2. 上传和处理文档

# 使用API上传文档
import requests
import json

def upload_document(file_path, knowledge_base_id, api_key):
    """上传文档到指定知识库"""
    url = "http://localhost/api/v1/documents"
    
    headers = {
        "Authorization": f"Bearer {api_key}"
    }
    
    files = {
        "file": open(file_path, "rb")
    }
    
    data = {
        "knowledge_base_id": knowledge_base_id,
        "parser_method": "auto",  # 自动选择解析方法
        "chunk_method": "naive"   # 文本分割方法
    }
    
    response = requests.post(url, headers=headers, files=files, data=data)
    return response.json()

# 使用示例
result = upload_document("./documents/manual.pdf", "kb_001", "your_api_key")
print(f"文档上传结果: {result}")

3. 文档解析配置

# 配置文档解析参数
def configure_document_parser(document_id, parser_config):
    """配置文档解析器"""
    url = f"http://localhost/api/v1/documents/{document_id}/parser"
    
    config = {
        "parser_method": "manual",  # 手动配置
        "chunk_method": "knowledge_graph",  # 知识图谱分割
        "parser_config": {
            "pages": [[1, 10]],  # 指定解析页面范围
            "layout_recognize": True,  # 开启版面识别
            "table_recognize": True,   # 开启表格识别
            "formula_recognize": True  # 开启公式识别
        },
        "chunk_config": {
            "chunk_size": 1024,    # 分块大小
            "chunk_overlap": 128,  # 分块重叠
            "auto_keywords": True, # 自动提取关键词
            "auto_questions": True # 自动生成问题
        }
    }
    
    headers = {"Authorization": f"Bearer {api_key}"}
    response = requests.put(url, headers=headers, json=config)
    return response.json()

4. 智能问答

# 实现智能问答功能
def chat_with_knowledge_base(question, knowledge_base_id, api_key):
    """与知识库进行对话"""
    url = "http://localhost/api/v1/chat"
    
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }
    
    data = {
        "knowledge_base_id": knowledge_base_id,
        "question": question,
        "stream": False,  # 是否流式响应
        "quote": True,    # 是否引用原文
        "doc_ids": []     # 限制搜索的文档ID
    }
    
    response = requests.post(url, headers=headers, json=data)
    return response.json()

# 使用示例
answer = chat_with_knowledge_base(
    "如何配置RAGFlow的文档解析参数？", 
    "kb_001", 
    "your_api_key"
)

print(f"问题: {answer['question']}")
print(f"回答: {answer['answer']}")
print(f"引用: {answer['reference']}")

5. 高级检索功能

# 实现混合检索
def hybrid_search(query, knowledge_base_id, api_key):
    """混合检索：结合向量检索和关键词检索"""
    url = "http://localhost/api/v1/retrieval"
    
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }
    
    data = {
        "knowledge_base_id": knowledge_base_id,
        "question": query,
        "vector_similarity_weight": 0.7,  # 向量检索权重
        "keyword_similarity_weight": 0.3, # 关键词检索权重
        "top_k": 10,                      # 返回top k结果
        "similarity_threshold": 0.1,      # 相似度阈值
        "keyword_similarity_threshold": 0.0
    }
    
    response = requests.post(url, headers=headers, json=data)
    return response.json()

# 使用示例
results = hybrid_search("机器学习算法", "kb_001", "your_api_key")
for i, result in enumerate(results['chunks']):
    print(f"结果 {i+1}:")
    print(f"  相似度: {result['similarity']}")
    print(f"  内容: {result['content'][:100]}...")
    print(f"  来源: {result['document_name']}")

总结

RAGFlow作为一个功能强大的开源RAG框架，为企业构建智能问答系统提供了完整的解决方案。通过本文的详细介绍，您可以：

快速部署：使用Docker快速搭建RAGFlow环境
灵活配置：根据业务需求调整各种参数
高效使用：掌握文档处理、检索、问答等核心功能
企业级部署：实现高可用、可扩展的生产环境
性能优化：通过各种优化技术提升系统性能

RAGFlow的优势在于其完整性和易用性，既适合快速原型开发，也能满足企业级应用的需求。通过合理的配置和优化，可以构建出高质量、高性能的知识问答系统，为用户提供准确、及时的信息服务。