Installing Firecrawl with Docker
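Introduction
Firecrawl is an API service that takes a URL, crawls it, and converts the content into clean Markdown. It crawls all accessible subpages and returns clean Markdown for each one. No sitemap is required.
Preparation
Clone the Firecrawl repository.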
shell
git clone https://github.com/firecrawl/firecrawl.git
Copy .env.example to .env and adjust the key settings.
properties
# ===== Required environment variables =====
NUM_WORKERS_PER_QUEUE=8
PORT=3002
HOST=0.0.0.0
REDIS_URL=redis://redis:6379 #for self-hosting using docker, use redis://redis:6379. For running locally, use redis://localhost:6379
REDIS_RATE_LIMIT_URL=redis://redis:6379 #for self-hosting using docker, use redis://redis:6379. For running locally, use redis://localhost:6379
PLAYWRIGHT_MICROSERVICE_URL=http://playwright-service:3000/scrape
## To enable database authentication, you need to configure Supabase.
USE_DB_AUTHENTICATION=false
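Before building, it can help to confirm that the .env file really defines every variable listed above. The following is a minimal sketch, not part of Firecrawl: `parse_env` and `missing_keys` are hypothetical helper names, the key list simply mirrors the block above, and the parser only handles plain KEY=VALUE lines with optional trailing comments.

```python
# Hypothetical helper, not part of Firecrawl: checks a .env file for the keys shown above.
REQUIRED_KEYS = (
    "NUM_WORKERS_PER_QUEUE",
    "PORT",
    "HOST",
    "REDIS_URL",
    "REDIS_RATE_LIMIT_URL",
    "PLAYWRIGHT_MICROSERVICE_URL",
    "USE_DB_AUTHENTICATION",
)


def parse_env(text):
    """Parse simple KEY=VALUE lines; skip blank lines and full-line comments."""
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop an inline comment that follows a space, e.g. "value #note".
        value = value.split(" #", 1)[0].strip()
        env[key.strip()] = value
    return env


def missing_keys(env, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty, in declaration order."""
    return [key for key in required if not env.get(key)]
```

A typical call would be `missing_keys(parse_env(open(".env", encoding="utf-8").read()))`, which returns an empty list when nothing is missing.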
Edit the Dockerfile to use mirrors inside mainland China. This speeds up the build and helps avoid timeouts.
dockerfile
# syntax=docker/dockerfile:1
FROM node:22-slim AS base
ENV PNPM_HOME="/pnpm"
ENV PATH="$PNPM_HOME:$PATH"
ENV CI=true
# Switch the Debian package sources to a mirror inside mainland China
RUN sed -i 's/deb.debian.org/mirrors.aliyun.com/g' /etc/apt/sources.list.d/debian.sources && \
    sed -i 's/security.debian.org/mirrors.aliyun.com/g' /etc/apt/sources.list.d/debian.sources
# Environment variables that speed up Node.js downloads
ENV NODEJS_ORG_MIRROR=https://npmmirror.com/mirrors/node
ENV NVM_NODEJS_ORG_MIRROR=https://npmmirror.com/mirrors/node
# Point npm at a registry mirror inside mainland China
RUN npm config set registry https://registry.npmmirror.com
RUN corepack enable
# Build Go shared library
FROM golang:1.24 AS go-build
WORKDIR /app
COPY sharedLibs/go-html-to-md ./sharedLibs/go-html-to-md
RUN go env -w GOPROXY=https://goproxy.cn,direct && \
    cd sharedLibs/go-html-to-md && \
    go mod download && \
    go build -o libhtml-to-markdown.so -buildmode=c-shared html-to-markdown.go
FROM base AS build
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y \
    curl \
    build-essential \
    pkg-config \
    && rm -rf /var/lib/apt/lists/*
# Install Rust
ENV RUSTUP_HOME=/usr/local/rustup \
    CARGO_HOME=/usr/local/cargo \
    PATH=/usr/local/cargo/bin:$PATH
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y --no-modify-path \
    && chmod -R a+w $RUSTUP_HOME $CARGO_HOME
# Copy source files
COPY pnpm-lock.yaml pnpm-workspace.yaml package.json ./
COPY . .
# Install dependencies
RUN --mount=type=cache,id=pnpm,target=/pnpm/store \
    --mount=type=cache,target=/usr/local/cargo/registry \
    --mount=type=cache,target=/app/native/target \
    pnpm install --frozen-lockfile
# Build the application
RUN pnpm run build
# Remove dev dependencies
RUN pnpm prune --prod --ignore-scripts
# Runtime stage
FROM base AS runtime
# Install runtime dependencies
RUN apt-get update && apt-get install -y \
    git \
    procps \
    && rm -rf /var/lib/apt/lists/*
EXPOSE 8080
WORKDIR /app
# Copy built application
COPY --from=build /app/node_modules ./node_modules
COPY --from=build /app/dist ./dist
COPY --from=build /app/native ./native
# Copy Go shared library
COPY --from=go-build /app/sharedLibs/go-html-to-md/libhtml-to-markdown.so ./sharedLibs/go-html-to-md/
CMD ["node", "dist/src/harness.js", "--start-docker"]
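Adjust docker-compose.yaml as needed. Redis cannot cause a port conflict on the host, because the redis service has no port mapping. The nuq-postgres service, however, publishes 5432:5432, so a PostgreSQL instance already listening on host port 5432 would conflict with it.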
yaml
name: firecrawl
x-common-service: &common-service
  # NOTE: If you don't want to build the service locally,
  # uncomment the image: statement and comment out the build: statement
  # image: ghcr.io/firecrawl/firecrawl
  build: apps/api
  ulimits:
    nofile:
      soft: 65535
      hard: 65535
  networks:
    - backend
  extra_hosts:
    - "host.docker.internal:host-gateway"
x-common-env: &common-env
  REDIS_URL: ${REDIS_URL:-redis://redis:6379}
  REDIS_RATE_LIMIT_URL: ${REDIS_URL:-redis://redis:6379}
  PLAYWRIGHT_MICROSERVICE_URL: ${PLAYWRIGHT_MICROSERVICE_URL:-http://playwright-service:3000/scrape}
  NUQ_DATABASE_URL: postgres://postgres:postgres@nuq-postgres:5432/postgres
  USE_DB_AUTHENTICATION: ${USE_DB_AUTHENTICATION}
  OPENAI_API_KEY: ${OPENAI_API_KEY}
  OPENAI_BASE_URL: ${OPENAI_BASE_URL}
  MODEL_NAME: ${MODEL_NAME}
  MODEL_EMBEDDING_NAME: ${MODEL_EMBEDDING_NAME}
  OLLAMA_BASE_URL: ${OLLAMA_BASE_URL}
  SLACK_WEBHOOK_URL: ${SLACK_WEBHOOK_URL}
  BULL_AUTH_KEY: ${BULL_AUTH_KEY}
  TEST_API_KEY: ${TEST_API_KEY}
  POSTHOG_API_KEY: ${POSTHOG_API_KEY}
  POSTHOG_HOST: ${POSTHOG_HOST}
  SUPABASE_ANON_TOKEN: ${SUPABASE_ANON_TOKEN}
  SUPABASE_URL: ${SUPABASE_URL}
  SUPABASE_SERVICE_TOKEN: ${SUPABASE_SERVICE_TOKEN}
  SELF_HOSTED_WEBHOOK_URL: ${SELF_HOSTED_WEBHOOK_URL}
  SERPER_API_KEY: ${SERPER_API_KEY}
  SEARCHAPI_API_KEY: ${SEARCHAPI_API_KEY}
  LOGGING_LEVEL: ${LOGGING_LEVEL}
  PROXY_SERVER: ${PROXY_SERVER}
  PROXY_USERNAME: ${PROXY_USERNAME}
  PROXY_PASSWORD: ${PROXY_PASSWORD}
  SEARXNG_ENDPOINT: ${SEARXNG_ENDPOINT}
  SEARXNG_ENGINES: ${SEARXNG_ENGINES}
  SEARXNG_CATEGORIES: ${SEARXNG_CATEGORIES}
services:
  playwright-service:
    # NOTE: If you don't want to build the service locally,
    # uncomment the image: statement and comment out the build: statement
    # image: ghcr.io/firecrawl/playwright-service:latest
    build: apps/playwright-service-ts
    environment:
      PORT: 3000
      PROXY_SERVER: ${PROXY_SERVER}
      PROXY_USERNAME: ${PROXY_USERNAME}
      PROXY_PASSWORD: ${PROXY_PASSWORD}
      BLOCK_MEDIA: ${BLOCK_MEDIA}
    networks:
      - backend
  api:
    <<: *common-service
    environment:
      <<: *common-env
      HOST: "0.0.0.0"
      PORT: ${INTERNAL_PORT:-3002}
      WORKER_PORT: ${WORKER_PORT:-3005}
      ENV: local
    depends_on:
      - redis
      - playwright-service
    ports:
      - "${PORT:-3002}:${INTERNAL_PORT:-3002}"
    command: node dist/src/harness.js --start-docker
  redis:
    # NOTE: If you want to use Valkey (open source) instead of Redis (source available),
    # uncomment the Valkey statement and comment out the Redis statement.
    # Using Valkey with Firecrawl is untested and not guaranteed to work. Use with caution.
    image: redis:alpine
    # image: valkey/valkey:alpine
    networks:
      - backend
    command: redis-server --bind 0.0.0.0
  nuq-postgres:
    build: apps/nuq-postgres
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: postgres
    networks:
      - backend
    ports:
      - "5432:5432"
networks:
  backend:
    driver: bridge
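The compose file above uses `${VAR}` and `${VAR:-default}` placeholders, where the `:-` form falls back to the default when the variable is unset or empty. The sketch below mimics that rule for these two simple forms only, to make it easier to predict which host port ends up published. `expand` is a hypothetical helper written for illustration, not Compose's own implementation.

```python
import re

# Matches ${NAME} and ${NAME:-default}; other Compose forms are not handled here.
_PLACEHOLDER = re.compile(r"\$\{(\w+)(?::-([^}]*))?\}")


def expand(template, env):
    """Expand placeholders in `template` using the mapping `env` (str -> str)."""
    def replace(match):
        name, default = match.group(1), match.group(2)
        value = env.get(name, "")
        # ':-' applies the default when the variable is unset OR empty.
        if value == "" and default is not None:
            return default
        return value
    return _PLACEHOLDER.sub(replace, template)
```

For example, `expand("${PORT:-3002}:${INTERNAL_PORT:-3002}", {"PORT": "8080"})` yields `8080:3002`, meaning host port 8080 is forwarded to container port 3002.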
Build and run the containers
Build and run the Docker containers (the build alone took tens of minutes).
shell
docker compose build --no-cache
docker compose up -d
This starts a local Firecrawl instance that is reachable at http://localhost:3002. If the test request returns Hello, world!, the installation succeeded.
shell
curl http://localhost:3002/test
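Right after `docker compose up -d` the API may need a moment before it accepts connections, so a single curl call can fail even when the setup is fine. The following is a minimal polling sketch, assuming the /test endpoint and the Hello, world! greeting described above. `wait_until_ready` and `http_get_text` are hypothetical helper names, and the retry count and delay are arbitrary choices.

```python
import time
import urllib.error
import urllib.request


def http_get_text(url, timeout=5.0):
    """Fetch a URL and return the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def wait_until_ready(url="http://localhost:3002/test", fetch=http_get_text,
                     attempts=30, delay=2.0, sleep=time.sleep):
    """Poll the test endpoint until it answers with the expected greeting."""
    for attempt in range(attempts):
        try:
            if "Hello, world!" in fetch(url):
                return True
        except (urllib.error.URLError, OSError):
            pass  # the container is not accepting connections yet
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Calling `wait_until_ready()` returns True once the instance responds and False if it never does within the allotted attempts.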
API testing
Scrape a single page
See the Scrape API documentation.
shell
curl --location --request POST 'http://localhost:3002/v2/scrape' \
  --header 'Content-Type: application/json' \
  --data-raw '{
    "url": "https://docs.firecrawl.dev/zh/introduction",
    "formats": [
      "markdown"
    ]
  }'
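The same scrape request can be sent from a script. The sketch below uses only the Python standard library and mirrors the curl command above. The response shape (a `success` flag plus `data.markdown`) is an assumption that should be checked against the Scrape API documentation, and the function names are hypothetical.

```python
import json
import urllib.request


def build_scrape_request(target_url, formats=("markdown",),
                         base_url="http://localhost:3002"):
    """Build the same POST /v2/scrape request as the curl command."""
    body = json.dumps({"url": target_url, "formats": list(formats)}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v2/scrape",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_markdown(payload):
    """Return the Markdown from a response dict (assumed shape: success + data.markdown)."""
    if not payload.get("success"):
        raise RuntimeError(f"scrape failed: {payload.get('error', 'unknown error')}")
    return payload["data"]["markdown"]


def scrape(target_url, **kwargs):
    """Send the request to the local instance and return the page as Markdown."""
    request = build_scrape_request(target_url, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as resp:
        return extract_markdown(json.load(resp))
```

With the containers running, `print(scrape("https://docs.firecrawl.dev/zh/introduction"))` should print the page as Markdown if the assumed response shape holds.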
Crawl a website
See the Crawl API documentation.
shell
curl --location --request POST 'http://localhost:3002/v2/crawl' \
  --header 'Content-Type: application/json' \
  --data-raw '{
    "url": "https://docs.firecrawl.dev",
    "limit": 10,
    "scrapeOptions": {
      "formats": [
        "markdown",
        "html"
      ]
    }
  }'
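A crawl runs as a background job, so the POST above is usually followed by polling for the result. The sketch below shows one way to do that. It assumes the start response carries an `id`, that the status lives at `GET /v2/crawl/<id>`, and that the status values include `completed`, `failed`, and `cancelled`. All of these are assumptions to verify against the Crawl API documentation, and `crawl_and_wait` and `fetch_json` are hypothetical names.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:3002"


def fetch_json(url, body=None):
    """POST JSON when a body is given, otherwise GET; return the decoded JSON."""
    data = None if body is None else json.dumps(body).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request, timeout=60) as resp:
        return json.load(resp)


def crawl_and_wait(start_body, request=fetch_json, poll_every=5.0,
                   max_polls=120, sleep=time.sleep):
    """Start a crawl, then poll its status until it finishes or fails."""
    job = request(f"{BASE_URL}/v2/crawl", start_body)
    status_url = f"{BASE_URL}/v2/crawl/{job['id']}"  # assumed status endpoint
    for _ in range(max_polls):
        status = request(status_url)
        state = status.get("status")
        if state == "completed":
            return status.get("data", [])
        if state in ("failed", "cancelled"):  # assumed terminal states
            raise RuntimeError(f"crawl ended with status {state}")
        sleep(poll_every)
    raise TimeoutError("crawl did not finish in time")
```

Passing the same JSON body as the curl command, for example `crawl_and_wait({"url": "https://docs.firecrawl.dev", "limit": 10})`, returns the list of crawled pages. Larger crawls may return results in several pages, which this sketch does not follow.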
FAQ
The Docker build fails and aborts with: ERROR [api go-build 4/4] RUN cd sharedLibs/go-html-to-md && go mod download && go build -o
text
> [api go-build 4/4] RUN cd sharedLibs/go-html-to-md && go mod download && go build -o libhtml-to-markdown.so -buildmode=c-shared html-to-markdown.go:
60.42 go: github.com/PuerkitoBio/goquery@v1.10.3: Get "https://proxy.golang.org/github.com/%21puerkito%21bio/goquery/@v/v1.10.3.mod": dial tcp 142.250.204.49:443: i/o timeout
------
failed to solve: process "/bin/sh -c cd sharedLibs/go-html-to-md && go mod download && go build -o libhtml-to-markdown.so -buildmode=c-shared html-to-markdown.go" did not complete successfully: exit code: 1
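Solution: the build timed out reaching proxy.golang.org, so edit the Dockerfile to use a mirror inside mainland China, as in the Preparation section, where the go-build stage sets GOPROXY=https://goproxy.cn,direct before go mod download.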
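References
- Official documentation, self-hosting: https://docs.firecrawl.dev/zh/contributing/self-host
- Local deployment of Firecrawl with Docker: http://www.884358.com/firecrawl/
- Firecrawl deep dive: batch scraping and hands-on API usage with the 37K-star open-source crawler, powering data acquisition for LLMs: https://www.aigc.bar/AI资讯文章/2025/05/20/firecrawl-deep-dive-batch-scraping-api-guide-for-llm-data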