Installing Firecrawl with Docker


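Introduction

Firecrawl is an API service that takes a URL, crawls it, and converts the content into clean Markdown. It crawls all accessible subpages and returns clean Markdown for each one. No sitemap is required.

Preparation

Clone the Firecrawl repository.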

shell
git clone https://github.com/firecrawl/firecrawl.git

Copy .env.example to .env and edit the key settings.

firecrawl/apps/api/.env
properties
# ===== Required environment variables ======
NUM_WORKERS_PER_QUEUE=8
PORT=3002
HOST=0.0.0.0
REDIS_URL=redis://redis:6379 #for self-hosting using docker, use redis://redis:6379. For running locally, use redis://localhost:6379
REDIS_RATE_LIMIT_URL=redis://redis:6379 #for self-hosting using docker, use redis://redis:6379. For running locally, use redis://localhost:6379
PLAYWRIGHT_MICROSERVICE_URL=http://playwright-service:3000/scrape

## To enable database authentication, you need to configure Supabase.
USE_DB_AUTHENTICATION=false
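
Before building, it can help to confirm that none of the required variables listed above was dropped while editing. The following is a minimal sketch, not part of Firecrawl: `missing_env_keys` is a hypothetical helper name, and the key list is simply copied from the required block in this guide.

```shell
#!/bin/sh
# Hypothetical helper (not part of Firecrawl): print every required key
# that has no non-empty value in the given env file.
# The key list is copied from the "required" block of this guide.
missing_env_keys() {
    env_file="$1"
    for key in NUM_WORKERS_PER_QUEUE PORT HOST REDIS_URL \
               REDIS_RATE_LIMIT_URL PLAYWRIGHT_MICROSERVICE_URL; do
        # A key counts as set when a line starts with KEY= plus at least one character.
        grep -q "^${key}=." "$env_file" || echo "$key"
    done
}

# Usage (path taken from this guide):
# missing_env_keys firecrawl/apps/api/.env
```

An empty output means every required key is present; any printed name still needs a value.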

Edit the Dockerfile to use mirrors inside China, which speeds up the build and helps avoid timeouts.

firecrawl/apps/api/Dockerfile
dockerfile
# syntax=docker/dockerfile:1
FROM node:22-slim AS base

ENV PNPM_HOME="/pnpm"
ENV PATH="$PNPM_HOME:$PATH"
ENV CI=true

# Replace the Debian/Ubuntu package sources with a mirror inside China
RUN sed -i 's/deb.debian.org/mirrors.aliyun.com/g' /etc/apt/sources.list.d/debian.sources && \
    sed -i 's/security.debian.org/mirrors.aliyun.com/g' /etc/apt/sources.list.d/debian.sources
# Set environment variables to speed up downloads
ENV NODEJS_ORG_MIRROR=https://npmmirror.com/mirrors/node
ENV NVM_NODEJS_ORG_MIRROR=https://npmmirror.com/mirrors/node
# Point npm at a registry mirror inside China
RUN npm config set registry https://registry.npmmirror.com

RUN corepack enable

# Build Go shared library
FROM golang:1.24 AS go-build
WORKDIR /app
COPY sharedLibs/go-html-to-md ./sharedLibs/go-html-to-md

RUN go env -w GOPROXY=https://goproxy.cn,direct && \
    cd sharedLibs/go-html-to-md && \
    go mod download && \
    go build -o libhtml-to-markdown.so -buildmode=c-shared html-to-markdown.go

FROM base AS build
WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    curl \
    build-essential \
    pkg-config \
    && rm -rf /var/lib/apt/lists/*

# Install Rust
ENV RUSTUP_HOME=/usr/local/rustup \
    CARGO_HOME=/usr/local/cargo \
    PATH=/usr/local/cargo/bin:$PATH

RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y --no-modify-path \
    && chmod -R a+w $RUSTUP_HOME $CARGO_HOME

# Copy source files
COPY pnpm-lock.yaml pnpm-workspace.yaml package.json ./
COPY . .

# Install dependencies
RUN --mount=type=cache,id=pnpm,target=/pnpm/store \
    --mount=type=cache,target=/usr/local/cargo/registry \
    --mount=type=cache,target=/app/native/target \
    pnpm install --frozen-lockfile

# Build the application
RUN pnpm run build

# Remove dev dependencies
RUN pnpm prune --prod --ignore-scripts

# Runtime stage
FROM base AS runtime

# Install runtime dependencies
RUN apt-get update && apt-get install -y \
    git \
    procps \
    && rm -rf /var/lib/apt/lists/*

EXPOSE 8080
WORKDIR /app

# Copy built application
COPY --from=build /app/node_modules ./node_modules
COPY --from=build /app/dist ./dist
COPY --from=build /app/native ./native

# Copy Go shared library
COPY --from=go-build /app/sharedLibs/go-html-to-md/libhtml-to-markdown.so ./sharedLibs/go-html-to-md/

CMD ["node", "dist/src/harness.js", "--start-docker"]

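Adjust docker-compose.yaml as needed. The Redis service publishes no host port, so it cannot conflict with a Redis instance already running on the host. The nuq-postgres service, by contrast, maps 5432:5432, so change that mapping if the host port is already taken.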

firecrawl/docker-compose.yaml
yaml
name: firecrawl

x-common-service: &common-service
  # NOTE: If you don't want to build the service locally,
  # uncomment the image: statement and comment out the build: statement
  # image: ghcr.io/firecrawl/firecrawl
  build: apps/api

  ulimits:
    nofile:
      soft: 65535
      hard: 65535
  networks:
    - backend
  extra_hosts:
    - "host.docker.internal:host-gateway"

x-common-env: &common-env
  REDIS_URL: ${REDIS_URL:-redis://redis:6379}
  REDIS_RATE_LIMIT_URL: ${REDIS_URL:-redis://redis:6379}
  PLAYWRIGHT_MICROSERVICE_URL: ${PLAYWRIGHT_MICROSERVICE_URL:-http://playwright-service:3000/scrape}
  NUQ_DATABASE_URL: postgres://postgres:postgres@nuq-postgres:5432/postgres
  USE_DB_AUTHENTICATION: ${USE_DB_AUTHENTICATION}
  OPENAI_API_KEY: ${OPENAI_API_KEY}
  OPENAI_BASE_URL: ${OPENAI_BASE_URL}
  MODEL_NAME: ${MODEL_NAME}
  MODEL_EMBEDDING_NAME: ${MODEL_EMBEDDING_NAME} 
  OLLAMA_BASE_URL: ${OLLAMA_BASE_URL} 
  SLACK_WEBHOOK_URL: ${SLACK_WEBHOOK_URL}
  BULL_AUTH_KEY: ${BULL_AUTH_KEY}
  TEST_API_KEY: ${TEST_API_KEY}
  POSTHOG_API_KEY: ${POSTHOG_API_KEY}
  POSTHOG_HOST: ${POSTHOG_HOST}
  SUPABASE_ANON_TOKEN: ${SUPABASE_ANON_TOKEN}
  SUPABASE_URL: ${SUPABASE_URL}
  SUPABASE_SERVICE_TOKEN: ${SUPABASE_SERVICE_TOKEN}
  SELF_HOSTED_WEBHOOK_URL: ${SELF_HOSTED_WEBHOOK_URL}
  SERPER_API_KEY: ${SERPER_API_KEY}
  SEARCHAPI_API_KEY: ${SEARCHAPI_API_KEY}
  LOGGING_LEVEL: ${LOGGING_LEVEL}
  PROXY_SERVER: ${PROXY_SERVER}
  PROXY_USERNAME: ${PROXY_USERNAME}
  PROXY_PASSWORD: ${PROXY_PASSWORD}
  SEARXNG_ENDPOINT: ${SEARXNG_ENDPOINT}
  SEARXNG_ENGINES: ${SEARXNG_ENGINES}
  SEARXNG_CATEGORIES: ${SEARXNG_CATEGORIES}

services:
  playwright-service:
    # NOTE: If you don't want to build the service locally,
    # uncomment the image: statement and comment out the build: statement
    # image: ghcr.io/firecrawl/playwright-service:latest
    build: apps/playwright-service-ts
    environment:
      PORT: 3000
      PROXY_SERVER: ${PROXY_SERVER}
      PROXY_USERNAME: ${PROXY_USERNAME}
      PROXY_PASSWORD: ${PROXY_PASSWORD}
      BLOCK_MEDIA: ${BLOCK_MEDIA}
    networks:
      - backend

  api:
    <<: *common-service
    environment:
      <<: *common-env
      HOST: "0.0.0.0"
      PORT: ${INTERNAL_PORT:-3002}
      WORKER_PORT: ${WORKER_PORT:-3005}
      ENV: local
    depends_on:
      - redis
      - playwright-service
    ports:
      - "${PORT:-3002}:${INTERNAL_PORT:-3002}"
    command: node dist/src/harness.js --start-docker

  redis:
    # NOTE: If you want to use Valkey (open source) instead of Redis (source available),
    # uncomment the Valkey statement and comment out the Redis statement.
    # Using Valkey with Firecrawl is untested and not guaranteed to work. Use with caution.
    image: redis:alpine
    # image: valkey/valkey:alpine

    networks:
      - backend
    command: redis-server --bind 0.0.0.0
  
  nuq-postgres:
    build: apps/nuq-postgres
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: postgres
    networks:
      - backend
    ports:
      - "5432:5432"

networks:
  backend:
    driver: bridge

Build and run the containers

Build and start the Docker containers. The build alone took several tens of minutes.

shell
docker compose build --no-cache

docker compose up -d

This starts a local Firecrawl instance at http://localhost:3002. If the test request below returns Hello, world!, the installation succeeded.

shell
curl http://localhost:3002/test
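
The API may need a little time after `docker compose up -d` before it answers, so a single curl can fail even when the setup is fine. A minimal retry sketch follows; `retry` is a hypothetical helper, and the attempt count and delay in the usage line are arbitrary choices.

```shell
#!/bin/sh
# Hypothetical helper: run a command until it succeeds or the attempts run out.
# Arguments: <attempts> <delay-seconds> <command...>
retry() {
    attempts="$1"
    delay="$2"
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        # Stop as soon as the command succeeds.
        "$@" && return 0
        # Sleep only between attempts, not after the last one.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}

# Usage: poll the test endpoint up to 30 times, 2 seconds apart.
# curl -f makes HTTP error responses count as failures.
# retry 30 2 curl -fsS http://localhost:3002/test
```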

Testing the API

Scrape a single page

See the Scrape API documentation.

shell
curl --location --request POST 'http://localhost:3002/v2/scrape' \
--header 'Content-Type: application/json' \
--data-raw '{
    "url": "https://docs.firecrawl.dev/zh/introduction",
    "formats": [
        "markdown"
    ]
}'

Crawl a website

See the Crawl API documentation.

shell
curl --location --request POST 'http://localhost:3002/v2/crawl' \
--header 'Content-Type: application/json' \
--data-raw '{
    "url": "https://docs.firecrawl.dev",
    "limit": 10,
    "scrapeOptions": {
        "formats": [
            "markdown",
            "html"
        ]
    }
}'
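
A crawl runs as a background job, so the request above is expected to return a job reference rather than the pages themselves. The sketch below rests on two assumptions to check against the Crawl API documentation: that the response JSON carries a top-level "id" field, and that job status is served at GET /v2/crawl/<id>. The helper names and the `FIRECRAWL_URL` variable are hypothetical.

```shell
#!/bin/sh
# Assumption: the crawl POST replies with JSON containing a top-level "id" field.
# Reads the response on stdin and prints the id (no JSON parser required).
crawl_id_from_response() {
    sed -n 's/.*"id"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Assumption: job status is served at GET /v2/crawl/<id>.
# FIRECRAWL_URL is a hypothetical variable; it defaults to the local instance.
crawl_status_url() {
    printf '%s/v2/crawl/%s\n' "${FIRECRAWL_URL:-http://localhost:3002}" "$1"
}

# Usage:
# id=$(curl -fsS -X POST 'http://localhost:3002/v2/crawl' \
#        -H 'Content-Type: application/json' \
#        -d '{"url": "https://docs.firecrawl.dev", "limit": 10}' | crawl_id_from_response)
# curl -fsS "$(crawl_status_url "$id")"
```

Repeating the last request until the job reports completion should return the crawled pages in the formats requested under scrapeOptions.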

FAQ

The Docker build fails and aborts with: ERROR [api go-build 4/4] RUN cd sharedLibs/go-html-to-md && go mod download && go build -o
text
 > [api go-build 4/4] RUN cd sharedLibs/go-html-to-md &&     go mod download &&     go build -o libhtml-to-markdown.so -buildmode=c-shared html-to-markdown.go:
60.42 go: github.com/PuerkitoBio/goquery@v1.10.3: Get "https://proxy.golang.org/github.com/%21puerkito%21bio/goquery/@v/v1.10.3.mod": dial tcp 142.250.204.49:443: i/o timeout
------
failed to solve: process "/bin/sh -c cd sharedLibs/go-html-to-md &&     go mod download &&     go build -o libhtml-to-markdown.so -buildmode=c-shared html-to-markdown.go" did not complete successfully: exit code: 1

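Fix: edit the Dockerfile to use mirrors inside China. The log shows go mod download timing out against proxy.golang.org, so the relevant change is setting GOPROXY to https://goproxy.cn,direct before go mod download, as done in the go-build stage of the Dockerfile above.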

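References

  1. Official documentation, self-hosting: https://docs.firecrawl.dev/zh/contributing/self-host
  2. Deploying and installing Firecrawl locally with Docker: http://www.884358.com/firecrawl/
  3. Firecrawl deep dive: batch scraping and API practice with the 37K-star open-source crawler, powering data acquisition for LLMs: https://www.aigc.bar/AI资讯文章/2025/05/20/firecrawl-deep-dive-batch-scraping-api-guide-for-llm-data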