
ArchiveBox

Self-hosted web archiving platform. ArchiveBox captures websites in multiple formats simultaneously — HTML, PDF, screenshot, WARC, media extraction, git clone — using a built-in Chromium headless browser. All archived content and the SQLite database are stored on a single large persistent volume.

Single instance only — no horizontal scaling

ArchiveBox uses SQLite as its database, and SQLite allows only one writer at a time. Running multiple replicas against the same database file can corrupt it, so keep replicaCount at 1.

Key Features

  • Multi-format archiving — HTML, PDF, screenshot, WARC, media, git clone in one pass
  • Chromium headless — full JavaScript rendering with /dev/shm tmpfs for stability
  • Three search backends — ripgrep (default), sqlite, or Sonic for full-text search
  • Access control — configurable public/private access for index, snapshots, and adding links
  • Non-root by default — runs as UID 911 out of the box
  • S3 backup — full /data directory backup (SQLite + all archived files) to S3-compatible storage
  • Persistent storage — single large PVC for all archived content and database

Installation

HTTPS repository:

helm repo add helmforge https://repo.helmforge.dev
helm repo update
helm install archivebox helmforge/archivebox -f values.yaml

OCI registry:

helm install archivebox oci://ghcr.io/helmforgedev/helm/archivebox -f values.yaml

Deployment Examples

# values.yaml — Basic ArchiveBox with Traefik ingress
archivebox:
  adminUsername: admin
  adminPassword: 'my-secure-password'

persistence:
  enabled: true
  size: 100Gi

ingress:
  enabled: true
  ingressClassName: traefik
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
  hosts:
    - host: archive.example.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: archivebox-tls
      hosts:
        - archive.example.com

# values.yaml: Private instance (no public access)
# Recommended for internet-facing deployments
archivebox:
  adminUsername: admin
  adminPassword: 'my-secure-password'
  allowedHosts: 'archive.example.com'
  publicIndex: 'False'
  publicSnapshots: 'False'
  publicAddLinks: 'False'

persistence:
  enabled: true
  size: 100Gi

ingress:
  enabled: true
  ingressClassName: traefik
  hosts:
    - host: archive.example.com
      paths:
        - path: /
          pathType: Prefix

# values.yaml: Production setup with explicit resource limits
# Chromium requires at least 2Gi RAM to archive pages reliably
archivebox:
  adminUsername: admin
  adminPassword: 'my-secure-password'
  timeout: '120'
  mediaMaxSize: '1g'

resources:
  requests:
    cpu: 500m
    memory: 2Gi
  limits:
    cpu: 2000m
    memory: 4Gi

persistence:
  enabled: true
  size: 200Gi

ingress:
  enabled: true
  ingressClassName: traefik
  hosts:
    - host: archive.example.com
      paths:
        - path: /
          pathType: Prefix

# values.yaml: Daily S3 backup of the full /data directory
# Backup includes both the SQLite database and all archived files.
archivebox:
  adminUsername: admin
  adminPassword: 'my-secure-password'

persistence:
  enabled: true
  size: 100Gi

backup:
  enabled: true
  schedule: '0 3 * * *'
  s3:
    endpoint: https://s3.amazonaws.com
    bucket: my-archivebox-backups
    accessKey: '<set-me>'
    secretKey: '<set-me>'

ingress:
  enabled: true
  ingressClassName: traefik
  hosts:
    - host: archive.example.com
      paths:
        - path: /
          pathType: Prefix

Configuration Reference

Core

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| nameOverride | string | "" | Override the chart name. |
| fullnameOverride | string | "" | Override the full release name. |
| commonLabels | object | {} | Extra labels added to all resources. |

Image

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| image.repository | string | docker.io/archivebox/archivebox | ArchiveBox container image. |
| image.tag | string | "0.7.3" | Image tag. |
| image.pullPolicy | string | IfNotPresent | Image pull policy. |
| imagePullSecrets | array | [] | Pull secrets for private registries. |

ArchiveBox Configuration

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| archivebox.port | integer | 8000 | Internal HTTP port. |
| archivebox.adminUsername | string | admin | Admin account username, created on first run. |
| archivebox.adminPassword | string | "" | Admin account password. Auto-generated if empty. |
| archivebox.existingSecret | string | "" | Existing Kubernetes Secret containing admin credentials. |
| archivebox.existingSecretUsernameKey | string | admin-username | Key in the existing secret for the admin username. |
| archivebox.existingSecretPasswordKey | string | admin-password | Key in the existing secret for the admin password. |
| archivebox.allowedHosts | string | "*" | Comma-separated allowed hostnames. Set to your domain in production. |
| archivebox.publicIndex | string | "True" | Allow unauthenticated access to the archive index page. |
| archivebox.publicSnapshots | string | "True" | Allow unauthenticated access to archived snapshots. |
| archivebox.publicAddLinks | string | "False" | Allow unauthenticated users to submit URLs for archiving. |
| archivebox.searchBackendEngine | string | ripgrep | Search backend: ripgrep (default), sqlite, or sonic. |
| archivebox.mediaMaxSize | string | "750m" | Maximum size for media downloads (e.g. 750m, 1g). |
| archivebox.timeout | string | "60" | Timeout per URL archiving job, in seconds. |
| archivebox.timezone | string | UTC | Timezone for scheduled tasks and timestamps. |
| archivebox.extraEnv | array | [] | Extra environment variables for advanced configuration. |

Restrict public access before exposing to the internet

By default, publicIndex and publicSnapshots are both "True". Anyone who reaches your ArchiveBox URL can browse your entire archive and view all captured pages. For internet-facing deployments, set both to "False" and restrict allowedHosts to your exact domain.
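
To keep the admin password out of values.yaml as well, the existingSecret parameters can point at a pre-created Secret. The following is a minimal sketch, assuming a hypothetical Secret named archivebox-admin in the release namespace; the key names are the chart defaults listed in the parameter reference, and the password is a placeholder.

```yaml
# File 1: admin-secret.yaml, applied before installing the chart.
# The name "archivebox-admin" is a placeholder chosen for this example.
apiVersion: v1
kind: Secret
metadata:
  name: archivebox-admin
type: Opaque
stringData:
  admin-username: admin
  admin-password: change-me   # placeholder, replace with a real password
---
# File 2: values.yaml fragment for a locked-down instance
archivebox:
  existingSecret: archivebox-admin
  existingSecretUsernameKey: admin-username
  existingSecretPasswordKey: admin-password
  allowedHosts: 'archive.example.com'
  publicIndex: 'False'
  publicSnapshots: 'False'
  publicAddLinks: 'False'
```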

Search backend trade-offs

  • ripgrep (default): fast grep-based full-text search, no extra dependencies, searches HTML files directly
  • sqlite: uses SQLite FTS5, no extra setup, slower on large archives
  • sonic: fastest on large archives, requires a separate Sonic server deployed alongside ArchiveBox
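
Switching to Sonic might look like the fragment below. This is a sketch under several assumptions: the Service name sonic and the password are hypothetical, archivebox.extraEnv is assumed to accept standard Kubernetes name/value entries, and the SEARCH_BACKEND_* variables are assumed to be the ArchiveBox settings that locate the Sonic server. Check all of them against the chart and the ArchiveBox version you run.

```yaml
# values.yaml fragment (illustrative, not taken from the chart's own examples)
archivebox:
  searchBackendEngine: sonic
  extraEnv:
    - name: SEARCH_BACKEND_HOST_NAME
      value: sonic          # hypothetical Service name of the Sonic deployment
    - name: SEARCH_BACKEND_PORT
      value: '1491'         # assumed Sonic channel port
    - name: SEARCH_BACKEND_PASSWORD
      value: '<set-me>'     # must match the password configured in Sonic
```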

Persistence

The PVC stores the entire /data directory: SQLite database, all archived files (HTML, PDFs, screenshots, WARCs, media), and ArchiveBox configuration. Size it generously — archived pages with media can consume gigabytes quickly.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| persistence.enabled | boolean | true | Enable a PVC for /data (database + all archived content). |
| persistence.size | string | 50Gi | PVC size. Plan for 100 GB+ for active archiving. |
| persistence.storageClass | string | "" | StorageClass for the PVC. |
| persistence.accessModes | array | ["ReadWriteOnce"] | PVC access modes. |
| persistence.existingClaim | string | "" | Use an existing PVC instead of creating one. |

NFS storage may require fsGroup configuration

ArchiveBox runs as UID/GID 911 by default (podSecurityContext.fsGroup: 911). Some NFS provisioners ignore fsGroup and may cause permission errors on the /data directory. If using NFS, configure your provisioner to support fsGroup or override podSecurityContext and securityContext accordingly.
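
If the provisioner ignores fsGroup and the NFS export has a fixed owner, one option is to run the pod as that owner. The sketch below assumes a hypothetical export owned by 1000:1000 and assumes the ArchiveBox image starts cleanly under that UID. Verify both before relying on it.

```yaml
# values.yaml fragment: match the pod identity to the NFS export owner.
# 1000:1000 is an illustrative owner, not a chart default.
podSecurityContext:
  fsGroup: 1000
securityContext:
  runAsUser: 1000
  runAsGroup: 1000
  runAsNonRoot: true
```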

Backup

The S3 backup archives the full /data directory, including the SQLite database and all archived files. It is a complete backup of the ArchiveBox dataset, unlike charts that back up only media files.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| backup.enabled | boolean | false | Enable scheduled S3 backup CronJob. |
| backup.schedule | string | "0 3 * * *" | Cron schedule for backups. |
| backup.suspend | boolean | false | Suspend the CronJob without deleting it. |
| backup.concurrencyPolicy | string | Forbid | CronJob concurrency policy. |
| backup.successfulJobsHistoryLimit | integer | 3 | Number of successful Job records to keep. |
| backup.failedJobsHistoryLimit | integer | 3 | Number of failed Job records to keep. |
| backup.backoffLimit | integer | 1 | Job retry limit. |
| backup.archivePrefix | string | archivebox | Prefix for backup archive filenames. |
| backup.images.tar | string | docker.io/library/alpine:3.22 | Image used for tar archive. |
| backup.images.uploader | string | docker.io/helmforge/mc:1.0.0 | Image used for S3 upload. |
| backup.resources | object | {} | Resources for backup containers. |
| backup.s3.endpoint | string | "" | S3-compatible endpoint URL. |
| backup.s3.bucket | string | "" | Target bucket name. |
| backup.s3.prefix | string | archivebox | Key prefix within the bucket. |
| backup.s3.createBucketIfNotExists | boolean | true | Create the bucket automatically if it does not exist. |
| backup.s3.existingSecret | string | "" | Existing secret containing S3 access and secret keys. |
| backup.s3.existingSecretAccessKeyKey | string | access-key | Key in the existing secret for the S3 access key. |
| backup.s3.existingSecretSecretKeyKey | string | secret-key | Key in the existing secret for the S3 secret key. |
| backup.s3.accessKey | string | "" | Inline S3 access key (ignored when existingSecret is set). |
| backup.s3.secretKey | string | "" | Inline S3 secret key (ignored when existingSecret is set). |
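
Because inline keys are ignored when existingSecret is set, S3 credentials can live in a Secret instead of values.yaml. The following is a sketch, assuming a hypothetical Secret named archivebox-s3; the key names are the chart defaults from the parameter reference, and the endpoint and bucket are the same illustrative values used in the deployment examples.

```yaml
# File 1: s3-secret.yaml, applied before enabling backups.
# The name "archivebox-s3" is a placeholder chosen for this example.
apiVersion: v1
kind: Secret
metadata:
  name: archivebox-s3
type: Opaque
stringData:
  access-key: '<set-me>'
  secret-key: '<set-me>'
---
# File 2: values.yaml fragment
backup:
  enabled: true
  schedule: '0 3 * * *'        # daily at 03:00
  s3:
    endpoint: https://s3.amazonaws.com
    bucket: my-archivebox-backups
    existingSecret: archivebox-s3
    existingSecretAccessKeyKey: access-key
    existingSecretSecretKeyKey: secret-key
```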

Service

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| service.type | string | ClusterIP | Kubernetes service type. |
| service.port | integer | 80 | Service port exposed to the cluster. |
| service.annotations | object | {} | Annotations for the Service. |

Ingress

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| ingress.enabled | boolean | false | Enable an Ingress resource. |
| ingress.ingressClassName | string | traefik | Ingress class name. |
| ingress.annotations | object | {} | Annotations for the Ingress (e.g. cert-manager). |
| ingress.hosts | array | [] | Ingress host and path rules. |
| ingress.tls | array | [] | TLS configuration (secret name and hosts). |

Probes

Probes use the /health/ endpoint.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| probes.startup.enabled | boolean | true | Enable startup probe. |
| probes.startup.initialDelaySeconds | integer | 15 | Startup probe initial delay. |
| probes.startup.periodSeconds | integer | 5 | Startup probe period. |
| probes.startup.timeoutSeconds | integer | 3 | Startup probe timeout. |
| probes.startup.failureThreshold | integer | 30 | Startup probe failure threshold. |
| probes.liveness.enabled | boolean | true | Enable liveness probe. |
| probes.liveness.initialDelaySeconds | integer | 0 | Liveness probe initial delay. |
| probes.liveness.periodSeconds | integer | 15 | Liveness probe period. |
| probes.liveness.timeoutSeconds | integer | 5 | Liveness probe timeout. |
| probes.liveness.failureThreshold | integer | 3 | Liveness probe failure threshold. |
| probes.readiness.enabled | boolean | true | Enable readiness probe. |
| probes.readiness.initialDelaySeconds | integer | 0 | Readiness probe initial delay. |
| probes.readiness.periodSeconds | integer | 10 | Readiness probe period. |
| probes.readiness.timeoutSeconds | integer | 5 | Readiness probe timeout. |
| probes.readiness.failureThreshold | integer | 3 | Readiness probe failure threshold. |
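
With the default startup probe settings, the container has roughly initialDelaySeconds + periodSeconds × failureThreshold = 15 + 5 × 30 = 165 seconds to answer on /health/ before Kubernetes restarts it. If the first start takes longer, for example on slow storage, the window can be widened. The value 60 below is illustrative, not a recommendation from the chart.

```yaml
# values.yaml fragment: widen the startup window
probes:
  startup:
    enabled: true
    initialDelaySeconds: 15
    periodSeconds: 5
    failureThreshold: 60   # 15 + 5 * 60 = 315 seconds before a restart
```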

Resources and Security

ArchiveBox uses Chromium internally to render pages, and Chromium needs at least 2 GB of RAM to run reliably. Without an adequate memory request and limit, the container may be OOMKilled while archiving JavaScript-heavy pages.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| resources | object | {} | CPU and memory requests and limits. Recommended: 2-4 Gi RAM. |
| podSecurityContext | object | { fsGroup: 911 } | Pod-level security context. |
| securityContext | object | { runAsUser: 911, runAsGroup: 911, runAsNonRoot: true } | Container-level security context. |

Service Account

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| serviceAccount.create | boolean | false | Create a dedicated ServiceAccount. |
| serviceAccount.name | string | "" | Override the ServiceAccount name. |
| serviceAccount.annotations | object | {} | Annotations for the ServiceAccount. |

Scheduling

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| nodeSelector | object | {} | Node selector for scheduling. |
| tolerations | array | [] | Tolerations for scheduling. |
| affinity | object | {} | Affinity rules. |
| topologySpreadConstraints | array | [] | Topology spread constraints. |
| priorityClassName | string | "" | PriorityClass for the pod. |
| terminationGracePeriodSeconds | integer | 30 | Termination grace period. |
| podLabels | object | {} | Extra labels for the pod. |
| podAnnotations | object | {} | Extra annotations for the pod. |

Extra

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| extraVolumes | array | [] | Extra volumes to attach to the pod. |
| extraVolumeMounts | array | [] | Extra volume mounts for the container. |
| extraManifests | array | [] | Extra Kubernetes manifests deployed alongside the chart. |

Common Issues

Pod OOMKilled during archiving

Archiving JavaScript-heavy or media-rich pages triggers full Chromium rendering. Without an adequate memory request and limit, the container may be OOMKilled. Set at least memory: 2Gi in resources.requests and memory: 4Gi in resources.limits, and monitor memory usage during peak archiving.

Archiving times out on slow or complex pages

The default archivebox.timeout is 60 seconds. Pages with slow external resources or heavy JavaScript may time out before the snapshot is complete. Increase archivebox.timeout to '120' or '180' for more reliable archiving of complex pages.

More Information