ArchiveBox
Self-hosted web archiving platform. ArchiveBox captures each website in multiple formats in a single pass (HTML, PDF, screenshot, WARC, media extraction, git clone) and renders pages with a built-in headless Chromium browser. All archived content and the SQLite database are stored on a single large persistent volume.
ArchiveBox uses SQLite as its database. SQLite allows only a single writer at a time. Running multiple replicas will
cause database corruption. Keep replicaCount at 1.
Key Features
- Multi-format archiving: HTML, PDF, screenshot, WARC, media, git clone in one pass
- Chromium headless: full JavaScript rendering with /dev/shm tmpfs for stability
- Three search backends: ripgrep (default), sqlite, or Sonic for full-text search
- Access control: configurable public/private access for index, snapshots, and adding links
- Non-root by default: runs as UID 911 out of the box
- S3 backup: full /data directory backup (SQLite + all archived files) to S3-compatible storage
- Persistent storage: single large PVC for all archived content and database
Installation
HTTPS repository:
helm repo add helmforge https://repo.helmforge.dev
helm repo update
helm install archivebox helmforge/archivebox -f values.yaml
OCI registry:
helm install archivebox oci://ghcr.io/helmforgedev/helm/archivebox -f values.yaml
Deployment Examples
# values.yaml: Basic ArchiveBox with Traefik ingress
archivebox:
  adminUsername: admin
  adminPassword: 'my-secure-password'
persistence:
  enabled: true
  size: 100Gi
ingress:
  enabled: true
  ingressClassName: traefik
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
  hosts:
    - host: archive.example.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: archivebox-tls
      hosts:
        - archive.example.com

# values.yaml: Private instance (no public access)
# Recommended for internet-facing deployments
archivebox:
  adminUsername: admin
  adminPassword: 'my-secure-password'
  allowedHosts: 'archive.example.com'
  publicIndex: 'False'
  publicSnapshots: 'False'
  publicAddLinks: 'False'
persistence:
  enabled: true
  size: 100Gi
ingress:
  enabled: true
  ingressClassName: traefik
  hosts:
    - host: archive.example.com
      paths:
        - path: /
          pathType: Prefix

# values.yaml: Production setup with explicit resource limits
# Chromium requires at least 2Gi RAM to archive pages reliably
archivebox:
  adminUsername: admin
  adminPassword: 'my-secure-password'
  timeout: '120'
  mediaMaxSize: '1g'
resources:
  requests:
    cpu: 500m
    memory: 2Gi
  limits:
    cpu: 2000m
    memory: 4Gi
persistence:
  enabled: true
  size: 200Gi
ingress:
  enabled: true
  ingressClassName: traefik
  hosts:
    - host: archive.example.com
      paths:
        - path: /
          pathType: Prefix

# values.yaml: Daily S3 backup of the full /data directory
# Backup includes both the SQLite database and all archived files.
archivebox:
  adminUsername: admin
  adminPassword: 'my-secure-password'
persistence:
  enabled: true
  size: 100Gi
backup:
  enabled: true
  schedule: '0 3 * * *'
  s3:
    endpoint: https://s3.amazonaws.com
    bucket: my-archivebox-backups
    accessKey: '<set-me>'
    secretKey: '<set-me>'
ingress:
  enabled: true
  ingressClassName: traefik
  hosts:
    - host: archive.example.com
      paths:
        - path: /
          pathType: Prefix

Configuration Reference
Core
| Parameter | Type | Default | Description |
|---|---|---|---|
| nameOverride | string | "" | Override the chart name. |
| fullnameOverride | string | "" | Override the full release name. |
| commonLabels | object | {} | Extra labels added to all resources. |
Image
| Parameter | Type | Default | Description |
|---|---|---|---|
| image.repository | string | docker.io/archivebox/archivebox | ArchiveBox container image. |
| image.tag | string | "0.7.3" | Image tag. |
| image.pullPolicy | string | IfNotPresent | Image pull policy. |
| imagePullSecrets | array | [] | Pull secrets for private registries. |
ArchiveBox Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| archivebox.port | integer | 8000 | Internal HTTP port. |
| archivebox.adminUsername | string | admin | Admin account username, created on first run. |
| archivebox.adminPassword | string | "" | Admin account password. Auto-generated if empty. |
| archivebox.existingSecret | string | "" | Existing Kubernetes Secret containing admin credentials. |
| archivebox.existingSecretUsernameKey | string | admin-username | Key in the existing secret for the admin username. |
| archivebox.existingSecretPasswordKey | string | admin-password | Key in the existing secret for the admin password. |
| archivebox.allowedHosts | string | "*" | Comma-separated allowed hostnames. Set to your domain in production. |
| archivebox.publicIndex | string | "True" | Allow unauthenticated access to the archive index page. |
| archivebox.publicSnapshots | string | "True" | Allow unauthenticated access to archived snapshots. |
| archivebox.publicAddLinks | string | "False" | Allow unauthenticated users to submit URLs for archiving. |
| archivebox.searchBackendEngine | string | ripgrep | Search backend: ripgrep (default), sqlite, or sonic. |
| archivebox.mediaMaxSize | string | "750m" | Maximum size for media downloads (e.g. 750m, 1g). |
| archivebox.timeout | string | "60" | Timeout per URL archiving job, in seconds. |
| archivebox.timezone | string | UTC | Timezone for scheduled tasks and timestamps. |
| archivebox.extraEnv | array | [] | Extra environment variables for advanced configuration. |
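
To keep the admin password out of values.yaml, the existingSecret parameters can point at a pre-created Secret. The following is a minimal sketch, assuming a Secret named archivebox-admin (a hypothetical name) already exists in the release namespace; the key names are the chart defaults listed in the table.

```yaml
# values.yaml: admin credentials from an existing Secret (sketch)
# Assumes a Secret "archivebox-admin" (hypothetical name) was created beforehand
# and contains the keys admin-username and admin-password.
archivebox:
  existingSecret: archivebox-admin
  existingSecretUsernameKey: admin-username   # chart default
  existingSecretPasswordKey: admin-password   # chart default
```

The table does not state how an inline adminPassword interacts with existingSecret, so it is safest not to set both.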
By default, publicIndex and publicSnapshots are both "True". Anyone who reaches your ArchiveBox URL can browse
your entire archive and view all captured pages. For internet-facing deployments, set both to "False" and restrict
allowedHosts to your exact domain.
- ripgrep (default): fast grep-based full-text search, no extra dependencies, searches HTML files directly
- sqlite: uses SQLite FTS5, no extra setup, slower on large archives
- sonic: fastest on large archives, requires a separate Sonic server deployed alongside ArchiveBox
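
Selecting a backend is a single value; Sonic additionally needs connection settings. The sketch below passes them through extraEnv, assuming extraEnv accepts standard Kubernetes env entries. The SEARCH_BACKEND_* names are ArchiveBox configuration options, while the Sonic hostname and Secret name are placeholders for a Sonic server you deploy yourself. Verify all of them against your ArchiveBox version.

```yaml
# values.yaml: Sonic search backend (sketch, connection details are assumptions)
archivebox:
  searchBackendEngine: sonic
  extraEnv:
    - name: SEARCH_BACKEND_HOST_NAME
      value: sonic.archivebox.svc.cluster.local   # hypothetical Service DNS name
    - name: SEARCH_BACKEND_PORT
      value: '1491'                               # Sonic's default port
    - name: SEARCH_BACKEND_PASSWORD
      valueFrom:
        secretKeyRef:
          name: sonic-auth                        # hypothetical Secret
          key: password
```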
Persistence
The PVC stores the entire /data directory: SQLite database, all archived files (HTML, PDFs, screenshots, WARCs,
media), and ArchiveBox configuration. Size it generously — archived pages with media can consume gigabytes quickly.
| Parameter | Type | Default | Description |
|---|---|---|---|
| persistence.enabled | boolean | true | Enable a PVC for /data (database + all archived content). |
| persistence.size | string | 50Gi | PVC size. Plan for 100GB+ for active archiving. |
| persistence.storageClass | string | "" | StorageClass for the PVC. |
| persistence.accessModes | array | ["ReadWriteOnce"] | PVC access modes. |
| persistence.existingClaim | string | "" | Use an existing PVC instead of creating one. |
ArchiveBox runs as UID/GID 911 by default (podSecurityContext.fsGroup: 911). Some NFS provisioners ignore fsGroup
and may cause permission errors on the /data directory. If using NFS, configure your provisioner to support fsGroup
or override podSecurityContext and securityContext accordingly.
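
As an illustration of the override route, the sketch below runs the pod under the UID/GID that owns the NFS export. The value 1000 is a hypothetical example, not a chart default, and whether the ArchiveBox image runs cleanly under a UID other than 911 is not confirmed here, so test this before relying on it.

```yaml
# values.yaml: match the pod identity to the NFS export owner (sketch)
# Assumes the export is owned by UID/GID 1000 (hypothetical values).
podSecurityContext:
  fsGroup: 1000
securityContext:
  runAsUser: 1000
  runAsGroup: 1000
  runAsNonRoot: true
```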
Backup
The S3 backup archives the full /data directory — including the SQLite database and all archived files. This is a
complete backup of the entire ArchiveBox dataset, unlike other charts where only media files are backed up.
| Parameter | Type | Default | Description |
|---|---|---|---|
| backup.enabled | boolean | false | Enable scheduled S3 backup CronJob. |
| backup.schedule | string | "0 3 * * *" | Cron schedule for backups. |
| backup.suspend | boolean | false | Suspend the CronJob without deleting it. |
| backup.concurrencyPolicy | string | Forbid | CronJob concurrency policy. |
| backup.successfulJobsHistoryLimit | integer | 3 | Number of successful Job records to keep. |
| backup.failedJobsHistoryLimit | integer | 3 | Number of failed Job records to keep. |
| backup.backoffLimit | integer | 1 | Job retry limit. |
| backup.archivePrefix | string | archivebox | Prefix for backup archive filenames. |
| backup.images.tar | string | docker.io/library/alpine:3.22 | Image used for tar archive. |
| backup.images.uploader | string | docker.io/helmforge/mc:1.0.0 | Image used for S3 upload. |
| backup.resources | object | {} | Resources for backup containers. |
| backup.s3.endpoint | string | "" | S3-compatible endpoint URL. |
| backup.s3.bucket | string | "" | Target bucket name. |
| backup.s3.prefix | string | archivebox | Key prefix within the bucket. |
| backup.s3.createBucketIfNotExists | boolean | true | Create the bucket automatically if it does not exist. |
| backup.s3.existingSecret | string | "" | Existing secret containing S3 access and secret keys. |
| backup.s3.existingSecretAccessKeyKey | string | access-key | Key in the existing secret for the S3 access key. |
| backup.s3.existingSecretSecretKeyKey | string | secret-key | Key in the existing secret for the S3 secret key. |
| backup.s3.accessKey | string | "" | Inline S3 access key (ignored when existingSecret is set). |
| backup.s3.secretKey | string | "" | Inline S3 secret key (ignored when existingSecret is set). |
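
Since the inline keys are ignored when existingSecret is set, S3 credentials can be kept out of values.yaml entirely. A minimal sketch, assuming a Secret named archivebox-s3 (a hypothetical name) already exists in the release namespace with the default key names from the table; the endpoint and bucket are the placeholder values used in the backup example:

```yaml
# values.yaml: S3 backup with credentials from an existing Secret (sketch)
backup:
  enabled: true
  schedule: '0 3 * * *'
  s3:
    endpoint: https://s3.amazonaws.com
    bucket: my-archivebox-backups
    prefix: archivebox
    existingSecret: archivebox-s3            # hypothetical Secret name
    existingSecretAccessKeyKey: access-key   # chart default
    existingSecretSecretKeyKey: secret-key   # chart default
```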
Service
| Parameter | Type | Default | Description |
|---|---|---|---|
| service.type | string | ClusterIP | Kubernetes service type. |
| service.port | integer | 80 | Service port exposed to the cluster. |
| service.annotations | object | {} | Annotations for the Service. |
Ingress
| Parameter | Type | Default | Description |
|---|---|---|---|
| ingress.enabled | boolean | false | Enable an Ingress resource. |
| ingress.ingressClassName | string | traefik | Ingress class name. |
| ingress.annotations | object | {} | Annotations for the Ingress (e.g. cert-manager). |
| ingress.hosts | array | [] | Ingress host and path rules. |
| ingress.tls | array | [] | TLS configuration (secret name and hosts). |
Probes
Probes use the /health/ endpoint.
| Parameter | Type | Default | Description |
|---|---|---|---|
| probes.startup.enabled | boolean | true | Enable startup probe. |
| probes.startup.initialDelaySeconds | integer | 15 | Startup probe initial delay. |
| probes.startup.periodSeconds | integer | 5 | Startup probe period. |
| probes.startup.timeoutSeconds | integer | 3 | Startup probe timeout. |
| probes.startup.failureThreshold | integer | 30 | Startup probe failure threshold. |
| probes.liveness.enabled | boolean | true | Enable liveness probe. |
| probes.liveness.initialDelaySeconds | integer | 0 | Liveness probe initial delay. |
| probes.liveness.periodSeconds | integer | 15 | Liveness probe period. |
| probes.liveness.timeoutSeconds | integer | 5 | Liveness probe timeout. |
| probes.liveness.failureThreshold | integer | 3 | Liveness probe failure threshold. |
| probes.readiness.enabled | boolean | true | Enable readiness probe. |
| probes.readiness.initialDelaySeconds | integer | 0 | Readiness probe initial delay. |
| probes.readiness.periodSeconds | integer | 10 | Readiness probe period. |
| probes.readiness.timeoutSeconds | integer | 5 | Readiness probe timeout. |
| probes.readiness.failureThreshold | integer | 3 | Readiness probe failure threshold. |
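
With the defaults, the startup probe gives the container roughly initialDelaySeconds + periodSeconds × failureThreshold = 15 + 5 × 30 = 165 seconds to answer on /health/ before it is restarted. If startup takes longer in your environment (for example on slow storage, which is an assumption about your cluster rather than something the chart documents), a sketch like the following doubles that budget by raising only the failure threshold:

```yaml
# values.yaml: longer startup budget (sketch)
probes:
  startup:
    enabled: true
    initialDelaySeconds: 15
    periodSeconds: 5
    timeoutSeconds: 3
    failureThreshold: 60   # roughly 15 + 5 * 60 = 315 seconds
```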
Resources and Security
ArchiveBox uses Chromium internally to render pages, and the Chromium process needs at least 2 GB of RAM to work reliably. If the pod has no explicit memory requests and limits, it may be scheduled onto a node without enough free memory and be OOMKilled while archiving JavaScript-heavy pages.
| Parameter | Type | Default | Description |
|---|---|---|---|
| resources | object | {} | CPU and memory requests and limits. Recommended: 2–4 Gi RAM. |
| podSecurityContext | object | { fsGroup: 911 } | Pod-level security context. |
| securityContext | object | { runAsUser: 911, runAsGroup: 911, runAsNonRoot: true } | Container-level security context. |
Service Account
| Parameter | Type | Default | Description |
|---|---|---|---|
| serviceAccount.create | boolean | false | Create a dedicated ServiceAccount. |
| serviceAccount.name | string | "" | Override the ServiceAccount name. |
| serviceAccount.annotations | object | {} | Annotations for the ServiceAccount. |
Scheduling
| Parameter | Type | Default | Description |
|---|---|---|---|
| nodeSelector | object | {} | Node selector for scheduling. |
| tolerations | array | [] | Tolerations for scheduling. |
| affinity | object | {} | Affinity rules. |
| topologySpreadConstraints | array | [] | Topology spread constraints. |
| priorityClassName | string | "" | PriorityClass for the pod. |
| terminationGracePeriodSeconds | integer | 30 | Termination grace period. |
| podLabels | object | {} | Extra labels for the pod. |
| podAnnotations | object | {} | Extra annotations for the pod. |
Extra
| Parameter | Type | Default | Description |
|---|---|---|---|
| extraVolumes | array | [] | Extra volumes to attach to the pod. |
| extraVolumeMounts | array | [] | Extra volume mounts for the container. |
| extraManifests | array | [] | Extra Kubernetes manifests deployed alongside the chart. |
Common Issues
Archiving JavaScript-heavy or media-rich pages triggers full Chromium rendering. Without explicit resource requests and limits,
the container may be OOMKilled. Set at least memory: 2Gi in resources.requests and memory: 4Gi in
resources.limits, and monitor memory usage during peak archiving.
The default archivebox.timeout is 60 seconds. Pages with slow external resources or heavy JavaScript may time out
before the snapshot is complete. Increase archivebox.timeout to 120 or 180 for more reliable archiving of complex pages.