Deduplication with Bacula using base jobs

What’s Bacula?

Bacula is an open-source set of tools (client and server) for network file-level backup and restore.

Features I like are:

  • Optional encrypted transmission and storage (at rest) of data.
  • Flexibility re backup scheduling, levels, target files, etc.
  • Keeps its catalog in a MySQL / PostgreSQL database, which makes custom reporting and monitoring possible.

Recently I’ve been working on updating our design from version 5.0.0 to version 7.0.5, and one of the features I wanted to leverage was the ability to “deduplicate” backups using base jobs.

Deduplication using Bacula Base Jobs

By using base jobs, it’s possible to avoid backing up identical files on multiple hosts. In our case, we have tens of hosts running CentOS 6, all with ~800MB of operating system files, which are identical across hosts. By using a base job, I can back up all the OS files once (the base job), and all other backups will simply refer to the base job rather than backing up the files again.

Bacula’s documentation is a little sparse on this, so I’m including a working example below.

Working example of a Bacula base job

A base job:

# This exists to define a base for CentOS 6 hosts
Job {
  Name = "base-centos6"
  JobDefs = "jd-base"
  Schedule = base
  Level = Base
  Client = <base-machine-hostname>-fd
}

A normal job which depends on the base job:

Job {
  Name = "<bacula-client-hostname>-daily"
  JobDefs = "jd-daily"
  Client = <bacula-client-hostname>-fd
  # Name of the base job to dedupe against, e.g. "base-centos6" from above
  Base = <base os backup>
  Write Bootstrap = "/var/spool/bacula/<bacula-client-hostname>.bsr"
}

A standard “daily” job definition. Note that this must be an “Accurate” job: if you leave out the “Accurate” directive, the base job will not be used.

JobDefs {
  Name = "jd-daily"
  Type = Backup
  Level = Incremental
  Client = <bacula-client-hostname>-fd
  FileSet = "fs-linux"
  Schedule = "daily"
  Storage = vchanger
  Messages = Daemon
  Pool = daily
  Priority = 10
  Max Run Time = 360 minutes
  # Required for dedupe using base jobs
  Accurate = yes
  # Required to avoid putting jobs in DB before completion (avoids cancelled jobs in the DB)
  Spool Attributes = yes
}
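
The job definitions refer to a FileSet named "fs-linux" that isn't shown here. A minimal sketch of what it could look like follows; the path and options are assumptions for illustration, not the FileSet actually in use. The BaseJob option is, as far as I understand the Bacula manual, what controls which file attributes are compared against the base job's file list.

# Hypothetical sketch of the "fs-linux" FileSet; path and options are assumptions
FileSet {
  Name = "fs-linux"
  Include {
    Options {
      # A checksum must be stored if files are compared by MD5 (the "5" below)
      Signature = MD5
      # Attributes compared against the base job's file list:
      # p=permissions, m=mtime, u=user, g=group, c=ctime, s=size, 5=MD5
      BaseJob = pmugcs5
    }
    File = /
  }
}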

A corresponding “base” job definition. It’s the same as the jd-daily definition above, except for the Level (base jobs are level Base, obviously), Schedule, and Pool fields (I like to run my base jobs roughly monthly, and keep their storage away from other jobs), plus a different Priority and a longer Max Run Time. I set Client to the Bacula client I want to use to create the base job:

JobDefs {
  Name = "jd-base"
  Type = Backup
  Level = Base
  Client = <bacula-client-hostname>-fd
  FileSet = "fs-linux"
  Schedule = "base"
  Storage = vchanger
  Messages = Daemon
  Pool = base
  Priority = 9
  Max Run Time = 600 minutes
  # Required for dedupe using base jobs
  Accurate = yes
  # Required to avoid putting jobs in DB before completion (avoids cancelled jobs in the DB)
  Spool Attributes = yes
}
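
The "base" Schedule and Pool referenced above aren't shown either. A rough sketch of a monthly schedule and a separate pool might look like the following; the run time and retention period are placeholder assumptions, so adjust them to taste. Whatever retention you pick, the base job's volumes need to outlive every backup that refers to them, otherwise those backups can't be fully restored.

# Hypothetical sketch; run time and retention are assumptions
Schedule {
  Name = "base"
  # Roughly monthly: first Sunday of each month
  Run = Level=Base 1st sun at 01:05
}

Pool {
  Name = base
  Pool Type = Backup
  Recycle = yes
  AutoPrune = yes
  # Keep base volumes at least as long as the jobs that depend on them
  Volume Retention = 6 months
}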

Sample output of a Bacula base job

You know your base job is working if your job output includes the following on starting:

14-Oct 23:08 bacula-dir JobId 21: Using Device "vchanger-0" to write.
14-Oct 23:08 bacula-dir JobId 21: Using BaseJobId(s): 1
14-Oct 23:08 bacula-dir JobId 21: Sending Accurate information to the FD.

And the following on completion:

14-Oct 23:10 bacula-fd JobId 21: Space saved with Base jobs: 1665 MB