
HOWTO Heartbeat and DRBD



Introduction

Hardware fails. Software fails. You can buy the best, most durable hardware with redundant power supplies and failover NICs, and still there is always the risk of a single point of failure as long as a service depends on hardware that exists only once, like the mainboard, CPU or RAM. To minimize the risk of a total service breakdown, a good approach is to set up an identical second system, so that if the first system fails the second system can take over its functionality and keep providing the service. This is called a High-Availability (HA) cluster. To automate checking the systems' health and committing the handoff in case of a failure, the Linux-HA project was created. Its main component is heartbeat, the software that sends "heartbeats" to the other nodes of the cluster and manages the starting and stopping of failover services on each node.

If you handle lots of data in real time, you will face another problem: data integrity. Imagine you are running a database where records are written every second and used, for example, by a web interface that shows your customers the status of their transactions in real time. During normal operation this data is written to the hard disk of the first node, but in case of a failover where the second node takes over the database service, the second node knows nothing about the data written to the first node's disk 5 seconds or even 5 minutes ago. To avoid this problem, the second node always has to have the same data as the first node. You could copy the data to the second node at the file level, but there are much more clever solutions that do the mirroring at the device level. DRBD (Distributed Replicated Block Device) is one of them and, in my opinion, very easy to configure, powerful and solid. The drawback of DRBD at the time of writing is that you can only read and write on the primary (master) node; the mirrored partition cannot be mounted on the secondary (slave) node. You have to unmount it on the primary node (node1), tell DRBD on the secondary node (node2) to become primary, and then mount the partition there to read and write the data.
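For example, a manual handover of a DRBD resource from node1 to node2 looks roughly like this (a sketch; the resource name r0 and the mount point /mnt/data are only placeholders):

Code: Manual handover of a DRBD resource (sketch)
# on node1 (the current primary)
umount /mnt/data
drbdadm secondary r0

# on node2 (the new primary)
drbdadm primary r0
mount /dev/drbd0 /mnt/data

Later in this document heartbeat will take care of these steps automatically.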

System Configuration

For this document I set up two VMware virtual machines, both running identical Gentoo installations. Each system has two network cards and two extra partitions. The primary network interface should be configured with a public static IP address and the secondary NIC should have some type of private IP address. You will also need an additional public static IP address that will be used as the "service" IP address. Everything relying on the cluster as a whole should use this IP address for its services.

File: testcluster1: /etc/conf.d/net
# External static interface.
config_eth0=( "192.168.0.101 netmask 255.255.255.0 brd 192.168.0.255" )
routes_eth0=( "default gw 192.168.0.1" )
dns_servers_eth0="4.2.2.1 4.2.2.2"
dns_domain_eth0="yourdomain.tld"

# This is the heartbeat and disk syncing interface.
config_eth1=( "10.0.0.1 netmask 255.255.255.0 brd 10.0.0.255" )


File: testcluster2: /etc/conf.d/net
# External static interface.
config_eth0=( "192.168.0.102 netmask 255.255.255.0 brd 192.168.0.255" )
routes_eth0=( "default gw 192.168.0.1" )
dns_servers_eth0="4.2.2.1 4.2.2.2"
dns_domain_eth0="yourdomain.tld"

# This is the heartbeat and disk syncing interface.
config_eth1=( "10.0.0.2 netmask 255.255.255.0 brd 10.0.0.255" )


File: both installations: /etc/hosts
# IPv4 and IPv6 localhost aliases
127.0.0.1            localhost.localdomain localhost
192.168.0.100         testcluster.yourdomain.tld testcluster
192.168.0.101         testcluster1.yourdomain.tld testcluster1
192.168.0.102         testcluster2.yourdomain.tld testcluster2

Installing and configuring DRBD

Preparing your HD for DRBD

If you want to use DRBD for mirroring, you should create an extra partition to hold the data you want to mirror to the other node (e.g. /var/lib/postgresql for PostgreSQL or /var/www for Apache). In addition to the mirrored data, DRBD needs at least 128 MB to store its meta-data.

For this example I created an additional virtual disk and put 2 partitions on it, one for Apache and the other for MySQL.

Code: Partition table for DRBD
testcluster1 / # fdisk /dev/sdb

Command (m for help): p

Disk /dev/sdb: 2147 MB, 2147483648 bytes
255 heads, 63 sectors/track, 261 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000000

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1         131     1052226   83  Linux
/dev/sdb2             132         261     1044225   83  Linux

I was used to cfdisk for partitioning my hard disks, but after I ran into problems creating partitions of exactly the same size on both nodes I started using the more precise fdisk. To specify an exact partition size, change the units to sectors by issuing the command "u" at the fdisk prompt (see the sketch below), then create the partitions as explained in the Gentoo handbook.
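A rough outline of such an fdisk session, shown only as a sketch because the exact prompts vary with the fdisk version:

Code: Creating identically sized partitions with fdisk (sketch)
testcluster1 / # fdisk /dev/sdb
# At the fdisk prompt:
#   u - switch display/entry units to sectors
#   n - create the first primary partition; note the start/end sectors (or use a size like +128M)
#   n - create the second primary partition the same way
#   p - print the table and compare it with the other node
#   w - write the partition table and exit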

Note : The size can be specified using fdisk with +128M when asked for the ending sector. KaZeR 13:53, 13 August 2007 (UTC)

Note : You can make an exact copy of the partition table by using sfdisk. Xarthisius 12:33, 16 March 2008 (UTC)

sfdisk -d /dev/sda | sfdisk /dev/sdb

Kernel Configuration

Activate the following kernel options:

Device Drivers
 -- Connector - unified userspace <-> kernelspace linker

Cryptographic API
 -- Cryptographic algorithm manager
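If you are not sure whether these options are already enabled, you can check the kernel configuration. This assumes your kernel sources live in /usr/src/linux; the symbols below are the ones these menu entries normally correspond to:

Code: Checking the kernel configuration
grep -E 'CONFIG_CONNECTOR|CONFIG_CRYPTO_MANAGER' /usr/src/linux/.config
# Both should be set to y (built in) or m (module); rebuild the kernel if they are not.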

Installing, configuring and running DRBD

Please note that you need to do the following on each cluster node. Install DRBD:

testcluster1 / # emerge -av drbd
testcluster2 / # emerge -av drbd

These are the packages that would be merged, in order:

Calculating dependencies... done!
[ebuild   R   ] sys-cluster/drbd-8.0.11  0 kB
[ebuild   R   ] sys-cluster/drbd-kernel-8.0.11  0 kB

Total: 2 package (2 reinstall), Size of downloads: 0 kB

Would you like to merge these packages? [Yes/No]     


After we've successfully installed DRBD we need to create the configuration file. When I installed, I did not get a default config file, so you may need to create your own. The following is the complete configuration. There is a lot of redundancy that would arguably belong in the common section; however, at this time the installation works and I don't really want to mess with it.

It should be noted that "testcluster1" and "testcluster2" must match the hostname of the node.

File: /etc/drbd.conf

global {
        usage-count no;
}

common {
}


#
# this need not be r#, you may use phony resource names,
# like "resource web" or "resource mail", too
#

resource "drbd0" {
        # transfer protocol to use.
        # C: write IO is reported as completed, if we know it has
        #    reached _both_ local and remote DISK.
        #    * for critical transactional data.
        # B: write IO is reported as completed, if it has reached
        #    local DISK and remote buffer cache.
        #    * for most cases.
        # A: write IO is reported as completed, if it has reached
        #    local DISK and local tcp send buffer. (see also sndbuf-size)
        #    * for high latency networks
        #
        protocol C;

        handlers {
                # what should be done in case the cluster starts up in
                # degraded mode, but knows it has inconsistent data.
                #pri-on-incon-degr "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f";

                pri-on-incon-degr "echo 'DRBD: primary requested but inconsistent!' | wall; /etc/init.d/heartbeat stop"; #"halt -f";
                pri-lost-after-sb "echo 'DRBD: primary requested but lost!' | wall; /etc/init.d/heartbeat stop"; #"halt -f";

                #pri-on-incon-degr "echo o > /proc/sysrq-trigger";
                #pri-lost-after-sb "echo o > /proc/sysrq-trigger";
                #local-io-error "echo o > /proc/sysrq-trigger";
        }

        startup {
                #The init script drbd(8) blocks the boot process until the DRBD resources are connected.  When the  cluster  manager
                #starts later, it does not see a resource with internal split-brain.  In case you want to limit the wait time, do it
                #here.  Default is 0, which means unlimited. The unit is seconds.
                wfc-timeout 0;  # 0 = wait forever (unlimited)

                # Wait for connection timeout if this node was a degraded cluster.
                # In case a degraded cluster (= cluster with only one node left)
                # is rebooted, this timeout value is used.
                #
                degr-wfc-timeout 120;    # 2 minutes.
        }

        syncer {
                rate 100M;
                # This is now expressed with "after res-name"
                #group 1;
                al-extents 257;
        }

        net {
                # TODO: Should these timeouts be relative to some heartbeat settings?
                # timeout       60;    #  6 seconds  (unit = 0.1 seconds)
                # connect-int   10;    # 10 seconds  (unit = 1 second)
                # ping-int      10;    # 10 seconds  (unit = 1 second)

                # if the connection to the peer is lost you have the choice of
                #  "reconnect"   -> Try to reconnect (AKA WFConnection state)
                #  "stand_alone" -> Do not reconnect (AKA StandAlone state)
                #  "freeze_io"   -> Try to reconnect but freeze all IO until
                #                   the connection is established again.
                # FIXME This appears to be obsolete
                #on-disconnect reconnect;

                # FIXME Experimental options
                #cram-hmac-alg "sha256";
                #shared-secret "secretPassword555";
                #after-sb-0pri discard-younger-primary;
                #after-sb-1pri consensus;
                #after-sb-2pri disconnect;
                #rr-conflict disconnect;
        }

        disk {
                # if the lower level device reports io-error you have the choice of
                #  "pass_on"  ->  Report the io-error to the upper layers.
                #                 Primary   -> report it to the mounted file system.
                #                 Secondary -> ignore it.
                #  "panic"    ->  The node leaves the cluster by doing a kernel panic.
                #  "detach"   ->  The node drops its backing storage device, and
                #                 continues in disk less mode.
                #
                on-io-error   pass_on;

                # Under  fencing  we understand preventive measures to avoid situations where both nodes are
                # primary and disconnected (AKA split brain).
                fencing dont-care;

                # In case you only want to use a fraction of the available space
                # you might use the "size" option here.
                #
                # size 10G;
        }


        on testcluster1 {
                device          /dev/drbd0;
                disk            /dev/sdb1;
                address         10.0.0.1:7788;
                meta-disk       internal;
        }

        on testcluster2 {
                device          /dev/drbd0;
                disk            /dev/sdb1;
                address         10.0.0.2:7788;
                meta-disk       internal;
        }
}

resource "drbd1" {
        # transfer protocol to use.
        # C: write IO is reported as completed, if we know it has
        #    reached _both_ local and remote DISK.
        #    * for critical transactional data.
        # B: write IO is reported as completed, if it has reached
        #    local DISK and remote buffer cache.
        #    * for most cases.
        # A: write IO is reported as completed, if it has reached
        #    local DISK and local tcp send buffer. (see also sndbuf-size)
        #    * for high latency networks
        #
        protocol C;

        handlers {
                # what should be done in case the cluster starts up in
                # degraded mode, but knows it has inconsistent data.
                #pri-on-incon-degr "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f";

                pri-on-incon-degr "echo 'DRBD: primary requested but inconsistent!' | wall; /etc/init.d/heartbeat stop"; #"halt -f";
                pri-lost-after-sb "echo 'DRBD: primary requested but lost!' | wall; /etc/init.d/heartbeat stop"; #"halt -f";

                #pri-on-incon-degr "echo o > /proc/sysrq-trigger";
                #pri-lost-after-sb "echo o > /proc/sysrq-trigger";
                #local-io-error "echo o > /proc/sysrq-trigger";
        }

        startup {
                #The init script drbd(8) blocks the boot process until the DRBD resources are connected.  When the  cluster  manager
                #starts later, it does not see a resource with internal split-brain.  In case you want to limit the wait time, do it
                #here.  Default is 0, which means unlimited. The unit is seconds.
                wfc-timeout 0;  # 0 = wait forever (unlimited)

                # Wait for connection timeout if this node was a degraded cluster.
                # In case a degraded cluster (= cluster with only one node left)
                # is rebooted, this timeout value is used.
                #
                degr-wfc-timeout 120;    # 2 minutes.
        }

        syncer {
                rate 100M;
                # This is now expressed with "after res-name"
                #group 1;
                al-extents 257;
        }

        net {
                # TODO: Should these timeouts be relative to some heartbeat settings?
                # timeout       60;    #  6 seconds  (unit = 0.1 seconds)
                # connect-int   10;    # 10 seconds  (unit = 1 second)
                # ping-int      10;    # 10 seconds  (unit = 1 second)

                # if the connection to the peer is lost you have the choice of
                #  "reconnect"   -> Try to reconnect (AKA WFConnection state)
                #  "stand_alone" -> Do not reconnect (AKA StandAlone state)
                #  "freeze_io"   -> Try to reconnect but freeze all IO until
                #                   the connection is established again.
                # FIXME This appears to be obsolete
                #on-disconnect reconnect;

                # FIXME Experimental options
                #cram-hmac-alg "sha256";
                #shared-secret "secretPassword555";
                #after-sb-0pri discard-younger-primary;
                #after-sb-1pri consensus;
                #after-sb-2pri disconnect;
                #rr-conflict disconnect;
        }

        disk {
                # if the lower level device reports io-error you have the choice of
                #  "pass_on"  ->  Report the io-error to the upper layers.
                #                 Primary   -> report it to the mounted file system.
                #                 Secondary -> ignore it.
                #  "panic"    ->  The node leaves the cluster by doing a kernel panic.
                #  "detach"   ->  The node drops its backing storage device, and
                #                 continues in disk less mode.
                #
                on-io-error   pass_on;

                # Under  fencing  we understand preventive measures to avoid situations where both nodes are
                # primary and disconnected (AKA split brain).
                fencing dont-care;

                # In case you only want to use a fraction of the available space
                # you might use the "size" option here.
                #
                # size 10G;
        }


        on testcluster1 {
                device          /dev/drbd1;
                disk            /dev/sdb2;
                address         10.0.0.1:7789;
                meta-disk       internal;
        }

        on testcluster2 {
                device          /dev/drbd1;
                disk            /dev/sdb2;
                address         10.0.0.2:7789;
                meta-disk       internal;
        }
}

Don't forget to copy this file to both nodes.
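Before going further you can let drbdadm parse the file as a quick sanity check (run it on both nodes):

Code: Checking the DRBD configuration
drbdadm dump all
# Prints the parsed configuration; syntax errors in /etc/drbd.conf are reported instead.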


Now it's time to set up DRBD. Run the following commands on both nodes.

modprobe drbd

drbdadm create-md drbd0
drbdadm attach drbd0
drbdadm connect drbd0

drbdadm create-md drbd1
drbdadm attach drbd1
drbdadm connect drbd1
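At this stage both resources should be connected, with both nodes still in the Secondary role and the data marked Inconsistent. A quick way to check this on either node:

Code: Checking the resource state
drbdadm cstate all   # connection state: should report Connected for drbd0 and drbd1
cat /proc/drbd       # shows the roles (Secondary/Secondary) and the disk states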


Now on the primary node run the following:

drbdadm -- --overwrite-data-of-peer primary drbd0
drbdadm -- --overwrite-data-of-peer primary drbd1

At this point a full synchronization should be occurring. You can monitor the progress with the following command.

testcluster1 / # watch cat /proc/drbd

version: 8.0.11 (api:86/proto:86)
GIT-hash: b3fe2bdfd3b9f7c2f923186883eb9e2a0d3a5b1b build by root@localhost, 2008-04-18 11:35:09
 0: cs:SyncSource st:Primary/Secondary ds:UpToDate/Inconsistent C r---
    ns:2176 nr:0 dw:49792 dr:2196 al:17 bm:0 lo:0 pe:5 ua:0 ap:0
       [>....................] sync'ed:  0.4% (1050136/1052152)K
       finish: 0:43:45 speed: 336 (336) K/sec
       resync: used:0/31 hits:130 misses:1 starving:0 dirty:0 changed:1
       act_log: used:0/257 hits:12431 misses:35 starving:0 dirty:18 changed:17
 1: cs:SyncSource st:Primary/Secondary ds:UpToDate/Inconsistent C r---
    ns:163304 nr:0 dw:33252 dr:130072 al:13 bm:8 lo:0 pe:5 ua:0 ap:0
       [=>..................] sync'ed: 14.6% (893640/1044156)K
       finish: 0:49:38 speed: 296 (360) K/sec
       resync: used:0/31 hits:13317 misses:16 starving:0 dirty:0 changed:16
       act_log: used:0/257 hits:8300 misses:13 starving:0 dirty:0 changed:13


Depending on your hardware and the size of the partition, this could take some time. Later, when everything is synced, the mirroring will be very fast. See DRBD-Performance for more information.


You can now use /dev/drbd0 and /dev/drbd1 as normal block devices, even before the syncing has finished. So let's go ahead and format them with whatever filesystem you want. Do this on the first node only.

mke2fs -j /dev/drbd0
mke2fs -j /dev/drbd1

Now set up the primary and secondary roles. Notice that the commands differ per node.

testcluster1 / # drbdadm primary all
testcluster2 / # drbdadm secondary all


Make sure you add the mount points to the fstab and that they are set to noauto. Again, this needs to be done on both nodes.

File: /etc/fstab
# <fs>                  <mountpoint>    <type>          <opts>          <dump/pass>

# NOTE: If your BOOT partition is ReiserFS, add the notail option to opts.
/dev/sda1               /boot           ext2            noatime         1 2
/dev/sda3               /               ext3            noatime         0 1
/dev/sda2               none            swap            sw              0 0
/dev/cdrom              /mnt/cdrom      auto            noauto,ro       0 0
#/dev/fd0               /mnt/floppy     auto            noauto          0 0
/dev/drbd0              /wwwjail/siteroot       ext3    noauto          0 0
/dev/drbd1              /wwwjail/mysql          ext3    noauto          0 0

proc                    /proc           proc            defaults        0 0

# glibc 2.2 and above expects tmpfs to be mounted at /dev/shm for
# POSIX shared memory (shm_open, shm_unlink).
# (tmpfs is a dynamically expandable/shrinkable ramdisk, and will
#  use almost no memory if not populated with files)
shm                     /dev/shm        tmpfs           nodev,nosuid,noexec     0 0

Time to create the mount points, on both nodes again.

mkdir -p /wwwjail/siteroot
mkdir -p /wwwjail/mysql

You can mount them on the first node:

mount /wwwjail/siteroot
mount /wwwjail/mysql
Note : An earlier version of this article said to mount the devices on BOTH nodes. That is apparently wrong unless you use dual-primary mode and a cluster filesystem such as GFS [1]. -David

MySQL should already be installed but we need to configure it to use the DRBD device. We do that by simply putting all the databases and logs in /wwwjail/mysql. In a production environment, you'd probably break out logs, database and index files onto different devices. Since this is an experimental system, we'll just put everything into one resource.

Make sure no bind address is set, because we need to bind to all interfaces and then limit access with iptables if need be (see the sketch after the config file). This needs to go on both nodes.

File: /etc/mysql/my.cnf
...
#bind-address                           = 127.0.0.1
...
#datadir                                        = /var/lib/mysql
datadir                                         = /wwwjail/mysql
...
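If you do want to limit access with iptables, a minimal sketch could look like the following; the allowed subnet 192.168.0.0/24 is only an example, adjust it to your network:

Code: Restricting MySQL access with iptables (sketch)
# Allow MySQL connections only from the local subnet, drop everything else.
iptables -A INPUT -p tcp --dport 3306 -s 192.168.0.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 3306 -j DROP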

Now we need to install a MySQL database onto the shared drive. Issue the following command on both nodes.

# mysql_install_db

OK, if everything has gone well to this point, you can add DRBD to the startup items on both nodes.

rc-update add drbd default
 Important:  DO NOT ADD drbd TO YOUR KERNEL'S AUTO-LOADED MODULES!  IT WILL CAUSE ISSUES WITH HEARTBEAT.

I am not sure whether you need to wait for the sync to finish at this point, but it is probably a good idea. After syncing (unless you're brave) you should be able to start DRBD normally:

testcluster1 / # /etc/init.d/drbd start
testcluster2 / # /etc/init.d/drbd start

The DRBD init script should load the drbd kernel module automatically:

testcluster1 etc # lsmod
Module                  Size  Used by
drbd                  142176  1 


You can swap the roles again to verify that the mirroring works in both directions. If you're fast enough you can issue these commands within a few seconds of each other, but you'll only find out that DRBD was faster ;-)
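A role swap for the drbd0 resource looks like this (unmount on the old primary first; swap the commands between the nodes to switch back):

Code: Swapping the DRBD roles by hand
# on testcluster1 (the current primary)
umount /wwwjail/siteroot
drbdadm secondary drbd0

# on testcluster2 (the new primary)
drbdadm primary drbd0
mount /wwwjail/siteroot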


After a reboot during testing you will have to issue

# drbdadm primary all

on the node you want the data to be served from, as DRBD does not remember the roles (to DRBD both nodes are equal). We will let heartbeat do this automatically later.


If the nodes end up with different data, you may have a split brain. To fix this, run the following commands. Note this assumes that testcluster1 is more up to date than testcluster2; if the opposite is true, swap the commands between the nodes.

testcluster1: drbdadm connect all
testcluster2: drbdadm -- --discard-my-data connect all
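If testcluster2 is still primary or still has the DRBD devices mounted, you will probably have to demote it before it accepts the discard. A sketch; the exact steps depend on the state the node is in:

Code: Demoting the out-of-date node first (sketch)
# on testcluster2, before discarding its data
umount /wwwjail/siteroot    # only if it is mounted
umount /wwwjail/mysql       # only if it is mounted
drbdadm secondary all
drbdadm -- --discard-my-data connect all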

Installing and Configuring Heartbeat

Heartbeat is based on init scripts, so setting it up to do advanced things is not that difficult, but that is not going to be covered in this document.

Again, most of the commands here need to be run on both nodes. Go ahead and emerge heartbeat.

emerge -av heartbeat

These are the packages that would be merged, in order:

Calculating dependencies... done!
[ebuild   R   ] sys-cluster/heartbeat-2.0.7-r2  USE="-doc -ldirectord -management -snmp" 0 kB


All of the heartbeat configuration is done in /etc/ha.d/. Again, most of the important config files are not included by default on install, so you will need to create them.

File: /etc/ha.d/ha.cf
# What interfaces to heartbeat over?
#udp  eth1
bcast eth1

# keepalive: how many seconds between heartbeats
keepalive 2

# Time in seconds before issuing a "late heartbeat" warning in the logs.
warntime 10

# Node is pronounced dead after 15 seconds.
deadtime 15

# With some configurations, the network takes some time to start working after a reboot.
# This is a separate "deadtime" to handle that case. It should be at least twice the normal deadtime.
initdead 30

# Mandatory. Hostname of machine in cluster as described by uname -n.
node    testcluster1
node    testcluster2


# When auto_failback is set to on once the master comes back online, it will take
# everything back from the slave.
auto_failback off

# Some default uid/gid info. This is required for ipfail.
apiauth default uid=nobody gid=cluster
apiauth ipfail uid=cluster
apiauth ping gid=nobody uid=nobody,cluster

# This is to fail over if the outbound network connection goes down.
respawn cluster /usr/lib/heartbeat/ipfail

# IP to ping to check to see if the external connection is up.
ping 192.168.0.1
deadping 15

debugfile /var/log/ha-debug

# File to write other messages to
logfile /var/log/ha-log

# Facility to use for syslog()/logger
logfacility     local0


The haresources file is probably the most important file to configure. It lists which init scripts need to be run and the parameters to pass to each script. Heartbeat looks for the scripts first in /etc/ha.d/resource.d/ and then in /etc/init.d/.

Please note that the init scripts need to follow the Linux Standard Base Core Specification, specifically regarding the function return codes.

It should also be noted that I'm not 100% sure this is the correct format for this file.

File: /etc/ha.d/haresources
testcluster1 IPaddr::192.168.0.100
testcluster1 drbddisk::drbd0 Filesystem::/dev/drbd0::/wwwjail/siteroot::ext3::noatime apache2
testcluster1 drbddisk::drbd1 Filesystem::/dev/drbd1::/wwwjail/mysql::ext3::noatime mysql

For example, IPaddr::192.168.0.100 will run the /etc/ha.d/resource.d/IPaddr script, which will create an IP alias on eth0 with the IP address 192.168.0.100.

Warning: The contents of the haresources file must be exactly the same on both nodes!

drbddisk will run drbdadm primary drbd0, and Filesystem is basically just a mount.
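You can also exercise these resource scripts by hand to see what heartbeat will do with each haresources entry. The heartbeat resource scripts take their parameters followed by start/stop/status, so the following sketch should correspond to the entries above:

Code: Testing the resource scripts by hand (sketch)
# Bring up the service IP alias, then take it down again.
/etc/ha.d/resource.d/IPaddr 192.168.0.100 start
/etc/ha.d/resource.d/IPaddr 192.168.0.100 stop

# Promote drbd0 and mount it the way heartbeat would.
/etc/ha.d/resource.d/drbddisk drbd0 start
/etc/ha.d/resource.d/Filesystem /dev/drbd0 /wwwjail/siteroot ext3 start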


The last file tells heartbeat how to authenticate the heartbeat traffic between the nodes. Because this example uses a (simulated) crossover cable to connect the two nodes, we will just use a CRC check.

File: both nodes: /etc/ha.d/authkeys
auth 1
1 crc

If you plan on sending the heartbeat across a shared network, you should use something a little stronger than crc. The following is the configuration for sha1.

File: both nodes: /etc/ha.d/authkeys
auth 1
1 sha1 MySecretPassword

Finally, because the /etc/ha.d/authkeys file may contain a plain-text password, please set restrictive permissions on both nodes.

# chown root:root /etc/ha.d/authkeys
# chmod 600 /etc/ha.d/authkeys
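With the configuration in place, the remaining step is presumably to add heartbeat to the default runlevel and start it on both nodes (not covered in detail in this document):

Code: Starting heartbeat
rc-update add heartbeat default
/etc/init.d/heartbeat start

You can then test a failover by stopping heartbeat on the node that currently holds the resources and watching them come up on the other node.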

References

http://www.drbd.org/users-guide/s-heartbeat-config.html
http://www.drbd.org/fileadmin/drbd/doc/8.0.2/en/drbd.conf.html
http://www.linux-ha.org/HeartbeatTutorials
http://www.linux-ha.org/GettingStarted
http://www.linuxjournal.com/article/5862
http://www.linuxjournal.com/article/9074
http://cvs.linux-ha.org/viewcvs/viewcvs.cgi/linux-ha/doc/GettingStarted.html?rev=1.29