Tech Blog

In this post I am going to create an active/active high-availability iSCSI cluster using CentOS 7, pacemaker, corosync, pcs/pcsd, and DRBD.  I am not going to cover what each of these components does, as I covered that in my previous article, CentOS 7 active/passive Apache Cluster - Part 1, which I recommend reading first.  I also want to mention that this setup is based on a technical publication found on the Linbit site, the makers of DRBD, at http://www.linbit.com/en/resources/technical-publications/.  There is additional information in that document, as well as in their other technical publications, that is well worth reading, so I suggest signing up and having a look.  I am putting my own spin on the instructions, and I am using the pcs shell instead of crm since that is what RHEL/CentOS 7 now uses.

The idea here is rather ingenious.  There will be two DRBD resources, each running in active/passive mode.  When everything is healthy, one will be active on one node and the second on the other.  If one system fails or needs to be taken off-line, the surviving node has the resources to take over for the other.  That also requires two VIPs and two iSCSI targets, again one per node.  We will be utilizing LVM with one volume group per DRBD resource, but you could use more, as the Linbit document does.  We need the cluster system to handle the LVM resources as well as all iSCSI chores.  I will be using an isolated LAN segment specifically for the cluster traffic.

My setup consists of the following:

- node5 - CentOS 7 virtual machine with two NICs, one with an IP of 192.168.1.170 and one with 172.16.0.1
- node6 - CentOS 7 virtual machine with two NICs, one with an IP of 192.168.1.171 and one with 172.16.0.2
- VIP1 - A floating virtual IP of 192.168.1.172
- VIP2 - A floating virtual IP of 192.168.1.173
Each machine has two 10 GB partitions (in my case /dev/xvdb and /dev/xvdc) that will be used for the clustered data drives.  Note also that the IP addresses in the 172.16.0 subnet will be used for cluster communications, while the 192.168.1 subnet is intended for client access to the iSCSI LUNs.

Once we have this all setup, we should be able to use an iSCSI initiator (i.e. a client) to connect to the two virtual IP (VIP) addresses.  When both nodes are healthy, we would ideally have each serving up a separate VIP along with a separate set of data on the two DRBD devices.  If one were to fail or need maintenance, the other would take over its services and resources.

Let's start by making sure that everything resolves as it should by running the following command on both nodes.  Note you may have to hit <ENTER> after the EOF below, and make sure to change the names/addresses as needed:

cat <<'EOF' >> /etc/hosts
192.168.1.170 node5 node5.theharrishome.lan
192.168.1.171 node6 node6.theharrishome.lan
172.16.0.1 node5-ha node5-ha.theharrishome.lan
172.16.0.2 node6-ha node6-ha.theharrishome.lan
192.168.1.172 vip1 vip1.theharrishome.lan
192.168.1.173 vip2 vip2.theharrishome.lan
EOF

Before moving forward, verify that each can ping everything by name. Here is what my /etc/hosts file now looks like on both nodes:
[screenshot: /etc/hosts on both nodes]
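
The resolution check can be scripted.  Here is a minimal sketch of a helper (my own addition, not part of the original setup); the name list is from my environment, so adjust it to yours:

```shell
# Hypothetical helper: verify that every cluster name resolves via getent.
check_names() {
  local failed=0
  for h in "$@"; do
    if getent hosts "$h" > /dev/null; then
      echo "OK   $h"
    else
      echo "FAIL $h"
      failed=1
    fi
  done
  return $failed
}

# Names from my setup; any FAIL means /etc/hosts needs fixing first.
check_names node5 node6 node5-ha node6-ha vip1 vip2 || echo "Fix /etc/hosts before continuing"
```

This only proves name resolution; follow it with actual pings to confirm reachability.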

And now to install some software on both nodes:
# yum -y install corosync pacemaker pcs ntp policycoreutils-python epel-release

And now that we have epel installed, we can install the iSCSI Target Utilities package on both nodes:
# yum -y install scsi-target-utils

We need to make sure the tgtd service is enabled and started on both nodes.  I find this a bit odd, as the cluster software usually handles the starting/stopping of services and you usually DON'T want a clustered service enabled at boot, but in this case it is necessary:
# systemctl enable tgtd; systemctl start tgtd

Let’s make sure we have accurate time on each system. We’ll configure NTP to take care of that as follows:
# systemctl enable ntpd
# ntpdate pool.ntp.org
# systemctl start ntpd

I am leaving SELinux enabled so we need to tell it that DRBD is legit with the following command on both nodes:
# semanage permissive -a drbd_t

We are also leaving the firewall (firewalld) on, so we need to allow our cluster traffic through.  We could open the ports one at a time, but the built-in high-availability service covers most of them in one shot, so let's make use of it:
# firewall-cmd --permanent --add-service=high-availability

And for iSCSI target:
# firewall-cmd --permanent --add-service=iscsi-target

Now we need to allow DRBD traffic:
# firewall-cmd --permanent --add-port=7788-7799/tcp

Now let’s reload the firewall with our new settings:
# firewall-cmd --reload
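
With the firewall reloaded, you can confirm from one node that the peer's DRBD port actually accepts connections.  This is a sketch using bash's built-in /dev/tcp (my own addition); the address and port are from my setup:

```shell
# Hypothetical connectivity check (requires bash): succeed only if host $1
# accepts TCP connections on port $2 within 2 seconds.
port_open() {
  timeout 2 bash -c "echo > /dev/tcp/$1/$2" 2> /dev/null
}

# From node5, test node6's first DRBD port:
port_open 172.16.0.2 7788 && echo "peer DRBD port reachable" || echo "peer DRBD port NOT reachable yet"
```

Note this only works once something is listening on the port, so it is most useful after DRBD is brought up later on.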

Now we need to install a repository named ELRepo so we can get DRBD.  The DRBD driver was merged into the mainline Linux kernel long ago, but the kernel shipped with RHEL/CentOS 7 does not include the module, and the userspace tools are not packaged either, so we will install both from ELRepo.  First, import the signing key:
# rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org

Then install the repo:
# rpm -Uvh http://www.elrepo.org/elrepo-release-7.0-2.el7.elrepo.noarch.rpm

And install the necessary DRBD tools:
# yum -y install drbd84-utils kmod-drbd84

We are ready to configure DRBD.  We are going to configure two DRBD resources so that each node runs one when the cluster is up and healthy.  Copy the following config file to /etc/drbd.d/r0.res, MAKING SURE to change the host names as well as the volume info to match your setup.  Note the use of IP addresses instead of names in the address lines; that is a requirement.

resource iscsivg01 {
  on node5.theharrishome.lan {
    volume 0 {
      device /dev/drbd0;
      disk /dev/xvdb;
      flexible-meta-disk internal;
    }
    address 172.16.0.1:7788;
  }
  on node6.theharrishome.lan {
    volume 0 {
      device /dev/drbd0;
      disk /dev/xvdb;
      flexible-meta-disk internal;
    }
    address 172.16.0.2:7788;
  }
}

And for the second resource, we will use /etc/drbd.d/r1.res, which contains the following.  Notice this resource uses a different port (7789) than the first one:

resource iscsivg02 {
  on node5.theharrishome.lan {
    volume 0 {
      device /dev/drbd1;
      disk /dev/xvdc;
      flexible-meta-disk internal;
    }
    address 172.16.0.1:7789;
  }
  on node6.theharrishome.lan {
    volume 0 {
      device /dev/drbd1;
      disk /dev/xvdc;
      flexible-meta-disk internal;
    }
    address 172.16.0.2:7789;
  }
}

Now let’s initialize the DRBD metadata on both resources on both nodes:
# drbdadm create-md iscsivg01
# drbdadm create-md iscsivg02

Here is what that looks like on one of mine:
[screenshot: drbdadm create-md output]

Now we need to load the DRBD kernel module on both nodes:
# modprobe drbd

To set the module to load at boot:
# echo drbd >> /etc/modules-load.d/drbd.conf
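
You can confirm the module actually loaded by checking lsmod.  Here is a hypothetical helper (my own addition) that reads lsmod-style output on stdin:

```shell
# Hypothetical helper: exit 0 if the named module appears in lsmod output.
module_loaded() {
  awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}

# Live usage on a node:
lsmod 2> /dev/null | module_loaded drbd && echo "drbd module is loaded" || echo "drbd module NOT loaded"
```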

And bring both resources up on both nodes:
# drbdadm up iscsivg01
# drbdadm up iscsivg02

Now we need to set one of the nodes as the primary for both resources. To do that from node5:
# drbdadm primary --force iscsivg01
# drbdadm primary --force iscsivg02

It should now begin to synchronize and this will probably take a while. You can always check the status as follows:
# cat /proc/drbd

Here is what mine looks like while it is synchronizing:
[screenshot: /proc/drbd during synchronization]

When the sync is finished, the disk state (ds:) will show UpToDate/UpToDate instead of UpToDate/Inconsistent as shown above.
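
If you would rather script the wait than eyeball it, here is a hypothetical helper (my own addition, not from the Linbit document) that inspects /proc/drbd-style status text:

```shell
# Hypothetical helper: reads /proc/drbd-formatted status on stdin and succeeds
# only when a ds:UpToDate/UpToDate state is present and nothing is Inconsistent.
drbd_in_sync() {
  local status
  status=$(cat)
  echo "$status" | grep -q 'ds:UpToDate/UpToDate' && \
    ! echo "$status" | grep -q 'Inconsistent'
}

# Live usage:
# drbd_in_sync < /proc/drbd && echo "sync complete"
```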

Now we need to run corosync-keygen so that the communication between nodes is encrypted. We only need to do this on one node as we will then copy the key to the other node. The command to generate the key is as follows and note also that it may take a few minutes to complete:
# corosync-keygen

You will be asked to press keys on your keyboard to generate entropy. Then wait . . .

Time to create the corosync config file at /etc/corosync/corosync.conf.  The installation comes with an example config at /etc/corosync/corosync.conf.example that you could copy and modify, but to get started, here is the one I am using.  Note that I am using an rrp_mode of passive and two rings for redundancy, with the primary ring on the addresses used specifically for cluster traffic.  Make sure to change the entries as noted to match your environment.

totem {
  version: 2
  secauth: on
  cluster_name: cluster1
  transport: udpu
  rrp_mode: passive
}
nodelist {
  node {
    ring0_addr: node5-ha.theharrishome.lan
    ring1_addr: node5.theharrishome.lan
    nodeid: 1
  }
  node {
    ring0_addr: node6-ha.theharrishome.lan
    ring1_addr: node6.theharrishome.lan
    nodeid: 2
  }
}
quorum {
  provider: corosync_votequorum
  two_node: 1
}
logging {
  to_logfile: yes
  logfile: /var/log/cluster/corosync.log
  to_syslog: yes
}

You will need to copy this to the other node along with the key generated above, which should be at /etc/corosync/authkey. The command I used from node5 was as follows:
# scp /etc/corosync/corosync.conf /etc/corosync/authkey root@node6-ha:/etc/corosync/

Now let's enable and start the pcsd service on both nodes:
# systemctl enable pcsd.service;systemctl start pcsd.service

Let's change the password for the hacluster user account that was created when pcs was installed. The standard passwd command can be used. It is a good idea to set the hacluster password the same on both nodes. So the following should do the trick:
# passwd hacluster

Now we need to authenticate pcs to both nodes using the hacluster user and the password we just set. We do this from a single node:
# pcs cluster auth node5-ha.theharrishome.lan node6-ha.theharrishome.lan

If you see any errors in the output, good places to begin looking would include the firewall and to verify pcsd is running.  Here is what mine looks like:
[screenshot: pcs cluster auth output]

At this point, we should have all we need for pcs such that any change we make via the pcs command will make the necessary changes on both nodes. So let’s set up our cluster from a single node as follows:
# pcs cluster setup --name cluster1 node5-ha.theharrishome.lan node6-ha.theharrishome.lan --force

And a view of mine:
[screenshot: pcs cluster setup output]

Now we should be able to start the cluster.  The following command, run from one node, should start all services on both nodes:
# pcs cluster start --all

Some people recommend NOT setting corosync and pacemaker to run on startup but I have always preferred they start automatically.  The following command will set them both to run at startup:
# systemctl enable corosync;systemctl enable pacemaker

Now we should be able to see the status of our new cluster with the following:
# pcs status

Here is what mine looks like:
[screenshot: pcs status output]

Notice the warning at the top that says "WARNING: no stonith devices and stonith-enabled is not false".  I have covered stonith in previous posts so for now, I will just disable it with the following command on either node:
# pcs property set stonith-enabled=false

While we are at it, we need to also set the no-quorum-policy to ignore since we only have two nodes and a failure in one will cause a loss of quorum:
# pcs property set no-quorum-policy=ignore

That should get rid of the warning.  Now we can focus on setting up the logical volumes.  As mentioned in the introduction, I am going to use one logical volume per DRBD resource, so we will have a total of two, with one active on each node when the cluster is healthy.  You could add more as you see fit.  Here are the commands I used to set mine up; note this is done from the node that shows as Primary for both DRBD resources (run cat /proc/drbd to find out which that is):
# vgcreate iscsivg01 /dev/drbd0
# vgcreate iscsivg02 /dev/drbd1
# lvcreate -l 100%FREE -n lun1 iscsivg01
# lvcreate -l 100%FREE -n lun1 iscsivg02

Let's go ahead and format our two logical volumes:
# mkfs.ext4 /dev/mapper/iscsivg01-lun1
# mkfs.ext4 /dev/mapper/iscsivg02-lun1

We need to figure out what volume groups are configured for local storage.  We can do that by issuing the following command:
# vgs --noheadings -o vg_name

Here is my output:
[screenshot: vgs output]

You can see that my local install of CentOS 7 makes use of a volume group named "centos".  The other two listed are the volume groups I created above.  We want the cluster to handle all volume groups EXCEPT the local one.  The following entry in /etc/lvm/lvm.conf takes care of that for me; note that if you had multiple local volume groups, you could add them with a comma between entries, as shown in the examples within the file:

volume_list = [ "centos" ]

Now we need to tell LVM to read physical volume signatures from the DRBD devices and not from the backing block devices that make them up.  We do that by editing /etc/lvm/lvm.conf and modifying the filter setting.  All of my backing devices start with /dev/xvd, so my entry is as follows on both nodes:
filter = [ "r|/dev/xvd.*|" ]

It is also recommended to disable the LVM write cache in /etc/lvm/lvm.conf, which is done with the following setting:
write_cache_state = 0

Now issue the following command on both nodes to ensure that locking_type is set to 1 and that use_lvmetad is set to 0 in the /etc/lvm/lvm.conf file. This command also disables and stops any lvmetad processes immediately.
# lvmconf --enable-halvm --services --startstopservices
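
Before deactivating anything, it is worth double-checking that all of the lvm.conf changes are in place on both nodes.  This hypothetical checker (my own addition) just greps for the settings discussed above; pass it a file path or let it default to /etc/lvm/lvm.conf:

```shell
# Hypothetical sanity check for the lvm.conf settings used in this setup.
check_lvm_conf() {
  local f=${1:-/etc/lvm/lvm.conf}
  grep -Eq '^[[:space:]]*volume_list[[:space:]]*=' "$f" || { echo "volume_list not set"; return 1; }
  grep -Eq '^[[:space:]]*filter[[:space:]]*=' "$f" || { echo "filter not set"; return 1; }
  grep -Eq '^[[:space:]]*write_cache_state[[:space:]]*=[[:space:]]*0' "$f" || { echo "write_cache_state not 0"; return 1; }
  grep -Eq '^[[:space:]]*locking_type[[:space:]]*=[[:space:]]*1' "$f" || { echo "locking_type not 1"; return 1; }
  grep -Eq '^[[:space:]]*use_lvmetad[[:space:]]*=[[:space:]]*0' "$f" || { echo "use_lvmetad not 0"; return 1; }
  echo "lvm.conf looks good"
}

# Live usage: check_lvm_conf
```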

Finally, let's deactivate the two LVM volume groups:
# vgchange -an iscsivg01
# vgchange -an iscsivg02

And after all that setup, we can finally begin creating our cluster resources.  I am not going into detail on each resource, as it is relatively straightforward to figure out what each does.  First, we will create the DRBD resources:
# pcs resource create p_drbd_iscsivg01 ocf:linbit:drbd drbd_resource=iscsivg01 op monitor interval=30s
# pcs resource create p_drbd_iscsivg02 ocf:linbit:drbd drbd_resource=iscsivg02 op monitor interval=30s

Now we need to clone them since they must run on both nodes with only one being the master of each:
# pcs resource master ms_drbd_iscsivg01 p_drbd_iscsivg01 master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
# pcs resource master ms_drbd_iscsivg02 p_drbd_iscsivg02 master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true

If we issue the pcs status command as shown below, we may now see our new resources in a failed state:
[screenshot: pcs status showing failed DRBD resources]

Running a cleanup on the resources with the following two commands should get everything started:
# pcs resource cleanup ms_drbd_iscsivg01
# pcs resource cleanup ms_drbd_iscsivg02

Here is what mine looks like after those two commands:
[screenshot: pcs status with DRBD resources started]

Let's create our virtual IPs next and add a resource group for each to be a member of (rg_iscsivg01 & rg_iscsivg02), making them much easier to manage later:
# pcs resource create p_ip_vip1 IPaddr2 ip=192.168.1.172 cidr_netmask=24 --group rg_iscsivg01
# pcs resource create p_ip_vip2 IPaddr2 ip=192.168.1.173 cidr_netmask=24 --group rg_iscsivg02

Next we add an LVM resource for both volume groups while also adding them to our respective resource groups:
# pcs resource create p_lvm_iscsivg01 LVM volgrpname=iscsivg01 exclusive=true --group rg_iscsivg01
# pcs resource create p_lvm_iscsivg02 LVM volgrpname=iscsivg02 exclusive=true --group rg_iscsivg02

Now we need to make sure each LVM resource is colocated on the same node as its corresponding DRBD master:
# pcs constraint colocation add rg_iscsivg01 with ms_drbd_iscsivg01 INFINITY with-rsc-role=Master
# pcs constraint colocation add rg_iscsivg02 with ms_drbd_iscsivg02 INFINITY with-rsc-role=Master 

We need to tell the cluster to create our iSCSI targets.  I am adding CHAP credentials that are unique to each target; this is not necessary, but do make sure to change them to fit your needs:
# pcs resource create p_target_iscsivg01 ocf:heartbeat:iSCSITarget iqn="iqn.2017-02.lan.theharrishome:storage.iscsivg01" tid="1" incoming_username="iscsi1" incoming_password="password" op monitor interval="10s" --group rg_iscsivg01
# pcs resource create p_target_iscsivg02 ocf:heartbeat:iSCSITarget iqn="iqn.2017-02.lan.theharrishome:storage.iscsivg02" tid="2" incoming_username="iscsi2" incoming_password="password" op monitor interval="10s" --group rg_iscsivg02

Now we can create our iSCSI LUN targets:
# pcs resource create p_lu_iscsivg01_lun1 ocf:heartbeat:iSCSILogicalUnit target_iqn="iqn.2017-02.lan.theharrishome:storage.iscsivg01" lun="1" path="/dev/iscsivg01/lun1" op monitor interval="10s" --group rg_iscsivg01
# pcs resource create p_lu_iscsivg02_lun1 ocf:heartbeat:iSCSILogicalUnit target_iqn="iqn.2017-02.lan.theharrishome:storage.iscsivg02" lun="1" path="/dev/iscsivg02/lun1" op monitor interval="10s" --group rg_iscsivg02
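
A quick aside on the IQN strings used above: they follow the iqn.YYYY-MM.reversed-domain:label convention.  If you build your own, a simple format check like this hypothetical one (my own addition) can catch typos before pcs ever sees them:

```shell
# Hypothetical IQN format check: iqn.<year>-<month>.<reversed domain>[:label]
valid_iqn() {
  echo "$1" | grep -Eq '^iqn\.[0-9]{4}-[0-9]{2}\.[a-z0-9.-]+(:[A-Za-z0-9._-]+)?$'
}

valid_iqn "iqn.2017-02.lan.theharrishome:storage.iscsivg01" && echo "IQN looks valid"
```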

Now we want to make sure that rg_iscsivg01 prefers to run on node5 and rg_iscsivg02 prefers node6:
# pcs constraint location rg_iscsivg01 prefers node5-ha.theharrishome.lan=50
# pcs constraint location rg_iscsivg02 prefers node6-ha.theharrishome.lan=50

And finally, let's make sure that each DRBD resource is promoted before the resources in its associated resource group are started:
# pcs constraint order promote ms_drbd_iscsivg01 then start rg_iscsivg01
# pcs constraint order promote ms_drbd_iscsivg02 then start rg_iscsivg02

Now if we clean up the resources with the following command, hopefully everything will be started:
# pcs resource cleanup

We could also choose to clean up only a specific resource, as follows for p_lu_iscsivg02_lun1:
# pcs resource cleanup p_lu_iscsivg02_lun1

Or even an entire resource group:
# pcs resource cleanup rg_iscsivg02

Here is what mine looks like:
[screenshot: pcs status with all resources started]

To view the constraints we put in place, use the following command:
# pcs constraint

Here is what mine looks like:
[screenshot: pcs constraint output]

It is important to realize that the resources in a group are started in the order they were added and stopped in the opposite order.  The order can be changed later if needed (google 'pcs constraint order' for info on how to do that).  It is also important to understand that all resources in a group remain colocated: every resource in rg_iscsivg01 will always run on the same machine, and likewise for rg_iscsivg02.
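
When scripting failover tests, it helps to know which node currently hosts a resource.  Here is a hypothetical parser (my own addition, assuming the CentOS 7 pcs status layout, where the hosting node is the last field of a started resource's line):

```shell
# Hypothetical helper: given `pcs status` output on stdin, print the node
# currently hosting the named resource.
node_for_resource() {
  grep -E "^[[:space:]]*$1[[:space:]]" | awk '{ print $NF }'
}

# Live usage:
# pcs status | node_for_resource p_ip_vip1
```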

If this setup were used in production, you would likely want to put some security restrictions in place.  The technical publication from Linbit also discusses that aspect of this setup.

Let's take a look at some important commands that you may need to manage your cluster.

To back up the cluster configuration:
# pcs config backup [filename]

To restore a backup (--local restores only to the current node); note that the cluster must be stopped (pcs cluster stop --all) before restoring:
# pcs config restore [--local] [filename]

To move a resource from one node to the other:
# pcs resource move rg_iscsivg01 node5-ha.theharrishome.lan

After a move, you may notice that a constraint is added that only allows the resource to run on the node it was just moved to.  To clear that:
# pcs resource clear rg_iscsivg01

A better way to move resources off a node is to put that node in standby.  This comes in handy when a node needs to be rebooted, and unlike the move command shown above, it does not add a constraint:
# pcs cluster standby node5-ha.theharrishome.lan

When you are done with your maintenance, you will need to remove it from standby as follows:
# pcs cluster unstandby node5-ha.theharrishome.lan

If you need to do any type of maintenance on the cluster nodes, you should put the cluster in maintenance mode first.  This allows for modifications without affecting the running cluster services:
# pcs property set maintenance-mode=true

After putting a cluster in maintenance mode, you should verify that the setting took effect with the following command before making any adjustments:
# pcs property show maintenance-mode

To view the properties of a resource:
# pcs resource show p_ip_vip1

To change a property of a resource:
# pcs resource update p_ip_vip1 ip=192.168.1.177

To remove a resource (p_lu_iscsivg02_lun1):
# pcs resource delete p_lu_iscsivg02_lun1

As mentioned in my previous cluster post, the pcsd daemon has a nice web interface that runs on port 2224, which you can use if you allow that port through the firewall.  Just point your web browser at one of your nodes, such as https://192.168.1.170:2224, and you will be prompted for a user/pass.  Use the hacluster user and whatever password you set for that user.  You then need to add the nodes themselves to the interface before you can manage them, as they are not added automatically.

The pcs shell used to manage the cluster has many more features than I am touching on here.  One of the best resources can be found on the Red Hat site at https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/High_Availability_Add-On_Reference/ch-pcscommand-HAAR.html.  If you are going to set up and manage such a cluster, it is well worth your time to dig into it much deeper.

In the next post, I am going to use this same setup to enlarge the size of the DRBD storage while it is online so check back for that.  Thanks for reading!

- Kyle H.