Juniper ScreenOS : Active/Passive clustering
Introduction
In this blog post, I’ll show the easy steps to set up a ScreenOS-based active/passive cluster. I’m not going to discuss the configuration of active/active clusters because, in my opinion, that configuration is only needed in rare circumstances and may introduce some weird behaviour. Furthermore, active/passive clusters have been working quite well for me.
These are the main requirements to set up a cluster :
- The 2 devices need to be the same model
- The 2 devices need to run exactly the same screenOS version (Or you’ll get “configuration out of sync” messages because the checksums will fail)
- The 2 devices need to be connected to each other : you need at least one free interface on each device to interconnect the devices (the HA link). Use the same interface number on both devices.
- It makes no real sense to build a cluster if your switches are not redundant as well. Having a cluster on one switch will bring some redundancy, but the switch becomes a single point of failure. Just something to keep in mind.
- License : SSG5 devices require an additional license. The SSG140 and other models have the NSRP license included (check the documentation !). Run “get lic | incl NSRP”. If the output states “NSRP: Active/Active”, you can set up either an Active/Active or an Active/Passive cluster.
In addition to this, you’ll need additional IP addresses because you will need to set separate management IP addresses on both devices. This will not only allow you to connect to each device separately, but it is also a requirement for track-ip (when used) and for the cluster to operate properly. These management IP’s are not replicated between devices. Login banners are not replicated either.
In fact, I usually put separate management IP’s on every interface that has management enabled. The interface IP will be the same on each cluster member so you’ll only need one IP per interface.
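As an illustration, setting a separate management IP on an interface is a one-liner on each cluster member. The interface name and the addresses below are just examples; pick addresses in the subnet of the interface IP :

On the first device : set interface ethernet0/0 manage-ip 192.168.1.2
On the second device : set interface ethernet0/0 manage-ip 192.168.1.3

The interface IP itself stays identical on both members; only the manage-ip differs per device.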
The procedures below are based on screenOS 6.2, but it should work with earlier versions as well.
Terminology
Before looking at the configuration steps, I need to explain some cluster terminology :
- NSRP : Netscreen Redundancy Protocol : This is the protocol used by Netscreen to set up and operate a cluster
- VSD : Virtual Security Device : this is the logical representation of a firewall device. If you set up a cluster, both devices run the same VSD. This means that they have the same configuration. In a VSD, only one device actively runs the VSD (=”Master”) (and the other one is the backup). To end users, traffic always uses the VSD, not one or another physical device.
- VSI : Virtual Security Interface : a VSI is a virtual interface bound to a VSD. Its IP and MAC addresses follow the master role, which is what allows the active VSD to move from one physical device to another.
- VSD Group : the pair of security devices that share the same VSD.
- VSD States : A VSD can be in any of the following states :
- Master : this refers to the active node, the device that processes traffic sent to the VSI
- Backup : this refers to the passive node. This device monitors the state of the master and takes over when the master fails
- Initial : state of a VSD group member when it is being joined to the VSD group
- Ineligible : state assigned by an admin so the device cannot participate in the election process (more info about this process later)
- Inoperable : state of a member that cannot be joined to the VSD group because of an internal problem or a network connection problem.
- Priority & Election : Upon initial NSRP configuration, the VSD group member that has the lowest (closest to zero) priority number will become the master device. If two devices have the same priority, the device that has the lowest mac address will win.
- Preempt : By default, the master/backup election is based purely on priorities, but there may be situations where you want to control the election. Suppose the master device breaks down and the backup takes over. You replace the broken unit with a new device (which happens to have a lower MAC address), configured with the same VSD information and the same priority. When you connect this new device to the cluster, it wins the election (same priority, lower MAC) and becomes master even though it does not have the configuration yet. It then pushes its (empty) configuration to the other device and the entire cluster is broken. You can avoid this scenario by setting different priorities, or by enabling preempt mode on the node that holds the full configuration. With preempt enabled, that node will become the master regardless of the MAC address tiebreaker. So in the scenario above, I would enable preempt on the surviving node (the former backup, now active) before reconnecting the new node to the cluster.
- The preempt holddown parameter specifies how long a device will wait for another device with higher priority to assume the master role before it takes over. The default is 3 seconds.
- Failover : when the master device goes down, the backup device will send ARP messages from each VSI to inform the connected switches (Layer 2) about the new location of the VSI MAC address. Because these ARP messages are not required for address resolution, they are called “gratuitous ARP” messages. You can control how many ARP packets are sent upon failover.
- RTO : Run-Time objects : objects created dynamically in memory (such as session table entries, ARP cache, DHCP leases, IPSec SA’s). When a failover occurs, the new master needs to maintain these RTO’s to allow a smooth failover. By default, NSRP cluster members only synchronize sessions. You can sync the other RTO’s as well, but you should only enable RTO sync when the configurations are in sync first.
- HA link : this connection is used for the heartbeat and to synchronize configs, RTOs, … between the members of the cluster. You can interconnect the devices with a crossover cable or with a switch between the 2 devices. The devices need to be in the same layer 2 network (so a HA link cannot cross a router). It’s recommended to always use a secondary HA interface (which can be a regular firewall interface that is used for traffic). In normal operations, only the primary HA link is used for the heartbeat and to synchronize RTOs and config. The secondary is only used for the heartbeat (unless the primary HA link goes down)
- HA link protection : you can enable encryption and authentication on the HA link.
When you want to build an A/P cluster, you need a single VSD. With older versions of screenOS, you could not synchronise dynamic routing between the devices, so you needed to set up a VSD-Less cluster. With screenOS 6 and up, this is no longer needed. Routing entries (both static and dynamic) can be synchronized as well. The static routes will simply be put in the RIB as static routes, the dynamic routes will be marked with a trailing ‘B’ (Backup route). So OSPF routes will be displayed as “OB”, iBGP will be displayed as “iBB”, eBGP will be displayed as “eBB” and so on
Configuring NSRP Active/Passive
Before looking at the configuration steps, it’s important to know that you can convert a fully working firewall into a cluster without any downtime. You don’t need to reboot. You just have to make sure all interfaces on both devices are used for the same zone/link/…
First, pick the interface on both devices to be used as HA link. Let’s say you want to use eth0/6 for HA. So on both cluster devices, put this interface in NSRP mode :
set nsrp interface eth0/6
(on some devices, you need to put the interface in the HA zone instead : set int e0/6 zone HA)
Set up master device
Next, create the cluster on the first device. We will create a cluster id 1, name it “MyCluster1” and set some cluster parameters.
set nsrp cluster id 1
set nsrp cluster name MyCluster1
set nsrp arp 5
set nsrp auth password MyAuthPassword
set nsrp encrypt password MyEncryptionKey
When you enter the “set nsrp cluster id 1”, you will get the message “Unit becomes master of NSRP vsd-group 0”. The prompt now indicates that the device is master (M)
As soon as you enter the “set nsrp cluster name” command, the command prompt will also indicate the cluster name. On the master device, the prompt will look like this :
MyCluster1:hostname(M)->
On a backup device, you’ll see MyCluster1:hostname(B)->
The arp, auth and encrypt statements are optional. The “arp” statement refers to the number of gratuitous arp messages that need to be sent upon failover. The default is 4.
In the current setup, the device can failover when the other device goes down. If you want devices to failover when interfaces go down, you need to set interface monitoring :
set nsrp monitor interface eth0/1
This is optional and is only required if you want to do interface based failover. Keep in mind that not just the interface will failover. The entire device will failover.
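Besides interface monitoring, ScreenOS can also monitor an entire zone or track remote IP addresses to trigger a failover. The commands below are a sketch (the zone name and the tracked address are examples, and you should verify the exact syntax on your ScreenOS version) :

set nsrp monitor zone Untrust
set nsrp track-ip ip 10.0.0.1
set nsrp track-ip

Remember that track-ip requires the separate management IPs mentioned earlier, because each member probes the tracked address from its own manage-ip.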
Now it’s time to set some VSD specific settings (priority, preempt and preempt holddown). I usually configure the master device with a priority of 50 and enable preempt :
set nsrp vsd id 0 priority 50
set nsrp vsd id 0 preempt
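If the default 3 second preempt hold-down is too aggressive for your environment (e.g. you want routing to converge first), you can tune the timer. The value of 10 seconds below is just an example; check the command reference for your ScreenOS version :

set nsrp vsd id 0 preempt hold-down 10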
Then, enable RTO sync and enable route sync
set nsrp rto-mirror sync
set nsrp rto-mirror route
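If the configurations are already in sync, you can also trigger a one-time manual RTO synchronization from the backup device. To the best of my knowledge the command is (verify against your ScreenOS version) :

exec nsrp sync rto all from peer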
Define a secondary interface.
set nsrp secondary-path ethernet0/4
Set up backup device
On the backup device, the configuration is pretty much the same, except for the priority and preempt :
set nsrp cluster id 1
set nsrp cluster name MyCluster1
set nsrp arp 5
set nsrp auth password MyAuthPassword
set nsrp encrypt password MyEncryptionKey
set nsrp monitor interface eth0/1
set nsrp vsd id 0 priority 100
set nsrp rto-mirror sync
set nsrp rto-mirror route
set nsrp secondary-path ethernet0/4
=> on the backup node, I have set the priority to 100 (higher than the master) and I do not enable preempt. This makes sure that, if the master goes down (e.g. for maintenance) and comes back online, it will become the master node again.
When the cluster devices are configured, they will start synchronizing information. You can check if the configurations are in sync by running :
MyCluster1:fw01(M)-> exec nsrp sync global-config check-sum
configuration in sync
If the cluster is not yet fully in sync, you should force a sync by running (on the backup device !):
MyCluster1:fw02(B)-> exec nsrp sync global-config save
load peer system config to save
Save global configuration successfully.
Save local configuration successfully.
done.
Please reset your box to let cluster configuration take effect!
System change state to Active(1)
configuration in sync (local checksum 12345678 == remote checksum 12345678)
Received all run-time-object from peer.
After the reboot of the passive (backup) device, the cluster is fully operational. Note : when the device prompts you to save the config, enter “n” (no)
Even if you have created a cluster on an existing device and added a second (new, empty) device to the cluster, you only have to reboot the passive node (the active node always stays online)
When both cluster members are synced (i.e. the configurations match), you should enable config sync.
set nsrp config sync
Verify cluster status
You can get some nsrp information with the following commands :
“get nsrp” : shows information about the cluster and cluster nodes, the vsd group, etc :
MyCluster1:fw01(M)-> get nsrp
nsrp version: 2.0
cluster info:
cluster id: 1, name: MyCluster1
local unit id: 8992891
active units discovered:
index: 0, unit id: 8992891, ctrl mac: 00222386308b, data mac: ffffffffffff
index: 1, unit id: 413691, ctrl mac: 0024ac04580a, data mac: ffffffffffff
total number of units: 2
VSD group info:
init hold time: 5
heartbeat lost threshold: 3
heartbeat interval: 1000(ms)
master always exist: disabled
group priority preempt holddown inelig master PB other members
    0       50     yes        3     no myself           413691
total number of vsd groups: 1
Total iteration=4632604,time=88222673,max=29752,min=87,average=19
RTO mirror info:
run time object sync: enabled
route synchronization: enabled
ping session sync: enabled
coldstart sync done
nsrp data packet forwarding is enabled
nsrp link info:
control channel: ethernet0/6 (ifnum: 10) mac: 00222386308b state: up
ha data link not available
secondary path channel: ethernet0/4 (ifnum: 8) mac: 00222386308d state: up
NSRP encryption: enabled
NSRP authentication: enabled
device based nsrp monitoring threshold: 255, weighted sum: 0, not failed
device based nsrp monitor interface: ethernet0/1
device based nsrp monitor zone:
device based nsrp track ip: (weight: 255, disabled)
number of gratuitous arps: 5
config sync: enabled
track ip: disabled
If you want to see the differences between the 2 nodes, run
MyCluster1:fw01(M)-> exec nsrp sync global diff
MyCluster1:fw01(M)-> rcv_sys_config_diff: get local config sucess
Local have 0 different cmd lines:
Peer have 0 different cmd lines:
Test failover
You can test if the cluster works by turning off the master device and waiting a couple of seconds (typically 1 or 2) for the backup to become master and process all traffic.
Alternatively, you can perform a manual failover using the following command :
On the master device :
- If preempt is enabled, run :
exec nsrp vsd-group 0 mode ineligible
- If preempt is not enabled, run :
exec nsrp vsd-group 0 mode backup
These commands will force the primary (master) device to step down. The other device will become master right away.
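To return the stepped-down device to normal operation after the test, put the VSD group member back into the election. To the best of my knowledge, setting the mode back to backup re-enters the device in the election, after which priority and preempt decide who becomes master :

exec nsrp vsd-group 0 mode backup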
You can verify which one of the devices is master by performing the routines explained at http://kb.juniper.net/KB11199
Impact of cluster on certificates and snmp
Cluster members can have different hostnames. If digital certificates/snmp settings are configured with individual hostnames, then communication may break upon a failover. It’s better to set a cluster name for all members and to use this VSD identity for snmp and digital certificates.
You can set a cluster name with the following command :
set nsrp cluster name MyGlobalClusterName
For snmp, use the cluster name :
set snmp name MyGlobalClusterName
Also, it is important to install/sync all PKI related components on both devices before installing the cluster, or you may get “config out of sync” messages (exec nsrp sync pki …).
See Concepts&Examples PDF [27 MB], page 1864
Check http://kb.juniper.net/KB11326 for more reasons why configurations can get out of sync
© 2009, Peter Van Eeckhoutte (corelanc0d3r). All rights reserved.