What You Should Know About Assessing OS400 High Availability Software
What to Look For In HA Software
There are several prominent OS400 High Availability (HA) software solutions offered to replicate and protect your critical data. Some are better designed than others, and choosing a poorly designed one can lead to expensive implementation, unprotected data, costly downtime, and no upgrade path to newer and better technologies.
Several of the older solutions were the leaders in their day. However, their proprietary designs cannot fully leverage IBM's OS400 enhancements, particularly the improvements released in V5. Further, because of their archaic designs, their frameworks cannot be upgraded short of a massive re-write, and that will not happen because it would be more economical for the developers to start from scratch.
Several newer solutions do take advantage of IBM's remote journaling, which IBM first introduced in V4R2 and significantly enhanced in V5R1. But in the rush to get their products to market, the designers did not fully anticipate all aspects of providing a streamlined deployment, data integrity, minimal latency, ease-of-use features, and accommodation of future technology upgrades.
The following considerations will help you cut through the sales pitches and zero-in on the key design elements that would make an OS400 HA solution right for you.
When evaluating High Availability solutions, be sure to look "under the covers." Here's what to look for:
Factors that Contribute to Latency
The primary objective of OS400 HA software is high-speed replication of production data to the target system, so that the target system's copy of the data stays as close to real time as possible. In the event of a production system failure requiring a role swap, you are back up and running without delay or data loss.
Latency is the delay in getting changed objects from the production machine to the target machine. Many HA solutions have design flaws that increase latency; unfortunately, many users experience delays ranging from 30 minutes to several days.
Here are some factors that contribute to latency:
- Proprietary designs that leverage the system audit journal. With this approach, the HA software works "above" OS400, adding cycles and complexity to achieve what IBM's integrated remote journaling accomplishes "within" OS400 while saving precious CPU cycles.
- Journal time stamps. As each changed object is recorded in the system audit journal it is time stamped, and this takes time.
- Proprietary harvest routines. Afterwards, the proprietary software "harvests" the time-stamped changed objects - another time-consuming process - and transmits them to the target server.
- Proprietary apply routines. On the target system, the proprietary software receives, analyzes, sequences, and applies the changed objects - all time-consuming extra steps that the remote journaling process built into OS400 performs automatically.
- Thread contention. This occurs when there are huge volumes of changed objects to be applied on the target system; threads are discussed in detail below.
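The latency argument above can be sketched as a simple back-of-the-envelope model. The Python below (used purely for illustration; real OS400 HA software is not written in Python) sums hypothetical per-stage delays for one changed object under each design. All stage names and millisecond values are invented for the example; real figures vary widely by system and workload.

```python
# Hypothetical per-object replication pipelines. Stage timings (ms) are
# invented for illustration only -- real values depend on the system.

AUDIT_JOURNAL_PIPELINE = {
    "audit journal write + time stamp": 5,
    "proprietary harvest": 20,
    "transmit to target": 10,
    "proprietary receive/analyze/sequence/apply": 25,
}

REMOTE_JOURNAL_PIPELINE = {
    "remote journal transmit (within OS400)": 10,
    "apply on target": 8,
}

def total_latency_ms(pipeline):
    """Sum the per-stage delays for one changed object."""
    return sum(pipeline.values())

if __name__ == "__main__":
    print("Audit-journal approach :", total_latency_ms(AUDIT_JOURNAL_PIPELINE), "ms/object")
    print("Remote-journal approach:", total_latency_ms(REMOTE_JOURNAL_PIPELINE), "ms/object")
```

The point of the model is not the specific numbers but the structure: the audit-journal design adds whole pipeline stages (harvest, analyze, sequence) that the integrated remote-journal path simply does not have.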
Misuse of the System Audit Journal
The system audit journal is used to monitor object activity and log security events.
IBM documentation says: "The auditing journal, QSYS/QAUDJRN, is intended solely for security auditing. Journal entries can include:
- Identification of the job and user and the time of access
- Before-and-after images of all object changes
- Records of when the object was opened, closed, changed, and saved."
Even though many AS400 users believe they have turned off journaling to avoid performance degradation, many systems do in fact run the audit journal - either because it was enabled during an initial setup (intentionally, or simply forgotten) or because their application software uses it as a standard procedure to monitor data integrity and security breaches.
What does the audit journal have to do with High Availability? Well, several HA solutions use the audit journal to identify ALL changed objects to send from the production system to the target system.
There are three fundamental design flaws with this approach.
First, when HA software uses the audit journal, it generates a huge volume of redundant data that must be stored on the production machine. This requires more DASD, which is costly and can degrade the overall performance of OS400 data management.
Second, if you also use the audit journal as it is intended to be used (to monitor security and data integrity), what would otherwise be one or two short weekly monitoring reports instead become massive printouts whose critical issues get lost in the clutter of changed-object entries bound for the target machine.
Third, if your use of the audit journal is part of your application software (a common practice with many business applications, both custom and packaged), the HA implementation team must make program alterations to adapt this new use of the audit journal for HA as well as to ensure that the audit journal continues to operate correctly within your application. Here are two big questions:
- Do you know how much this will add to your HA implementation cost?
- Do you really want someone unfamiliar with your applications to be changing them?
Threads vs. Multiple Apply Groups
A thread is short for a thread of execution. Threads are one way for a program to run a task; each thread executes its code independently of the other threads in the program.
Threads are distinguished from traditional multitasking processes (the multiple apply groups designed for high-performance applies are processes) in the following ways. Unlike threads, processes:
- are typically independent;
- carry considerable "state" information;
- have separate address spaces;
- interact only through system-provided mechanisms.
What has this got to do with HA software?
Some HA applications create a single thread to apply each changed object to the target system. Quite simply, the target system is idle until a changed object is detected. At that time, an apply thread is created. The thread goes away after the apply is completed, returning the system to an idle state until the next changed object is detected.
Does it sound good? Maybe, but let's take a closer look.
With the thread design, when there are lots of changed objects, lots of threads get created, one after another. If too many threads are created, there may be insufficient CPU resource to service them. This creates thread contention and a potential performance bottleneck, which can slow the apply process and leave your target system running 30 minutes to several days behind changes that occurred on the production system. Thread contention of this kind can severely undermine the benefits of HA.
By contrast, an HA system with multiple apply groups (processes) is designed to accommodate high-performance applies of changed objects to the target system. In fact, a fundamental design feature uses multiple applies, and additional apply groups can be "turned on" when the volume of changed objects increases dramatically. It is this design point that allows some HA software to keep real-time pace with as many as 1,000,000,000 (that's 1 billion) changed objects in a day!
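The contrast between a serial apply stream and multiple apply groups can be shown with an idealized throughput model. This Python sketch (illustrative only; function and parameter names are hypothetical) computes the best-case wall-clock time to drain a burst of changed objects, ignoring thread-creation overhead and CPU contention - which means the real gap favors multiple apply groups even more than the model suggests.

```python
import math

def apply_wall_time(num_objects, apply_ms_per_object, apply_groups=1):
    """Best-case wall-clock time (ms) to drain a burst of changed objects.

    apply_groups=1 models the one-thread-per-object design, which works
    through the backlog serially; more apply groups drain the backlog in
    parallel. Scheduling overhead and contention are deliberately ignored,
    so this is an upper bound on how well the serial design can do.
    """
    objects_per_group = math.ceil(num_objects / apply_groups)
    return objects_per_group * apply_ms_per_object
```

For example, a burst of 1,000,000 changed objects at 1 ms each takes about 1,000,000 ms (roughly 17 minutes) serially, but about 125,000 ms (roughly 2 minutes) with 8 apply groups - and turning on additional apply groups shrinks the backlog further.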
Before-and-After Image Comparison and Apply vs. Just After-Image Apply
Many HA solutions only transmit the after-image, or changed object, to the target system.
Why is that significant?
Well, with only the after-image, you cannot confirm that the data is in sequence and uncorrupted. This approach creates serious data integrity issues.
A superior design involves the "before" and "after" images. This allows the target system to perform data integrity checks to confirm the changed object is correct and in the proper sequence. This is the type of design you need to maintain high standards for data integrity.
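The before-and-after check can be sketched in a few lines. In this illustrative Python fragment (all names are hypothetical; a real HA product implements this inside its apply process), a change is applied only if the target's current value matches the transmitted before-image - exactly the check an after-image-only design cannot perform.

```python
def apply_change(target_db, key, before_image, after_image):
    """Apply one replicated change, verifying the before-image first.

    If the target's current value does not match the transmitted
    before-image, the change is out of sequence or the data is corrupted,
    so the apply is rejected rather than silently overwriting the record.
    (Illustrative sketch; target_db stands in for the target database.)
    """
    current = target_db.get(key)
    if current != before_image:
        raise ValueError(
            f"integrity check failed for {key!r}: "
            f"expected before-image {before_image!r}, found {current!r}")
    target_db[key] = after_image
```

Applied in the correct order, each change's before-image matches the target's current state; a change arriving out of sequence fails the comparison and is flagged instead of corrupting the target copy.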
Moral: Selecting the wrong HA application can set you back in more ways than one.
Self-Healing with NO "Reason" Code
Self-healing software, or autonomics, is an increasingly popular concept in software design.
Sure, it sounds great if a program automatically fixes itself for you. Think of all the time it saves, right?
Not necessarily. If the software self-heals, it is most likely fixing just a symptom of what may be an even bigger problem. And without a "reason" code, it is impossible to diagnose the problem properly. If you do not address the problem quickly and effectively, additional, more severe problems could develop later.
Clearly, there is a better way.
First, a superior design makes self-healing an option you can turn on or off. Many IT shops are staffed so a key person can immediately respond to a problem as soon as it occurs. Then it can be documented, analyzed, and fixed - the symptom AND the cause.
Second, a superior design provides for a "reason" code. This means that when self-healing is turned on, a problem gets logged with the reason why the problem occurred; only then does the problem get self-healed. In this way, IT can review the problem at a later date. It has a log of the problem and a reason code, so it knows where to look to fix the cause as well as the symptom. This is vital since many undocumented problems can never be re-created, let alone detected, without a log.
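The two design points above - an on/off switch and log-before-heal with a reason code - can be sketched as follows. This Python fragment is purely illustrative (the function names, reason codes, and log structure are all hypothetical); the key idea is that the problem and its reason code are recorded before the symptom is repaired.

```python
def repair(problem):
    """Placeholder for the routine that fixes the symptom (hypothetical)."""

event_log = []

def self_heal(problem, reason_code, auto_heal_enabled=True):
    """Log the problem with its reason code, then optionally heal it.

    Logging happens BEFORE the repair, so the cause can still be
    diagnosed later even after the symptom has disappeared. With
    auto_heal_enabled=False, the problem is only logged and left
    for an operator to investigate.
    """
    event_log.append({"problem": problem, "reason": reason_code})
    if not auto_heal_enabled:
        return "logged-only"
    repair(problem)
    return "healed"
```

Either way, IT retains a log entry with a reason code, so both the symptom and the underlying cause can be tracked down - even for problems that can never be re-created.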
Self-healing software that is not optional (i.e., that cannot be turned on or off) and that does not include a reason code in its diagnosis increases the risks to your system and creates data integrity issues.
Conclusion: Selecting the wrong HA application can adversely affect you in more ways than one.
HA Application Design
Take a close look at the HA application design.
- Is it designed for rapid deployment or to maximize the vendor's billable services?
Some designs are so straightforward you can download them from the Web and easily adapt them to your HA environment by using GUI tools. These tools provide simple and consistent point-and-click and drop-and-drag functionality on both the production and the target servers. In fact, such a setup may take hours instead of days or even weeks.
In contrast, some HA applications require a technician onsite for a week up to several weeks. Why? Because their design, at best, is more like "cut-and-paste" templates that MUST be uniquely tailored to each user's environment. Further, if the design uses the system audit journal, you can bet the needed custom programming to make the HA solution work will take longer - and cost more. So ask:
- How long does it take to install your software?
- Can you explain in some detail what steps are involved?
- May I talk to some of your customers (to ask what the install experience was really like)?
Many HA designs did not adequately anticipate changes to IBM's OS400. IBM consistently introduces new enhancements to its operating system. How are these changes handled? (Remember, traditional HA providers still cannot natively leverage IBM's remote journaling without a massive rewrite and a major field upgrade that would be far too costly and painful for users to tolerate.)
Many HA designs do not anticipate how to handle changes to their own design. Most provide a GUI only through "bolt-on" screen scrapers instead of natively incorporating a GUI design. Most require version upgrades that involve programming changes instead of straightforward enhancements you can simply apply at your convenience. This means you can get trapped in an old HA version.
- How easy is the HA system administration?
Every HA solution requires monitoring and support. The question is, do you spend a few minutes or upwards of 30 hours a week to make sure everything is running properly?
Here again, design can make a huge difference. Requiring as little as 1-15 minutes per day, a few HA solutions are designed with a straightforward dashboard with GUI icons that provide simple status measurements - green means "OK", yellow means "Notice", and red means "Attention Now" - with easy drill down to the element that needs attention. In contrast, other HA systems require a trained individual to access many unique menu items and views to get a handle on the big picture, and then access a different set to find out what the problem might be. This laborious process can take upwards of 30 hours per week.
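The green/yellow/red dashboard idea can be sketched with a trivial status mapping. In this illustrative Python fragment (thresholds and the backlog metric are invented for the example; a real dashboard rolls up many measurements into the same three colors), the apply backlog on the target system drives the icon color.

```python
def status_color(apply_backlog_minutes):
    """Map the target system's apply backlog to a dashboard color.

    Thresholds are hypothetical; a real HA dashboard would roll up
    communications status, journal status, and job health the same way,
    with drill-down from any non-green icon to the element at fault.
    """
    if apply_backlog_minutes < 1:
        return "green"    # OK
    if apply_backlog_minutes < 15:
        return "yellow"   # Notice
    return "red"          # Attention Now
```

A dashboard built on simple roll-ups like this is what makes one-glance monitoring possible, versus paging through dozens of menu views to assemble the same picture by hand.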