CA7 Server for AAI Disaster Recovery Planning

As the use of AAI increases and becomes more critical to the general management of all scheduled workloads, it is essential that both AAI and the data feed process operate continuously. This topic explains how to implement a Disaster Recovery (DR) environment and how to proceed when a situation requires a switch to a DR system.

Recommendations

Create the DR environment based on the architecture graphic. For more information, see Architecture of the CA7 Server for AAI.

When implementing the CA7 Server for AAI, consider the following factors. They help minimize or eliminate DR activities on the mainframe side in the event of a DR switch:

  • Implement Virtual IP Addressing (VIPA) for the AAI system so that the IP address follows the AAI system when it is switched from the Primary server to the DR server.

  • Define the same User ID used for the data delivery mechanism (SFTP/FTP/FTPS) on both the Primary and DR servers.

  • For SFTP data delivery, configure the mainframe STC User ID public key into the .ssh/authorized_keys file of the server User ID on both the Primary and DR servers.

  • For FTP data delivery, ensure the password for the server User ID is the same on both the Primary and DR servers.

  • For FTPS (SSL/TLS), if Host authentication is used, ensure that both the Primary and DR servers are configured for it on the mainframe.

  • Data delivery from a CA7 Server for AAI system is targeted to a specific directory path location. Create the same directory path/location(s) on both the Primary and DR servers.
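The recommendations above amount to keeping the Primary and DR servers in parity. The following is a minimal sketch of such a parity check, not part of the product: the `ServerProfile` type, the `parity_gaps` function, and all field names are hypothetical, and it assumes you have already gathered the settings from each server by your own means.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ServerProfile:
    """Hypothetical summary of the delivery settings on one AAI server."""
    user_id: str
    delivery_dirs: frozenset
    authorized_key_fingerprints: frozenset = field(default_factory=frozenset)


def parity_gaps(primary: ServerProfile, dr: ServerProfile) -> list:
    """Return readable descriptions of settings the DR server is missing."""
    gaps = []
    if primary.user_id != dr.user_id:
        gaps.append(f"user ID differs: {primary.user_id!r} vs {dr.user_id!r}")
    # Every delivery directory on the Primary must also exist on the DR server.
    for path in sorted(primary.delivery_dirs - dr.delivery_dirs):
        gaps.append(f"directory missing on DR: {path}")
    # For SFTP, the mainframe STC public key must be authorized on both servers.
    missing_keys = primary.authorized_key_fingerprints - dr.authorized_key_fingerprints
    for key in sorted(missing_keys):
        gaps.append(f"authorized key missing on DR: {key}")
    return gaps
```

An empty result indicates that, for the settings modeled here, a DR switch should not require changes on the mainframe side.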

Scenarios

There are two main scenarios that could occur in a DR situation.

  • The AAI primary server fails completely and is no longer active, so a DR switch occurs.

  • The LPAR where the mainframe scheduler is running fails and a DR switch occurs.

The actions that you must perform depend on which scenario occurs and on whether the previous recommendations were implemented when the AAI system DR environment was created.

Note:

The actual process of switching the AAI server to the DR server should follow AAI best practices and recommendations, which are not covered here.

Scenario 1 – AAI Primary Server Failure

This section explains how the CA7 Server for AAI systems will react to an AAI DR switch when the AAI primary server fails. It describes what should happen following the DR switch of AAI to ensure no loss of data from the source scheduler.

Description of the Scenario

  • The AAI primary server has completely failed and is no longer active.

  • The CA7 Server for AAI system on the mainframe is unable to connect to the failed server and data delivery fails.

  • The CA7 Server for AAI system is robust. It continues to attempt to create and deliver the Event data, but starts to issue Data Delivery Failed messages to the z/OS console (+AI7.SCN2E console messages). For more information, see Automating an Instance of the CA7 Server for AAI.

  • During this time, the CA7 Server for AAI recognizes that the Event data has not been delivered and does not update the Checkpoint date/time value that records the last successful delivery.
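The checkpoint behavior described above can be illustrated with a small model. This is a sketch of the general technique only, not the product's implementation: `DeliveryState`, `deliver`, the integer clock, and the `send` callback are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class DeliveryState:
    # Time of the last successful delivery (illustrative integer clock).
    checkpoint: int


def deliver(state, events, now, send):
    """Try to send every event newer than the checkpoint, up to `now`.

    `events` is a list of (timestamp, payload) pairs and `send` is a callable
    that returns True on success. The checkpoint advances only on success.
    """
    pending = [e for e in events if state.checkpoint < e[0] <= now]
    if send(pending):
        state.checkpoint = now
        return pending
    # Failed delivery: the checkpoint is untouched, so the same events are
    # included again in the next attempt.
    return []
```

Because a failed attempt leaves the checkpoint where it was, the first successful delivery after the outage carries everything that accumulated in the meantime.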

Actions

If the AAI primary and DR environments were created according to the previous recommendations for the Virtual IP Address, common User ID, and common directory locations, no further actions for the CA7 Server for AAI should be required with regard to Event data.

When the AAI system is started on the DR server, the following happens:

  • The Virtual IP Address (when common for both servers) effectively becomes available again.

  • The CA7 Server for AAI automatically re-connects.

  • The CA7 Server for AAI identifies the last successful Event delivery from the Checkpoint.

  • It creates the next RPT70 file, which contains all the Event data that has been missing since the connection was lost when the Primary server failed.

This assumes that the AAI switch to the DR server completes quickly enough that all the Event data is still available from the Active LOGP/LOGS datasets. If the DR switch is delayed for more than 1 hour, the following is recommended:

  • Stop the CA7 Server for AAI system while the DR switch occurs.

  • Restart the CA7 Server for AAI in WARM mode when the AAI DR server has been successfully activated.

The CA7 Server for AAI will then use the Checkpoint information to perform Event data recovery using the CA7 History Log files if needed.
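The one-hour guideline can be written as a simple decision rule, for example in a runbook helper. This is a sketch only: `recovery_steps` and `DELAY_THRESHOLD` are hypothetical names, and the returned strings merely restate the steps described in this section.

```python
from datetime import timedelta

# The one-hour guideline stated in this section.
DELAY_THRESHOLD = timedelta(hours=1)


def recovery_steps(expected_outage: timedelta) -> list:
    """Return the recommended steps for an AAI DR switch of the given length."""
    if expected_outage > DELAY_THRESHOLD:
        return [
            "stop the CA7 Server for AAI",
            "complete the AAI DR switch",
            "restart the CA7 Server for AAI in WARM mode",
        ]
    # Short outage: the server keeps retrying and resends from the Checkpoint.
    return ["leave the CA7 Server for AAI running"]
```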

If the DR switch of AAI occurs while a Definition Data delivery is in progress, the delivery is likely to fail and the console message +AI7.SCN5E is issued. Review the CA7 Server for AAI STC log to determine whether the Definition Data delivery completed successfully. If it did not, use either the Request STC or a manual action via the IMS to re-issue the Definition Data delivery request to the CA7 Server for AAI when the AAI system is active again on the DR server.

Scenario 2 – LPAR Fails

This section explains the preparations that are necessary to implement the CA7 Server for AAI system in a mainframe DR environment. It also describes what must happen with the CA7 Server for AAI system in the event of an LPAR switch to that DR environment, to again ensure that no data loss from the source scheduler occurs.

Description of the Scenario

  • The LPAR where CA7 itself and the CA7 Server for AAI are running fails.

  • A DR switch occurs to another LPAR.

    The DR LPAR requires that the CA7 Server for AAI is also installed on that LPAR and that the CA7 instance is defined with identical configuration settings.

Situation A: Real-Time Replication is in Use Between the Primary LPAR and the DR LPAR

The SRVRCKPT dataset should likewise be replicated to the DR LPAR.

Starting the CA7 Server for AAI instance on the DR LPAR in WARM mode automatically determines the last successful Event Data delivery from the SRVRCKPT dataset. It restarts the Event Data delivery from the time of failure of the Primary LPAR.

Situation B: Real-Time Replication is NOT in Use Between the Primary LPAR and the DR LPAR

In this case, it is necessary to ensure that the last successful Event Data delivery date and time is set correctly in the SRVRCKPT dataset before the CA7 Server for AAI system is started on the DR LPAR. You can do this manually via the IMS Menu Option 3 for the CA7 Server for AAI instance that is moving to the DR LPAR. For more information, see Event Recovery Checkpoint Override.

Because AAI supports the overlap of Event data from CA7, it is only necessary to set the date/time value in the SRVRCKPT dataset to a point in time earlier than the initial failure of the Primary LPAR.

Once the date/time value has been successfully updated, then the CA7 Server for AAI can be started in WARM mode. This causes an Event Recovery to occur, starting from the date/time specified in the SRVRCKPT dataset, as updated through the IMS.
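Choosing the override value comes down to stepping back from the known failure time by a safety margin. The sketch below shows that arithmetic only. The function name `override_checkpoint` and the 30-minute default margin are assumptions for illustration, and the format that the IMS expects for the date/time entry is not shown here.

```python
from datetime import datetime, timedelta


def override_checkpoint(
    failure_time: datetime,
    margin: timedelta = timedelta(minutes=30),  # assumed margin, not a product default
) -> datetime:
    """Return a checkpoint value safely before the Primary LPAR failure."""
    if margin <= timedelta(0):
        raise ValueError("margin must be positive so the checkpoint precedes the failure")
    # AAI tolerates overlapping Event data, so an earlier value is safe.
    return failure_time - margin
```

A larger margin increases the amount of Event data that is resent but reduces the risk of a gap if the failure time is only approximately known.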

Next Step

The last step to integrate the CA7 environment with AAI is to create the CA7 scheduler within AAI. For more information, see: