It is not uncommon for us to configure a customer’s high availability system for single sign-on.
Recently, though, two different customers called me after a role swap saying that SSO was not working.
I had worked with both customers prior to their role swaps to ensure that their HA systems were properly configured and that SSO would continue to work after the role swap.
Both companies had completed the setup required before a role swap, which involves three steps:
- Configure their HA software packages to replicate the Network Authentication Service (i.e. Kerberos) configuration and keytab file to the backup system. This is done by telling the HA software to replicate the /QIBM/userdata/OS400/networkauthentication directory.
- Use the EIM configuration wizard to create a local EIM repository on the backup system and configure the backup system to use the local EIM repository.
- Configure the HA software to replicate the required LDAP libraries and configuration directories. Depending on your software package, this step may also include setting up peer-to-peer LDAP replication between the production and the backup systems.
Both companies were relying on switching the DNS hostname entries for the production and backup systems. For example, if the production machine’s fully qualified hostname was prod.myco.com with address 10.0.1.2 and the backup system’s name was ha.myco.com with address 10.0.1.3, after the role swap the DNS entries would be prod.myco.com with address 10.0.1.3 and ha.myco.com with address 10.0.1.2.
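To make the swap concrete, here is a minimal sketch in Python that models the forward ("A") records as a plain dictionary and exchanges the two addresses. The function name `swap_roles` is made up for illustration; a real role swap script would update the DNS server itself rather than a dictionary.

```python
def swap_roles(a_records, production, backup):
    """Return a copy of the A records with the two hosts' addresses exchanged."""
    swapped = dict(a_records)
    swapped[production] = a_records[backup]
    swapped[backup] = a_records[production]
    return swapped


# Hostnames and addresses are the example values from the paragraph above.
before = {"prod.myco.com": "10.0.1.2", "ha.myco.com": "10.0.1.3"}
after = swap_roles(before, "prod.myco.com", "ha.myco.com")
```

Note that this only describes the forward entries. Nothing in it says what happens to the reverse entries for 10.0.1.2 and 10.0.1.3.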
These companies followed their role swap scripts to a “T.” Therefore I was a bit surprised when the first customer called to say that, after the role swap, single sign-on was not working.
Everything came up and worked perfectly – except for SSO. They were getting an “unable to find service principal” error message.
Diagnosing the SSO Problem
Usually, when we see this error message, it has something to do with the client application trying to connect with a hostname that wasn’t configured, or with an incorrect entry in the DNS. We checked both, and neither was the issue.
Next we used QShell to run the “kinit” Kerberos login program and logged in with the service principal and password from the keytab file. That was successful, which was interesting: it suggested that the problem was somewhere between the client application (the PC5250 Telnet emulator in this case) and the Windows domain controller (i.e. the Kerberos KDC).
After scratching our collective heads a bit, it occurred to me that the PC5250 emulator, like a lot of Kerberos client applications, takes the target hostname and does a forward DNS lookup to get the address, and then does a reverse lookup to get the primary DNS hostname of the target. This is because Kerberos service principals are often defined only for the primary hostname of a target.
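The lookup sequence can be sketched roughly as follows. This is not the PC5250 emulator's actual code, only an illustration of the forward-then-reverse pattern using Python's standard `socket` module. The function names, the `krbsvr400` service name, and the `MYCO.COM` realm are assumptions made for the example.

```python
import socket


def canonical_host(target, forward=socket.gethostbyname,
                   reverse=socket.gethostbyaddr):
    """Forward lookup to an address, then reverse lookup to the primary name."""
    address = forward(target)
    try:
        primary_name, _aliases, _addresses = reverse(address)
    except OSError:
        # socket.herror (raised when no reverse entry exists) is an OSError.
        return None
    return primary_name


def service_principal(target, service="krbsvr400", realm="MYCO.COM",
                      **resolvers):
    """Build the principal name a client would request a ticket for."""
    host = canonical_host(target, **resolvers)
    if host is None:
        raise LookupError(
            f"no reverse (PTR) entry for {target}; "
            "cannot build a service principal name"
        )
    return f"{service}/{host.lower()}@{realm}"
```

The resolver functions are passed in as parameters so the logic can be exercised without touching a live DNS server. With the defaults, `service_principal("prod.myco.com")` would query whatever DNS the machine running it is configured to use.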
That was the clue as to what was wrong.
DNS Tip for High Availability and SSO
When the DNS hostnames were switched, the current forward entries (i.e. “A” records) were deleted. Deleting them also deleted the matching reverse lookup entries (i.e. “PTR” records). The new hostnames were then added with the opposite TCP/IP addresses according to the role swap script. However, the role swap script did not include a step to set up the reverse lookup entries! Because the reverse lookup was failing in the PC5250 client, it couldn’t build a service principal name recognized by the Windows domain controller. Simply adding the reverse lookup for the new production machine DNS entry fixed the problem.
Of course, the role swap scripts were also changed to add the additional steps.
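A check along these lines could be run as the last step of a role swap script. It is a minimal sketch that compares forward and reverse entries held in plain dictionaries. The function name `missing_ptr_records` is hypothetical, and how the records are read from the DNS server is left out.

```python
def missing_ptr_records(a_records, ptr_records):
    """Return (hostname, address) pairs whose reverse entry is absent or wrong."""
    problems = []
    for hostname, address in sorted(a_records.items()):
        # A PTR entry counts only if it maps the address back to this hostname.
        if ptr_records.get(address, "").lower() != hostname.lower():
            problems.append((hostname, address))
    return problems


# Zone state right after the swap described above:
# the A records were re-added, but the PTR records were not.
a_after_swap = {"prod.myco.com": "10.0.1.3", "ha.myco.com": "10.0.1.2"}
ptr_after_swap = {}
unresolved = missing_ptr_records(a_after_swap, ptr_after_swap)
```

An empty result means every hostname resolves back to itself. Anything else is a reverse entry that still needs to be added before SSO clients connect.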
Second Test Case
Just two days later, we received a call from the second customer describing what sounded like the exact same problem. We immediately told them to add the reverse lookup entry, and everything started working again.
They were very impressed that we could identify and fix the problem so quickly. We, of course, attributed that to the immense breadth and depth of our knowledge of all things SSO.
I just hope they don’t read this article.