I’m working on setting up FreeRADIUS servers. These authenticate users against an Active Directory using the ntlm_auth and winbind programs from the Samba suite.
Hit an interesting issue today, and the answer wasn’t straightforward. FreeRADIUS had been running for a day or so when it was reported that people had dropped off the network. Sure enough, FreeRADIUS had stopped sending Access-Accept packets an hour or so before.
At a cursory glance, the log files looked strange. The radiusd.log file ran through everything as usual, running ntlm_auth and getting a successful login. After that, it just stopped, and the client timed out without getting a response.
Trying with an invalid password worked – an Access-Reject was sent back. Similarly, trying with a proxied account also worked (Accept returned fine), as did a local account defined in the FreeRADIUS server. So it looked like a problem in the mschap module not handling the response from ntlm_auth correctly.
ntlm_auth itself seemed fine – running from the command line it would authenticate a user with NT_STATUS_OK or NT_STATUS_WRONG_PASSWORD as appropriate.
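For anyone wanting to script that command-line check, here is a minimal Python sketch. It assumes ntlm_auth is on the PATH and prints a line starting with an NT_STATUS_* token, as it did in my tests; the function names and the sample output lines are mine, for illustration only.

```python
import subprocess


def parse_ntlm_status(output: str) -> str:
    """Return the first NT_STATUS_* token found in ntlm_auth output, else 'UNKNOWN'."""
    for line in output.splitlines():
        # Status lines look like "NT_STATUS_OK: <description> (0x0)";
        # keep only the part before the first colon.
        token = line.strip().split(":", 1)[0]
        if token.startswith("NT_STATUS_"):
            return token
    return "UNKNOWN"


def run_ntlm_auth(username: str, domain: str, password: str, timeout: int = 10) -> str:
    """Run ntlm_auth for one user and return the parsed status token.

    Note: a password passed as an argument is visible in the process list,
    so only use a dedicated test account here.
    """
    cmd = [
        "ntlm_auth",
        f"--username={username}",
        f"--domain={domain}",
        f"--password={password}",
    ]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return "UNKNOWN"
    return parse_ntlm_status(result.stdout + result.stderr)
```

Bear in mind that in this incident the check would have passed the whole time, which is exactly why it isn't sufficient on its own.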
So, a lot of staring into the FreeRADIUS logs ensued, together with some stracing to try and work out just what it was doing. After a while, it became apparent that there just wasn’t the required debug info in the logs, so I tried sending it a HUP – still no joy. Therefore a full restart of FreeRADIUS, which must fix it, surely? Everything else is fine.
But no – it had the same issue.
The next thing was to restart winbind, which seemed as though it couldn’t have anything to do with the problem, as ntlm_auth was already working. It all sprang into life…
This seems a very odd failure scenario – every component appeared to work, reporting success or failure as appropriate, but something wasn’t getting correctly from winbind via ntlm_auth back to FreeRADIUS, causing it to fail silently. The type of error that any system administrator dreads.
So this all served as a reminder for the service checks – this sort of thing needs to be highlighted quickly once we’re in production, and a quick ‘is the daemon running’ won’t be good enough. It’s good practice to test the whole stack anyway, and this is the exact scenario where a partial check could show everything as working, even though it’s not.
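As a rough idea of what a whole-stack check could look like, here is a minimal Python sketch that sends a real MS-CHAP request with radtest and treats anything other than an Access-Accept as a failure. It assumes a radtest that supports `-t mschap`; the server address, shared secret, and the `classify_reply` and `check_radius` names are placeholders of mine, not anything from our actual setup.

```python
import subprocess


def classify_reply(output: str) -> str:
    """Map radtest output to 'accept', 'reject', or 'no-reply'."""
    if "Access-Accept" in output:
        return "accept"
    if "Access-Reject" in output:
        return "reject"
    # Neither packet type seen: the server stayed silent, which is the
    # failure described in this post.
    return "no-reply"


def check_radius(user: str, password: str, server: str = "127.0.0.1",
                 secret: str = "testing123", timeout: int = 15) -> str:
    """Authenticate a known-good AD test user through the full stack."""
    # radtest arguments: user, password, server, NAS port number, shared secret.
    # MS-CHAP is used so the request exercises the mschap module and ntlm_auth.
    cmd = ["radtest", "-t", "mschap", user, password, server, "0", secret]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return "no-reply"
    except FileNotFoundError:
        return "error"  # radtest is not installed on this host
    return classify_reply(result.stdout + result.stderr)
```

A monitoring wrapper would call `check_radius` with a dedicated Active Directory test account and alert on any result other than "accept". The point is that a valid login is the only case that failed here, so a check using a bad password or a local account would have stayed green.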