Whats a RLOCK ERROR?
#32238
29 Jan 20 11:49 AM
|
Joined: Sep 2003
Posts: 4,158
Steve - Caliq
OP
Member
|
OP
Member
Joined: Sep 2003
Posts: 4,158 |
Curious any thoughts on what may cause a RLOCK ERROR and hang everyone running ashell? We killed all the Linux jobs off, cleared qflock and jobtbl etc and then everything was fine again, Running Ashell 6.5.1643.0. Only BBS post I can see is here.snippet from ashlog.log
29-Jan-20 11:17:58 [p12733-0]<:(nil)> !rlock signal interrupt (^C - return EACCES); fh=5, wh=0, op=1, wt=1, siz=51200
29-Jan-20 11:17:58 [p12733-0]<:(nil)> !!!RLOCK ERROR: fh=5, whence=0, op=1, siz=51200, wt=1, rc=13
29-Jan-20 11:17:58 [p12733-0]<:(nil)> Erno 13 (Permission denied)
29-Jan-20 11:17:58 [p12733-0]<:(nil)> Launch Error: [pts/107:12733] Fatal Error! Unable to access job control block (jobtbl.sys)
29-Jan-20 11:18:02 [p12733-0]<:(nil)> !rlock signal interrupt (^C - return EACCES); fh=5, wh=0, op=1, wt=1, siz=51200
29-Jan-20 11:18:02 [p12733-0]<:(nil)> !!!RLOCK ERROR: fh=5, whence=0, op=1, siz=51200, wt=1, rc=13
29-Jan-20 11:18:02 [p12733-0]<:(nil)> Erno 13 (Permission denied)
29-Jan-20 11:18:02 [p12733-0]<:(nil)> Unable to read/lock JOBTBL header in QPURGE!
snippet from /var/log/messages
Jan 29 11:18:48 localhost xinetd[2320]: START: telnet pid=12842 from=::ffff:192.168.1.66
Jan 29 11:19:07 localhost xinetd[2320]: START: telnet pid=12882 from=::ffff:192.168.1.37
Jan 29 11:19:29 localhost xinetd[2320]: START: telnet pid=12918 from=::ffff:192.168.1.108
Jan 29 11:19:38 localhost xinetd[2320]: START: telnet pid=12950 from=::ffff:192.168.1.154
Jan 29 11:19:38 localhost telnetd[12950]: ttloop: peer died: EOF
Jan 29 11:19:38 localhost xinetd[2320]: EXIT: telnet status=1 pid=12950 duration=0(sec)
Jan 29 11:19:38 localhost xinetd[2320]: EXIT: telnet status=0 pid=12482 duration=264(sec)
Jan 29 11:21:17 localhost xinetd[2320]: START: telnet pid=13048 from=::ffff:192.168.10.25
Jan 29 11:21:31 localhost xinetd[2320]: EXIT: telnet status=0 pid=11771 duration=790(sec)
Jan 29 11:21:37 localhost xinetd[2320]: START: telnet pid=13084 from=::ffff:192.168.1.36
Jan 29 11:22:10 localhost smbd[2539]: [2020/01/29 11:22:10.528535, 0] smbd/server.c:281(remove_child_pid)
Jan 29 11:22:10 localhost smbd[2539]: Could not find child 13139 -- ignoring
Jan 29 11:22:48 localhost xinetd[2320]: EXIT: telnet status=0 pid=5503 duration=3793(sec)
Jan 29 11:22:48 localhost xinetd[2320]: EXIT: telnet status=0 pid=28258 duration=11755(sec)
Jan 29 11:22:48 localhost xinetd[2320]: EXIT: telnet status=0 pid=22438 duration=19482(sec)
Jan 29 11:22:48 localhost xinetd[2320]: EXIT: telnet status=0 pid=27404 duration=12450(sec)
Jan 29 11:22:48 localhost xinetd[2320]: EXIT: telnet status=0 pid=1795 duration=6841(sec)
Jan 29 11:22:48 localhost xinetd[2320]: EXIT: telnet status=0 pid=22584 duration=19366(sec)
Jan 29 11:22:48 localhost xinetd[2320]: EXIT: telnet status=0 pid=28573 duration=11600(sec)
Jan 29 11:22:48 localhost xinetd[2320]: EXIT: telnet status=0 pid=13048 duration=91(sec)
Jan 29 11:22:48 localhost xinetd[2320]: EXIT: telnet status=0 pid=22667 duration=19361(sec)
|
|
|
Re: Whats a RLOCK ERROR?
[Re: Steve - Caliq]
#32239
29 Jan 20 03:03 PM
|
Joined: Jun 2001
Posts: 11,794
Jack McGregor
Member
|
Member
Joined: Jun 2001
Posts: 11,794 |
The RLOCK ERROR on file handle 5, with a size of 51200, is almost certainly an attempt to get an exclusive lock on the JOBTBL.SYS (or maybe the QFLOCK.SYS) in order to update it with the job that is starting up. It fails with permission denied, which I think indicates that some part of the file is already locked by another user. The culprit was that other user, not the one shown in the log above. The usual symptom isn't a total lockup, rather that it becomes impossible for jobs to start or finish. (Once the job is added to the jobtbl, from that point on it only needs access to it's own record for normal operations.) The easiest way to identify the culprit (in the moment) is with the lslk (list locks) command. At this point, the ashlog might contain a clue if there was some kind of error reported just before the RLOCK errors started showing up. More likely, whatever the initial error was, it wasn't traced (since that would suggest that the problem was being handled, which clearly it wasn't). Reports of this problem have become pretty rare, as the potential loopholes and windows of opportunities have gradually been identified and closed. An update might help, although it's difficult to say for sure without a better diagnosis of what actually happened.
|
|
|
Re: Whats a RLOCK ERROR?
[Re: Steve - Caliq]
#32241
29 Jan 20 03:08 PM
|
Joined: Sep 2003
Posts: 4,158
Steve - Caliq
OP
Member
|
OP
Member
Joined: Sep 2003
Posts: 4,158 |
Thanks Jack, Thanks Steven for the info, as I think its the first we seen... maybe its the last too? but it will give as more things to check IF it does ever occur again.
|
|
|
Re: Whats a RLOCK ERROR?
[Re: Steve - Caliq]
#33684
07 Dec 20 02:10 PM
|
Joined: Jun 2001
Posts: 153
OmniLedger - Tom Reynolds
Member
|
Member
Joined: Jun 2001
Posts: 153 |
Hi guys, I thought I'd tag onto the end of this as it's very similar to an issue I'm having on one of our servers. It appears to have suddently started generating rlock errors, but we've not really encountered them before and are struggling to diagnose the cause.
|
|
|
Re: Whats a RLOCK ERROR?
[Re: Steve - Caliq]
#33685
07 Dec 20 02:20 PM
|
Joined: Sep 2002
Posts: 5,471
Frank
Member
|
Member
Joined: Sep 2002
Posts: 5,471 |
Good Morning Tom -
Just looking at the thread seems to be a problem with a lock on jobtbl.sys. Guessing this is a windows server? I would kill all the connected ashell sessions and see if that works, or if possible reboot the server should clear everything.
|
|
|
Re: Whats a RLOCK ERROR?
[Re: Steve - Caliq]
#33686
07 Dec 20 05:13 PM
|
Joined: Jun 2001
Posts: 11,794
Jack McGregor
Member
|
Member
Joined: Jun 2001
Posts: 11,794 |
I think Frank is right on this one. "RLOCK" is just the name of an A-Shell routine that handles Random file LOCKing (and not the "Re-entrant LOCK routine Stephen uncovered). As noted in my January post above, in most systems, by far the biggest use relates to the JOBTBL.SYS. Every operation that involves scanning that file (such as the start or end of a session, or a recalculation of the license usage) starts by locking the control record. Over the years, we've closed a number of small windows of opportunity through which a job could get itself stuck while it had a lock on the file, which effectively prevents any new sessions from starting or ending. (It typically doesn't affect running sessions, because even though they frequently read/write their JOBTBL record, there's no need to lock that operation.)
I don't think we have had any reports of this problem since 6.5.1672. Under 6.4, maybe once every couple of months we hear of a case, but I usually start by advising them to update, after which the problem seems to disappear. (Which is in many ways the perfect outcome, except that it hasn't forced us to try to identify and fix the issue in 6.4.)
If we are to resolve it another way, the place to start would be by looking at the ashlog.log file to get some clues. Since the problem is most noticeable to the next job to launch, rather than the one actually responsible, it helps greatly if you have the TRACE=INOUT option set in the miame.ini. (I always recommend TRACE=SIGHUP,INOUT,BASERR since that provides the most critical information needed for this kind of forensic analysis without getting too verbose.) Hopefully there is some kind of clue logged for previous job to launch or shut down. For example, if it was triggered by a SIGHUP, but there is no "After qpurge and closing" message, that indicates that the job somehow got stuck while closing down. Based on the clues in the log, maybe we can identify the exact trigger.
|
|
|
Re: Whats a RLOCK ERROR?
[Re: Steve - Caliq]
#33687
09 Dec 20 09:05 AM
|
Joined: Jun 2001
Posts: 153
OmniLedger - Tom Reynolds
Member
|
Member
Joined: Jun 2001
Posts: 153 |
Thanks, I'll expand the loggin to cover those to see if it highlights where things are going awry.
|
|
|
|
|