APAR status
Closed as program error.
Error description
IBM MQ channels might become unresponsive, with high CPU usage in the channel process (amqrmppa or runmqchl), if the channel synchronization record is corrupted. If the problem affects channels on the sender side (for example, SDR or CLUSSDR), those channels send no messages. The "DIS CHS" output is likely to show no value in the SUBSTATE field.
AMQ8417I: Display Channel Status details.
CHANNEL(CLUSCHL1) CHLTYPE(CLUSSDR)
…
RQMNAME(RQM1) STATUS(RUNNING)
SUBSTATE( )
XMITQ(SYSTEM.CLUSTER.TRANSMIT.CLUSCHL1)
If the affected channels are on the receiver side (for example, RCVR or CLUSRCVR), the channel process on the receiver side consumes high CPU and the corresponding sender channel (SDR or CLUSSDR) goes into the retrying state.
The top output for the affected channel process shows high CPU usage.
PID    USER PR NI VIRT   RES   SHR   S %CPU  %MEM TIME+   COMMAND
133101 mqm  20 0  265412 13836 11584 R 106.2 0.4  2:08.97 runmqchl
133101 mqm  20 0  265412 13836 11584 R 99.7  0.4  2:11.97 runmqchl
133101 mqm  20 0  265412 13836 11584 R 99.3  0.4  2:14.97 runmqchl
133101 mqm  20 0  265412 13836 11584 R 99.7  0.4  2:17.97 runmqchl
IBM MQ trace shows the channel process repeatedly calling rflSeekBytes and rflReadBytes with the same pattern of comparison and file pointers.
Example:
19:54:22.511503 133101.1 RSESS:000001 NewFilePointer=716(0x000002cc) <---
19:54:22.511528 133101.1 RSESS:000001 NewFilePointer=1072(0x00000430)
19:54:22.511552 133101.1 RSESS:000001 NewFilePointer=1428(0x00000594)
19:54:22.511577 133101.1 RSESS:000001 NewFilePointer=1784(0x000006f8)
19:54:22.511603 133101.1 RSESS:000001 NewFilePointer=2140(0x0000085c)
19:54:22.511628 133101.1 RSESS:000001 NewFilePointer=2496(0x000009c0)
19:54:22.511653 133101.1 RSESS:000001 NewFilePointer=2852(0x00000b24)
19:54:22.511678 133101.1 RSESS:000001 NewFilePointer=716(0x000002cc) <---
19:54:22.511704 133101.1 RSESS:000001 NewFilePointer=1072(0x00000430)
19:54:22.511728 133101.1 RSESS:000001 NewFilePointer=1428(0x00000594)
19:54:22.511753 133101.1 RSESS:000001 NewFilePointer=1784(0x000006f8)
19:54:22.511778 133101.1 RSESS:000001 NewFilePointer=2140(0x0000085c)
19:54:22.511806 133101.1 RSESS:000001 NewFilePointer=2496(0x000009c0)
19:54:22.511831 133101.1 RSESS:000001 NewFilePointer=2852(0x00000b24)
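The repetition in the trace above can be checked mechanically. The following is a minimal Python sketch, not an IBM tool: it assumes the trace has been formatted to text and that the offsets appear as "NewFilePointer=<decimal>(" as in the excerpt. The function name find_pointer_cycle is hypothetical. A single repeated offset does not prove an infinite loop on its own, so treat a hit as a prompt to inspect the trace further.

```python
import re

# Matches the decimal offset in entries such as "NewFilePointer=716(0x000002cc)".
POINTER_RE = re.compile(r"NewFilePointer=(\d+)")


def find_pointer_cycle(trace_text):
    """Return the run of offsets up to the first repeated offset, or None.

    Hypothetical helper: it only looks at the order of NewFilePointer
    values and ignores timestamps and thread identifiers.
    """
    offsets = [int(m.group(1)) for m in POINTER_RE.finditer(trace_text)]
    first_seen = {}
    for index, offset in enumerate(offsets):
        if offset in first_seen:
            # Offsets between the first sighting and the repeat form the cycle.
            return offsets[first_seen[offset]:index]
        first_seen[offset] = index
    return None


if __name__ == "__main__":
    sample = (
        "NewFilePointer=716(0x000002cc)\n"
        "NewFilePointer=1072(0x00000430)\n"
        "NewFilePointer=1428(0x00000594)\n"
        "NewFilePointer=716(0x000002cc)\n"
    )
    print(find_pointer_cycle(sample))
```

Because the regular expression searches the whole text, the sketch works whether each trace entry is on one line or split across two.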
Local fix
Stop the queue manager.
Back up the queue manager.
Rename the sync file AMQRSYNA.DAT.
Start the queue manager with the -ns option (strmqm -ns QM).
Re-create the channel sync file (rcrmqobj -m QM -t syncfile).
Stop the queue manager.
Start the queue manager normally.
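The steps above can be laid out as an ordered command plan for review before anything is run. The following Python sketch only builds and prints the commands; it does not execute them. Assumptions not taken from this APAR: endmqm -w is used for the "stop" steps, mv is used for the rename, the path to AMQRSYNA.DAT is supplied by the caller because its location depends on the installation, and the backup step is left as a comment because the backup method is site-specific. The function names and the ".corrupt" suffix are hypothetical.

```python
import shlex


def workaround_plan(qmgr, sync_file_path, renamed_suffix=".corrupt"):
    """Return the workaround as a list of argument lists (dry run only)."""
    return [
        # Stop the queue manager and wait until it has ended (assumed command).
        ["endmqm", "-w", qmgr],
        # Back up the queue manager here; the method is site-specific.
        # Rename the sync file so that a new one can be created.
        ["mv", sync_file_path, sync_file_path + renamed_suffix],
        # Start the queue manager with the -ns option, as the APAR states.
        ["strmqm", "-ns", qmgr],
        # Re-create the channel sync file.
        ["rcrmqobj", "-m", qmgr, "-t", "syncfile"],
        # Restart the queue manager normally.
        ["endmqm", "-w", qmgr],
        ["strmqm", qmgr],
    ]


def format_plan(plan):
    """Render each command as a shell-quoted line for review."""
    return [" ".join(shlex.quote(part) for part in command) for command in plan]


if __name__ == "__main__":
    # "/path/to/AMQRSYNA.DAT" is a placeholder, not a real location.
    for line in format_plan(workaround_plan("QM", "/path/to/AMQRSYNA.DAT")):
        print(line)
```

Printing the plan first makes it easy to confirm the queue manager name and file path before running each command by hand as the mqm user.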
Problem summary
USERS AFFECTED:
All users of IBM MQ distributed channels who have a corrupted channel synchronization record in the channel sync file. Corruption of this file is not an expected or typical usage pattern, and has not been observed as a result of any known product defect.
The channel sync file is used by all queue manager channel types except SVRCONN/CLNTCONN and AMQP channels.
Platforms affected:
MultiPlatform
****************************************************************
PROBLEM DESCRIPTION:
The IBM MQ channel process did not detect corruption in the channel synchronization record. This caused an infinite loop, which left the channel in an unresponsive state.
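To illustrate the class of defect, the following Python sketch walks a chain of records and stops with an error if it reaches an offset it has already visited. It is not IBM's implementation and does not reflect the real layout of the sync file: the dictionary of offset to (key, next offset) pairs, the function find_record, and the exception SyncFileLoopError are all hypothetical.

```python
class SyncFileLoopError(Exception):
    """Raised when a record chain revisits an offset (hypothetical name)."""


def find_record(records, start_offset, wanted_key):
    """Follow next-offset links until wanted_key is found.

    records maps an offset to a (key, next_offset) tuple; next_offset is
    None at the end of the chain. Returns the matching offset or None.
    """
    visited = set()
    offset = start_offset
    while offset is not None:
        if offset in visited:
            # A corrupted link points back into the chain: fail instead of spinning.
            raise SyncFileLoopError(f"offset {offset} visited twice")
        visited.add(offset)
        key, next_offset = records[offset]
        if key == wanted_key:
            return offset
        offset = next_offset
    return None


if __name__ == "__main__":
    # The last record wrongly links back to the first one.
    corrupted = {716: ("CHL1", 1072), 1072: ("CHL2", 1428), 1428: ("CHL3", 716)}
    try:
        find_record(corrupted, 716, "CHL9")
    except SyncFileLoopError as error:
        print("loop detected:", error)
```

Without the visited set, the search for a key that is not in the chain would never terminate, which matches the symptom of sustained CPU usage with no progress.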
Problem conclusion
The IBM MQ code has been modified to prevent an infinite loop if the channel synchronization record is corrupted.
This APAR does not address the corruption in the channel synchronization record itself, as the cause of the corruption at the time this issue was observed remains unknown.
With the fix applied, if the queue manager detects an infinite loop while searching for a channel record in the channel synchronization file, it writes the following error message and the channel goes into the retrying state.
--------
02/28/2022 10:37:07 PM - Process(181467.1) User(root)
Program(runmqchl)
Host(host1.ibm.com) Installation(Installation1)
VRMF(9.1.0.7) QMgr(qm1)
Time(2022-03-01T06:37:07.434Z)
ArithInsert1(1017)
CommentInsert1(AMQRSYNA.DAT)
AMQ9516E: File error occurred for file 'AMQRSYNA.DAT'.
EXPLANATION:
The filesystem returned error code 1017 for file 'AMQRSYNA.DAT'.
ACTION:
Record the name of the file and tell the systems administrator, who should ensure that file is correct and available, for example that the current user has appropriate access to the file for reading or writing.
--------
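The signature in the message above can be searched for in a queue manager error log. The following Python sketch is an illustration only: it assumes the log text has already been read into a string, that entries are separated by lines beginning with a run of dashes, and that the inserts appear as ArithInsert1(...) and CommentInsert1(...) as in the excerpt. The function name find_sync_file_errors is hypothetical.

```python
import re

# Entry separators: lines that start with five or more dashes (ASCII or U+2014).
SEPARATOR_RE = re.compile(r"^[-\u2014]{5,}.*$", re.MULTILINE)
PROCESS_RE = re.compile(r"Process\(([\d.]+)\)")
CODE_RE = re.compile(r"ArithInsert1\((\d+)\)")


def find_sync_file_errors(log_text, file_name="AMQRSYNA.DAT"):
    """Return (process, error_code) tuples for AMQ9516E entries naming file_name."""
    hits = []
    for entry in SEPARATOR_RE.split(log_text):
        if "AMQ9516E" not in entry:
            continue
        if f"CommentInsert1({file_name})" not in entry:
            continue
        process = PROCESS_RE.search(entry)
        code = CODE_RE.search(entry)
        hits.append((
            process.group(1) if process else None,
            int(code.group(1)) if code else None,
        ))
    return hits


if __name__ == "__main__":
    sample = (
        "--------\n"
        "02/28/2022 10:37:07 PM - Process(181467.1) User(root)\n"
        "Program(runmqchl)\n"
        "ArithInsert1(1017)\n"
        "CommentInsert1(AMQRSYNA.DAT)\n"
        "AMQ9516E: File error occurred for file 'AMQRSYNA.DAT'.\n"
        "--------\n"
    )
    print(find_sync_file_errors(sample))
```

An entry that reports error code 1017 for AMQRSYNA.DAT corresponds to the condition described in this APAR; other codes for the same file may have unrelated causes such as file permissions.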
The user must then take appropriate action to resolve the issue, which in this case means rebuilding the sync file with rcrmqobj. For the steps, see the Local fix section.
The queue manager also generates the following failure data capture (FDC) record.
AMQ184577.0.FDC 2022/03/01 17:37:07.740247-8 Installation1
runmqchl 184577 1 RM738001 rflFindRecord Unknown(3F9)
Probe Id :- RM738001
Application Name :- MQM
Component :- rflFindRecord
Program Name :- runmqchl
Arguments :- -c "CHL9 " -m "qm1
Major Errorcode :- Unknown(3F9)
The fix is targeted for delivery in the following PTFs:
Version Maintenance Level
v8.1 8.1.0.15
The latest available maintenance can be obtained from
‘Recommended Fixes for IBM MQ’
http://www-1.ibm.com/support/docview.wss?rs=171&uid=swg27006037
If the maintenance level is not yet available information on
its planned availability can be found in ‘IBM MQ planned
maintenance release dates’
http://www-1.ibm.com/support/docview.wss?rs=171&uid=swg27006309
Temporary fix
Comments
APAR Information
APAR number IT43831
Reported component name MQ FOR HPE NS O
Reported component ID 5724A3904
Reported release 810
Status CLOSED PER
PE NoPE
HIPER YesHIPER
Special Attention NoSpecatt / Xsystem
Submitted date 2023-05-25
Closed date 2023-06-01
Last modified date 2023-06-01
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Fix information
Fixed component name MQ FOR HPE NS O
Fixed component ID 5724A3904
Applicable component levels