IT43831: IBM MQ Channels Might Go Unresponsive With High CPU Usage in Channel Process If Channel Synchronization Record is Corrupted

APAR status

Closed as program error.

 

Error description

IBM MQ channels might become unresponsive, with high CPU usage in the channel process (amqrmppa or runmqchl), if the channel synchronization record is corrupted. If the problem affects sender-side channels (for example, SDR or CLUSSDR), those channels send no messages. The “DIS CHS” output is likely to show no value in the SUBSTATE field.

 

AMQ8417I: Display Channel Status details.

CHANNEL(CLUSCHL1)      CHLTYPE(CLUSSDR)

RQMNAME(RQM1)            STATUS(RUNNING)

SUBSTATE( )

XMITQ(SYSTEM.CLUSTER.TRANSMIT.CLUSCHL1)
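The symptom in this output (STATUS(RUNNING) with a blank SUBSTATE) can be checked mechanically. The following is a minimal sketch, assuming the status text has already been captured as a string; the helper names `parse_chstatus` and `looks_stuck` are hypothetical, and a blank SUBSTATE is only a hint that should be confirmed against CPU usage and trace.

```python
import re

def parse_chstatus(text):
    """Return attribute -> value for one DISPLAY CHSTATUS block."""
    # Attributes are printed as NAME(value); the value may be blank.
    pairs = re.findall(r"([A-Z]+)\(([^)]*)\)", text)
    return {name: value.strip() for name, value in pairs}

def looks_stuck(attrs):
    """True when the channel reports RUNNING but SUBSTATE is present and blank."""
    return (
        attrs.get("STATUS") == "RUNNING"
        and "SUBSTATE" in attrs
        and attrs["SUBSTATE"] == ""
    )

sample = """
AMQ8417I: Display Channel Status details.
CHANNEL(CLUSCHL1)      CHLTYPE(CLUSSDR)
RQMNAME(RQM1)            STATUS(RUNNING)
SUBSTATE( )
XMITQ(SYSTEM.CLUSTER.TRANSMIT.CLUSCHL1)
"""

attrs = parse_chstatus(sample)
print(attrs["CHANNEL"], "suspect" if looks_stuck(attrs) else "ok")
```

The check requires SUBSTATE to be present so that output captured without that attribute is not flagged by mistake.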

 

If the affected channels are on the receiver side (for example, RCVR or CLUSRCVR), the channel process on the receiver side consumes high CPU and the corresponding sender-side channel (SDR or CLUSSDR) goes into retrying state.

The top output for the affected channel process shows high CPU usage.

 

PID     USER  PR  NI  VIRT    RES    SHR    S  %CPU   %MEM  TIME+    COMMAND
133101  mqm   20  0   265412  13836  11584  R  106.2  0.4   2:08.97  runmqchl
133101  mqm   20  0   265412  13836  11584  R  99.7   0.4   2:11.97  runmqchl
133101  mqm   20  0   265412  13836  11584  R  99.3   0.4   2:14.97  runmqchl
133101  mqm   20  0   265412  13836  11584  R  99.7   0.4   2:17.97  runmqchl

 

IBM MQ trace shows the channel process repeatedly calling rflSeekBytes and rflReadBytes, cycling through the same sequence of comparisons and file pointers. The arrows in the example mark where the sequence restarts.

Example:

 

19:54:22.511503   133101.1      RSESS:000001   NewFilePointer=716(0x000002cc)  <---
19:54:22.511528   133101.1      RSESS:000001   NewFilePointer=1072(0x00000430)
19:54:22.511552   133101.1      RSESS:000001   NewFilePointer=1428(0x00000594)
19:54:22.511577   133101.1      RSESS:000001   NewFilePointer=1784(0x000006f8)
19:54:22.511603   133101.1      RSESS:000001   NewFilePointer=2140(0x0000085c)
19:54:22.511628   133101.1      RSESS:000001   NewFilePointer=2496(0x000009c0)
19:54:22.511653   133101.1      RSESS:000001   NewFilePointer=2852(0x00000b24)
19:54:22.511678   133101.1      RSESS:000001   NewFilePointer=716(0x000002cc)  <---
19:54:22.511704   133101.1      RSESS:000001   NewFilePointer=1072(0x00000430)
19:54:22.511728   133101.1      RSESS:000001   NewFilePointer=1428(0x00000594)
19:54:22.511753   133101.1      RSESS:000001   NewFilePointer=1784(0x000006f8)
19:54:22.511778   133101.1      RSESS:000001   NewFilePointer=2140(0x0000085c)
19:54:22.511806   133101.1      RSESS:000001   NewFilePointer=2496(0x000009c0)
19:54:22.511831   133101.1      RSESS:000001   NewFilePointer=2852(0x00000b24)
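Spotting this pattern by eye in a large trace is tedious, so the check can be scripted. The sketch below assumes the formatted trace is available as text and that the pointer values appear as `NewFilePointer=<decimal>(<hex>)` as in the excerpt; the function names are hypothetical and this is not an IBM-supplied tool.

```python
import re

def file_pointers(trace_text):
    """Extract the decimal NewFilePointer values in the order they appear."""
    return [int(m) for m in re.findall(r"NewFilePointer=(\d+)", trace_text)]

def find_cycle(values):
    """Return the repeating run of values, or None if nothing repeats cleanly."""
    first_seen = {}
    for i, value in enumerate(values):
        if value in first_seen:
            start = first_seen[value]
            cycle = values[start:i]
            tail = values[start:]
            # Accept only if everything from the first occurrence onward
            # keeps following the same run of values.
            if all(tail[k] == cycle[k % len(cycle)] for k in range(len(tail))):
                return cycle
            return None
        first_seen[value] = i
    return None

# Pointer values taken from the trace excerpt (two passes over the same run)
run = [716, 1072, 1428, 1784, 2140, 2496, 2852]
sample = "\n".join(f"NewFilePointer={p}(0x{p:08x})" for p in run + run)

print(find_cycle(file_pointers(sample)))
```

A non-empty result on a real trace only indicates that the scan revisits the same offsets; it does not by itself prove that the sync file is corrupted.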

 

Local fix

Stop the queue manager.

Back up the queue manager.

Rename the sync file AMQRSYNA.DAT.

Start the queue manager with the -ns option (strmqm -ns QM).

Recreate the channel sync file (rcrmqobj -m QM -t syncfile).

Stop the queue manager.

Start the queue manager.
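The order of these steps matters, so it can help to write the sequence down as data before running anything. The sketch below only builds and prints the commands. Several details are assumptions not stated in this APAR: `endmqm -w` as the stop command, `mv` as the rename command (UNIX-style systems only), the `.bad` suffix, and the sync file path, which must be supplied by the caller. The backup step is left as a comment because the APAR does not name a procedure.

```python
import shlex

def workaround_plan(qmgr, sync_file, suffix=".bad"):
    """Return the local-fix commands in order; nothing is executed here."""
    return [
        ["endmqm", "-w", qmgr],                      # stop and wait for the queue manager to end
        # (back up the queue manager at this point, using your site's own procedure)
        ["mv", sync_file, sync_file + suffix],       # rename the sync file out of the way
        ["strmqm", "-ns", qmgr],                     # start with the -ns option
        ["rcrmqobj", "-m", qmgr, "-t", "syncfile"],  # recreate the channel sync file
        ["endmqm", "-w", qmgr],                      # stop again
        ["strmqm", qmgr],                            # normal start
    ]

# Hypothetical queue manager name and placeholder path; substitute your own.
plan = workaround_plan("QM", "/path/to/AMQRSYNA.DAT")
for cmd in plan:
    print(shlex.join(cmd))
```

Printing the plan first gives an operator the chance to review each command before running it by hand.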

 

Problem summary

USERS AFFECTED:

All users of IBM MQ distributed channels who have a corrupted channel synchronization record in the channel sync file. Corruption of this file is not an expected or typical usage pattern, and has not been observed as a result of any known product defect.

The channel sync file is used by all queue manager channel types except SVRCONN/CLNTCONN and AMQP channels.

 

Platforms affected:

MultiPlatform

 

****************************************************************

PROBLEM DESCRIPTION:

The IBM MQ channel process did not detect corruption in the channel synchronization record. This caused an infinite loop, which left the channel in an unresponsive state.
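To illustrate the general failure mode, the sketch below walks a chain of record offsets and stops if it ever revisits one. It is not IBM's implementation and does not reflect the real layout of AMQRSYNA.DAT: the idea that each record points to the next, the `end_marker` value of 0, and all names are assumptions made for illustration. The offsets are borrowed from the trace excerpt.

```python
class CorruptChainError(Exception):
    """Raised when a record chain revisits an offset, meaning it contains a loop."""

def find_record(next_offset, start, wanted, end_marker=0):
    """Walk a chain of record offsets looking for `wanted`.

    next_offset maps each record offset to the offset of the record after it.
    Returns True if `wanted` is reached, False if the chain ends first.
    """
    seen = set()
    offset = start
    while offset != end_marker:
        if offset == wanted:
            return True
        if offset in seen:
            # Without this check, a chain that loops back never terminates.
            raise CorruptChainError(f"offset {offset} visited twice")
        seen.add(offset)
        offset = next_offset[offset]
    return False

offsets = [716, 1072, 1428, 1784, 2140, 2496, 2852]
healthy = dict(zip(offsets, offsets[1:] + [0]))    # last record ends the chain
corrupt = dict(zip(offsets, offsets[1:] + [716]))  # last record points back to the first

print(find_record(healthy, 716, 9999))  # chain ends without a match
try:
    find_record(corrupt, 716, 9999)
except CorruptChainError as err:
    print("corruption detected:", err)
```

Tracking visited offsets trades a small amount of memory for a guaranteed exit, so the search can fail with an error the caller can report.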

 

Problem conclusion

The IBM MQ code has been modified to prevent an infinite loop if the channel synchronization record is corrupted.

 

This APAR does not address the corruption in the channel synchronization record itself, as the cause of the corruption at the time this issue was observed remains unknown.

 

With the fix applied, if the queue manager detects an infinite loop while searching for a channel record in the channel synchronization file, it generates the following error message and the channel goes into retrying state.

 

--------
02/28/2022 10:37:07 PM - Process(181467.1) User(root) Program(runmqchl)
Host(host1.ibm.com) Installation(Installation1)
VRMF(9.1.0.7) QMgr(qm1)
Time(2022-03-01T06:37:07.434Z)
ArithInsert1(1017)
CommentInsert1(AMQRSYNA.DAT)

AMQ9516E: File error occurred for file 'AMQRSYNA.DAT'.

EXPLANATION:
The filesystem returned error code 1017 for file 'AMQRSYNA.DAT'.
ACTION:
Record the name of the file and tell the systems administrator, who should ensure that file is correct and available, for example that the current user has appropriate access to the file for reading or writing.
--------

 

The user must then resolve the issue, in this case by rebuilding the sync file with rcrmqobj. For the steps, see the Local fix section.

 

The queue manager also generates the following failure data capture (FDC) record.

 

AMQ184577.0.FDC 2022/03/01 17:37:07.740247-8 Installation1

runmqchl 184577 1 RM738001 rflFindRecord Unknown(3F9)

 

Probe Id :- RM738001

Application Name :- MQM

Component :- rflFindRecord

Program Name :- runmqchl

Arguments :- -c "CHL9 " -m "qm1

Major Errorcode :- Unknown(3F9)
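The FDC prints its major error code in hexadecimal, while the AMQ9516E message reports a decimal value. A quick conversion shows the two numbers match; treating them as the same underlying code is an inference from the values, not something this APAR states.

```python
# Values copied from the AMQ9516E log entry and the FDC header
arith_insert = 1017          # ArithInsert1 in AMQ9516E (decimal)
fdc_major = int("3F9", 16)   # Major Errorcode Unknown(3F9) (hexadecimal)

print(fdc_major, fdc_major == arith_insert)
```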

 

The fix is targeted for delivery in the following PTFs:

 

Version    Maintenance Level

v8.1       8.1.0.15

The latest available maintenance can be obtained from 'Recommended Fixes for IBM MQ':

http://www-1.ibm.com/support/docview.wss?rs=171&uid=swg27006037

If the maintenance level is not yet available, information on its planned availability can be found in 'IBM MQ planned maintenance release dates':

http://www-1.ibm.com/support/docview.wss?rs=171&uid=swg27006309

Temporary fix

 

Comments

APAR Information

 

 

APAR number              IT43831
Reported component name  MQ FOR HPE NS O
Reported component ID    5724A3904
Reported release         810
Status                   CLOSED PER
PE                       NoPE
HIPER                    YesHIPER
Special Attention        NoSpecatt / Xsystem
Submitted date           2023-05-25
Closed date              2023-06-01
Last modified date       2023-06-01

 

APAR is sysrouted FROM one or more of the following:

APAR is sysrouted TO one or more of the following:

 

Fix information

Fixed component name  MQ FOR HPE NS O

Fixed component ID    5724A3904

Applicable component levels


June 2nd, 2023 | Infrared360® Blog
