So I spent the last 2 weeks dealing with ridiculous Media Resource Issues after our upgrade. The problem was always the same… Mobile Agent gets a call and answers it to dead air. The issue is intermittent and happening at multiple sites. At first we think it has something to do with the 9.1.2. 12033-1 ES we have just installed on our CUCM cluster that is having the most issues in order to address other media resources hanging on routers after the upgrade to 9.1.2(SU1). That bug is CSCum05362 and we are still resetting resources on every router on our other 2 clusters every night to prevent disaster.
Another thing first, in our old 7.x CVP/CUCM we were using mixed codec with our agents even though it wasn’t technically supported. In the new environment it is now supported and wasn’t working very well. All normal hard phones seemed to be working OK, but mobile agents doing transfers were really struggling. Every site with any Mobile Agents is now G.711 all the way through to get us back functioning after the upgrade.
Now switching the entire call flows to G.711 at the mobile agents sites helped, but we were still getting random problems with agents reporting dead air, busy signals and call drops.
This obviously sounds like media resources to me, so I go to the most obvious place to find Media Resource Problems, RTMT. Now this is super annoying, RTMT has 0 MediaResourceListExhausted messages related to the sites with the issues. We went back and forth with the business and taking random guesses at fixes when issues would come up, but would eventually attribute our fix to more specific MRG settings. We are also looking at the CUSP logs and for some reason the failed calls aren’t showing up in either. This is very odd.
The first big break we got was from CVP Call Server Logs (with debug 41 turned on)… You can find these debug settings at http://localhost:8000/cvp/diag on the call server. This is the magic interface TAC use’s they never tell you about in the guides.
We go open up C:\Cisco\CVP\logs\Error-<DATE> and look around a bit and start seeing these guys coming up every few minutes… FINALLY OUR MISSING FAILED CALLS EXIST!
20737470: 10.20.35.123: Apr 22 2014 16:13:12.454 -0400: %CVP_9_0_SIP-3-SIP_CALL_ERROR: CALLGUID = CD24C8D2C98F11E3BF02CCEF48D39580 LEGID = CDBE97A2-C98F11E3-93C9AEF8-E3535D78 - [INBOUND] - ABNORMALLY ENDING - SIP code , Reason Hdr [Q.850;cause=47] Q850 Reason [47 - Resource not available.], GW call using SURV TCL flag [true], NON NORMAL flag [true], DNIS , ANI  with AGE (msecs) 670519 and Call History : 811111111529199|-1;3012912|1; [id:5004]
As a fun note we then go check in the TCD tables on UCCE and of course these guys actually show Call Disposition of 26 (U-Abort) U-abort in the IPCC Database Schema http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.373.1105&rep=rep1&type=pdf Handbook has this to say:
“In IPCC Enterprise, the Call Manager indicated the call ended due to one of the following
reasons: network congestion, network not obtainable, or resource not available. Such reasons suggest errors in media set up”
At this point its pretty clear we have a media resource issue, but it’s intermittent, the CUCM gives us no RTMT errors and in our particular setup we have hard MTP (G.711 and G.729) we have Transcoder up the yazoo and enough CFB to host a Facebook earning call.
This is where I decide to break out the CUCM traces, since I determine there has to be a resource problem regardless what CUCM says. I go take a look at the SUB our problem phone is registered too and go pull logs.
Now I personally am seeing something like this in the logs….
83428087.015 |21:41:45.820 |AppInfo |DET-RegionsServer::matchCapabilities-- savedOption=0, PREF_LIST, regionA=Oklahoma_City regionB=MobileAgents latentCaps(A=0, B=0) kbps=64, capACount=4, capBCount=1 83428087.016 |21:41:45.820 |AppInfo |DET-RegionsServer::sortMediaPayload-capCount=1, regionA=Oklahoma_City, regionB=MobileAgents, fkCodecList= 83428087.017 |21:41:45.820 |AppInfo |DET-RegionsServer::sortMediaPayload-capCount=1, regionA=Oklahoma_City, regionB=MobileAgents, fkCodecList= 83428087.018 |21:41:45.820 |AppInfo |DET-RegionsServer::handleMatchCapabilities()-- BEFORE MATCHING LOGIC applied(after filtering), sideARefCaps=1 refCapsSaveOpt=0 otherCapsSaveOpt=0 capsA::capCount=1 (Cap,ptime)= (11,60) capsB::capCount=1 (Cap,ptime)= (4,30) 83428087.019 |21:41:45.820 |AppInfo |DET-RegionsServer::handleMatchCapabilities()-- AFTER MATCHING LOGIC applied, capsA::capCount=1 (Cap,ptime)= (11,60) capsB::capCount=1 (Cap,ptime)= (4,30) numMatchedCaps=0 83428087.020 |21:41:45.820 |AppInfo |DET-RegionsServer::sortMediaPayload-capCount=1, regionA=Oklahoma_City, regionB=MobileAgents, fkCodecList= 83428087.021 |21:41:45.821 |AppInfo |DET-RegionsServer::sortMediaPayload-capCount=1, regionA=Oklahoma_City, regionB=MobileAgents, fkCodecList= 83428087.022 |21:41:45.821 |AppInfo |DET-RegionsServer::handleMatchCapabilities()-- BEFORE MATCHING LOGIC applied(after filtering), sideARefCaps=1 refCapsSaveOpt=0 otherCapsSaveOpt=0 capsA::capCount=1 (Cap,ptime)= (11,60) capsB::capCount=1 (Cap,ptime)= (4,30) 83428087.023 |21:41:45.821 |AppInfo |DET-RegionsServer::handleMatchCapabilities()-- AFTER MATCHING LOGIC applied, capsA::capCount=1 (Cap,ptime)= (11,60) capsB::capCount=1 (Cap,ptime)= (4,30) numMatchedCaps=0 83428087.024 |21:41:45.821 |AppInfo |MediaTerminationPointControl(16)::getResourcesAllocated -- No matching caps for either side A or side B, MTP not allocated
I have no idea how to read this and didn’t until after I fixed the problem however I can clearly see its MTP related. Now we have had so many MTP issues over the years we finally broke down and configured software sessions on each router for g729 AND g711 so we would NEVER run out of sessions. Also in CUCM 7.x the software MTP were pretty unreliable.
Our g711 profile looks simple…
dspfarm profile 50 mtp codec g711ulaw maximum sessions software 100 associate application SCCP
I go monitor the resources in RTMT and indeed they aren’t getting used when we test, but Software MTP on the CUCM are getting used. This is completely baffling because in our MRGL its something like the following. Please note MTP are on top of Transcoders because the transcoders will get used for an MTP instead if they are on top. We are sure they were getting used in the old 7.5 cluster.
- G.711 and G.729 router based MTP
- Transcoders and CFB
- Music on Hold (Unicast or Multicast depending on Site)
- Call Manager Annunciators, CFB and XCD
Turns out it’s dying intermittently when all the Software MTP’s are used up and there is none left to allocate. This is starting to make some sense… other then why my hardware MTP’s aren’t being used.
I am trying to understand why mobile agents are using MTP in general so I decided to go take a peak at the mobile agent SRND. http://www.cisco.com/c/en/us/td/docs/voice_ip_comm/cust_contact/contact_center/ipcc_enterprise/srnd/75/c7mobile.html
Here is where I find my golden ticket to the solution. Straight from an older 7.5 SRND.
DTMF Considerations with Mobile Agent
“MTP resources might be required for mobile agents who will be consulting an IVR or other network component that requires DTMF to navigate. The Mobile Agent feature relies on Cisco Unified CM CTI ports, which do not support in-band DTMF (RFC 2833). If the endpoints being used by mobile agents supports only in-band DTMF (or if they are configured to use in-band DTMF per RFC 2833), then Unified CM will automatically insert MTP resources because of the capabilities mismatch. If the mobile agent call flow requires in-band DTMF (RFC 2833), a sufficient amount of MTP resources should be made available.”
The odd part of this is that these are incoming calls to agents and have no need for DTMF but this has to be it. I decided its gamble time, let’s add pass-through to the mtp profiles.
dspfarm profile 50 mtp codec g711ulaw codec pass-through maximum sessions software 100 associate application SCCP !
A few test calls later and RTMT confirming we have the issue fixed, we are using the hardware MTPs. A show sccp connections command on the ISR with the MTP’s configured will show them actively in use.
TEST3845GW01#sh sccp conn sess_id conn_id stype mode codec sport rport ripaddr conn_id_tx 151156076 151790214 mtp sendrecv pass_th 20036 22862 10.10.233.1 151156076 151790213 mtp sendrecv pass_th 16612 21562 10.120.9.9 151156125 168054597 mtp sendrecv pass_th 22330 25234 10.120.9.40 151156125 168054596 mtp sendrecv pass_th 16620 16692 10.98.34.4
I am still a little angry by this, clearly a feature change or something in CUCM but I am able to at least do some intelligent searching now. I search something like MTP DTMF pass-through and this blog shows up about titled “When MRGL Order Doesn’t Matter (CUCM 8.5+) located here: http://www.internetworkingerror.com/2013/03/when-mrgl-order-doesnt-matter-cucm-85.html
Long in short this is a great blog and goes through all the steps I did and comes to the ultimate conclusion this is a new feature!
As said in the blog – “According to Cisco, this behavior is expected but documented in bugID CSCtw77944 – Codec pass-thru MTP is preferred over non codec pass-thru higher in MRGL. The bug appears to affect CUCM 8.5-9.0(1). CUCM will prefer a passthrough MTP (CUCM Software MTP) over a MTP without passthrough so that SRTP or T.38 can be negotiated later in the call. “
This one still leaves a lot to be figured out. CUSP should show failed calls and RTMT surely should alert on having no resources available. The other linked blog also contains some great information on the CAP Codes in the MTP traces. I just wish I would have found that blog 10 days early and saved myself about 60 hours of troubleshooting and tail chasing.