Saturday, February 17, 2007

Looking for Your Wallet Where the Light is Good

There's a joke about a drunk on his hands and knees under a street light, looking for his wallet. Guy comes by, asks where he saw it last.

"Over there in that alley."

"Why are you looking for it here?"

"The light is better."

I had a situation where that actually makes sense. Call it the Case of the Weird Cisco Malloc, solved (by someone else).

We'd been having malloc errors that shut down one interface on a Cisco GSR 12008 router. They were very intermittent, and started the day before Cisco announced three vulnerabilities. The same thing also hit another router in the district (but did not affect two others). I found a lot of things we could do to protect the router, including receive ACLs and requiring a TTL of 255 for BGP traffic to ensure packets actually came from a neighboring router. Didn't help. I was going nuts trying to find a way to dig information out of the line card, including memory and CPU status. Never did find anything. I was on the point of borrowing a line card from someone else when a coworker pulled ours to try blowing any dust off. Turns out there was some, but the main thing was a RAM module dangling loose. He seated the module and the problem went away.

Blows me away that two routers could fail from something like that at the same time...

Also blows me away that there would be no error messages or that the failure would be intermittent. And that line card status is so unobtainable. These things are computers in their own right. They should have status instrumentation.

The simplest troubleshooting algorithm is a kind of binary search: divide the possibilities in half. From now on I think I will consciously weigh the expense of checking a possible cause against its likelihood. That is, in this case hardware still seems an unlikely explanation (two routers failing the same way? Experiencing the same rate of thermal creep? Not buying it!). But it's so cheap to check, if you have a spare line card. We didn't have a spare long haul GBIC, but we could have looked at everything else very easily. In other words, we could have looked where the light was good rather than wade through the various contradictory and redundant Cisco MIBs trying to make the thing reveal its secrets.
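Here is a rough sketch of what weighing expense against likelihood might look like if you wrote it down. Every number in it is made up for illustration (I never measured any of this), and it assumes exactly one cause is to blame and that each check conclusively confirms it or rules it out. Under those assumptions, checking in order of likelihood per unit of cost is the usual rule of thumb.

```python
def order_checks(checks):
    """Sort (name, probability, cost) tuples by probability per unit cost, highest first."""
    return sorted(checks, key=lambda c: c[1] / c[2], reverse=True)


def expected_cost(ordered):
    """Expected hours spent before the fault is found.

    Assumes exactly one cause is responsible and that each check
    conclusively confirms or rules out its cause.
    """
    total = 0.0
    spent = 0.0
    for _name, prob, cost in ordered:
        spent += cost          # hours spent by the time this check finishes
        total += prob * spent  # weighted by the chance this check finds the fault
    return total


# Hypothetical guesses: (what to check, probability it is the cause, cost in hours).
checks = [
    ("dig through SNMP MIBs for line card status", 0.30, 8.0),
    ("harden against the announced vulnerabilities", 0.50, 4.0),
    ("pull and reseat the line card", 0.20, 0.25),
]

plan = order_checks(checks)
for name, prob, cost in plan:
    print(f"{name}: p={prob}, cost={cost}h")
print("expected hours, weighted order:", expected_cost(plan))
print("expected hours, original order:", expected_cost(checks))
```

With guesses like these, the unlikely but nearly free check goes first, which is exactly the looking-where-the-light-is-good case.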
