I was talking with a customer this week about an all-hands call they had for a performance issue within one of their applications. After working in corporate IT for 20 years, I can easily say I don’t yearn to be on those calls, hastily looking for answers while an army of managers waits anxiously for anybody to speak up with a possible resolution. Even worse was the helpless feeling of being part of the manager militia. Despite my distaste for the all-hands calls, I do miss the problem solving.
As I was talking about the issue, my mind started to recall all the ways I had seen performance issues solved. After I left the building, and for days afterward, my mind was still traveling through the issues I had experienced. The customer’s issue was gone by the second day; however, experience has taught me that it will come back.
I’ve always been a fan of the top 10 system admin truths (https://iiseblogs.org/2014/10/22/top-10-system-administrator-truths-updated/) so I thought I would create my own performance troubleshooting truths. This list is part brain dumping ideas for that customer and part self-treatment to get them out of my head and onto paper.
This list is in no particular order:
- Monitoring tools can give teams a false sense of confidence, mainly because their averaging/sampling nature hides issues.
- I learned this truth while helping to troubleshoot user complaints about many slow applications at some of our remote sites. When we talked to the networking team responsible for these networks, they quickly showed us graphs and reports that displayed stellar utilization. After being dismissed by the team monitoring the links, we created a process to ping the default router IP and record the latency. What we found was that there were extremely high latency spikes in the morning (> 400 ms). After the big spikes, the problem would go away.
The issue was people turning on their computers and Outlook downloading all their messages at around the same time. We resolved the issue by adding some QoS policies for the Outlook traffic. In addition, the latency graphs we presented helped to increase the budget for more network bandwidth at the offices.
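To illustrate how averaging can hide a spike like the one above, here is a minimal sketch. The sample values, the 10-second sampling interval, and the function name `summarize_windows` are all made up for illustration; they are not the data or tooling we used at the time.

```python
from statistics import mean


def summarize_windows(samples_ms, window):
    """Return (average, maximum) for each window of consecutive latency samples."""
    summaries = []
    for start in range(0, len(samples_ms), window):
        chunk = samples_ms[start:start + window]
        summaries.append((mean(chunk), max(chunk)))
    return summaries


# Hypothetical data: one sample every 10 seconds for 5 minutes (30 samples).
# 28 samples are a healthy 20 ms; 2 samples are 450 ms spikes.
samples = [20.0] * 28 + [450.0] * 2

for avg, peak in summarize_windows(samples, window=30):
    print(f"avg={avg:.1f} ms  max={peak:.1f} ms")
```

With these made-up numbers the 5-minute average stays under 50 ms, which looks healthy on a graph, while the maximum shows the 450 ms spikes the users actually felt. Reporting a maximum or a high percentile next to the average is one way to keep the spikes visible.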
- Look for applications excessively making disk reads because reading from disk is slow.
- Due to recent improvements in disk technology, this truth is starting to become less true. Flash storage combined with technologies like NVMe continues to halve the latency to data. However, it’s still not as fast as memory.
I’ve seen this truth manifest many times with DBAs fixing performance issues by adding indexes on commonly read columns. I’ve also seen this with web servers. We had been receiving complaints from customers about our primary web application being slow. Every time someone went to check the site, it loaded just fine. Since the complaints persisted, we created synthetic transactions that would log in, edit some fields, and log off. What we found was that after more than 30 minutes of inactivity, the next transaction would be slow.
We discovered the issue was that the site wasn’t heavily used, and Microsoft’s IIS would unload files from memory based on default settings for inactivity and size of files. This was causing IIS to pull the files from disk, which created the slow application access.
Our resolution was to change the file size registry keys, edit the inactivity timers, and keep the site warm with the synthetic transactions.
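A synthetic transaction probe along these lines can be sketched in a few lines of Python. This is not the tool we used; the function names, the 2-second threshold, and the idea of passing the transaction in as a callable are assumptions made for illustration. The real transaction (log in, edit some fields, log off) would replace the placeholder.

```python
import time


def time_transaction(transaction, slow_threshold_s=2.0):
    """Run one synthetic transaction and report how long it took."""
    start = time.perf_counter()
    transaction()
    elapsed = time.perf_counter() - start
    return {"elapsed_s": elapsed, "slow": elapsed > slow_threshold_s}


def run_probe(transaction, runs, interval_s, slow_threshold_s=2.0):
    """Repeat the transaction, waiting interval_s between runs."""
    results = []
    for i in range(runs):
        results.append(time_transaction(transaction, slow_threshold_s))
        if i < runs - 1:
            time.sleep(interval_s)
    return results


def placeholder_transaction():
    # Hypothetical stand-in for the real log in / edit / log off steps.
    time.sleep(0.01)


if __name__ == "__main__":
    for result in run_probe(placeholder_transaction, runs=3, interval_s=0.1):
        print(f"{result['elapsed_s']:.3f} s  slow={result['slow']}")
```

Varying `interval_s` is what exposes an idle-related problem: a short interval keeps the site warm, while a long one lets the first transaction after the idle period show up as slow.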
- Watch recent database queries during a performance issue.
- I say during the issue because logging all the queries can be excessive.
For application architectures that use a DBMS, I’ve seen many performance issues found by looking at the recent queries. Most of the time, DBAs can fix the issue by masking inefficient queries with indexes, but I have also seen business intelligence teams running queries on production tables without realizing their queries were blocking others.
- Ping is your friend.
- In addition to the example in the first truth, I have two more examples of where ping helped us find an issue that our other tools did not.
The first one was an application slowness issue where we were seeing slow responses coming from the database. The DBAs were involved early and weren’t finding blocking or indexing opportunities. Performance counters showed around 25% processor utilization and less than 20 Mbps of traffic on 1 Gbps interfaces. Mainly to see whether we had erratic connectivity (see system admin truths: switches die one port at a time), we started continuous pings and saw latency so high that some responses timed out.
The issue was that the older server did not have any type of multipathing feature, so one processor was taking the entire network I/O load. It was a 4-processor server, and the 25% was one overloaded processor. Our resolution was to create an 802.3ad LACP trunk across four NICs, which balanced the network load across the processors.
Most recently, I had a customer talk to me about multiple Outlook hangs. We discussed the usual culprits, and he implemented some best practices, but the hangs persisted. I asked him to run continuous pings to the Exchange server and record the latencies, and what he found was lost packets. After he showed his results to their network team, they discovered flapping on the port that provided connectivity outside of their site. The network team resolved the flapping and the hangs ceased.
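Recording ping results is only useful if someone summarizes them, so here is a rough sketch of how a saved ping log could be boiled down to loss and spike counts. It assumes the log contains reply lines with a `time=` (or `time<`) value in milliseconds and Windows-style `Request timed out.` lines for lost packets; other ping implementations report loss differently. The function name, sample lines, and address 10.0.0.1 are hypothetical, and the 400 ms spike threshold is borrowed from the first truth.

```python
import re

# Matches "time=12ms", "time<1ms", and "time=0.512 ms".
LATENCY_RE = re.compile(r"time[=<]\s*([\d.]+)\s*ms")


def summarize_ping(lines, spike_ms=400.0):
    """Count replies, timeouts, and latency spikes in saved ping output."""
    latencies = []
    lost = 0
    for line in lines:
        match = LATENCY_RE.search(line)
        if match:
            latencies.append(float(match.group(1)))
        elif "timed out" in line.lower():
            # Assumption: a lost packet is logged as "Request timed out."
            lost += 1
    sent = len(latencies) + lost
    return {
        "sent": sent,
        "lost": lost,
        "loss_pct": 100.0 * lost / sent if sent else 0.0,
        "max_ms": max(latencies) if latencies else None,
        "spikes": sum(1 for ms in latencies if ms > spike_ms),
    }


# Hypothetical log lines for illustration only.
sample_log = [
    "Reply from 10.0.0.1: bytes=32 time=12ms TTL=64",
    "Reply from 10.0.0.1: bytes=32 time=450ms TTL=64",
    "Request timed out.",
    "64 bytes from 10.0.0.1: icmp_seq=4 ttl=64 time=0.512 ms",
]
print(summarize_ping(sample_log))
```

A summary like this, with timestamps added, is the kind of evidence that is easy to hand to a network team.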
- Queue performance counters point you in the right direction quickly.
- A quick example is as follows: We had been receiving slowness complaints from users of an imaging application. After looking at many of the performance counters, we noticed high CPU queue values but not high utilization. The application’s architecture forced us into a single application server (we couldn’t scale out). The issue was that there weren’t enough processors servicing requests. This one was an easy fix: we added vCPUs.
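The pattern in that example, a long queue without high utilization, can be expressed as a simple check. The sketch below is only an illustration: the function name, the labels it returns, and both thresholds (more than 2 queued items per processor, 80% busy) are rule-of-thumb assumptions rather than values from the incident, and the counter samples are made up.

```python
from statistics import mean


def diagnose_cpu(queue_samples, utilization_samples, cpu_count,
                 queue_per_cpu_limit=2.0, busy_pct=80.0):
    """Classify CPU pressure from queue-length and utilization samples."""
    queue_per_cpu = mean(queue_samples) / cpu_count
    queued = queue_per_cpu > queue_per_cpu_limit  # assumed threshold
    busy = mean(utilization_samples) >= busy_pct  # assumed threshold
    if queued and not busy:
        return "queued_not_busy"   # work waits although utilization looks fine
    if queued and busy:
        return "saturated"
    if busy:
        return "busy_keeping_up"
    return "no_pressure"


# Hypothetical samples: 2 vCPUs, about 10 items queued, about 35% utilization.
print(diagnose_cpu([9, 11, 10], [35, 40, 30], cpu_count=2))
```

A utilization graph alone would have shown nothing alarming for these samples; the queue counter is what points at the processors.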