Another service engagement has been completed. The problem was found and the client is happy. For the customer it had been a long ordeal, because the problem had been passed back and forth for some time between the software supplier, the server administrators, and the application and network managers. Because the environment was not a simple client-to-single-server application but considerably more complex, we chose a process of elimination. The setup: ESX Server on 2 blade systems, each with 2 virtual servers and multiple individually distributed application services, all combined in an MS cluster, so that services could be distributed at both the VM level and the MS cluster level. Of course, everything was connected with double redundancy.
We used GeNiEnd2End Application to make the sporadic problems described by the users visible. In addition, we used GeNiEnd2End Network to view the network infrastructure in terms of end-to-end quality.
As is so often the case, there is rarely just one big problem; most of the time there are many small problems that add up to a big one.
We could see that the application sporadically had longer transaction times. But the network was not flawless either: the central WAN route had packet loss. Moreover, the measured application transaction stalled whenever locally or remotely connected workstations lost 1-2 packets, and as a result it took 2.5 seconds instead of one second.
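To put those numbers in perspective, here is a minimal back-of-the-envelope sketch. It assumes, as a simplification, that the entire extra 1.5 seconds is time spent waiting for the lost packets to be retransmitted; the function name is hypothetical and is not part of any GeNiEnd2End tool.

```python
def implied_stall_per_loss(base_s: float, observed_s: float, lost_packets: int) -> float:
    """Average extra delay per lost packet, assuming all extra time is loss-related."""
    if lost_packets <= 0:
        raise ValueError("lost_packets must be at least 1")
    return (observed_s - base_s) / lost_packets


# Figures from the measurement: 1 s normally, 2.5 s with 1-2 lost packets.
for lost in (1, 2):
    stall = implied_stall_per_loss(1.0, 2.5, lost)
    print(f"{lost} lost packet(s): about {stall:.2f} s stall per loss")
```

Under that assumption, each lost packet costs roughly 0.75 to 1.5 seconds, which is why even a handful of drops is clearly noticeable to users.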
The WAN line was tuned by installing a new router, which significantly improved the usable bandwidth and also eliminated all packet loss on that line.
To narrow down the cause and location of the packet loss in the LAN, GeNiEnd2End MultiTrace was used. This made it easy to perform a 3-point clamp measurement and investigate a problematic transaction. The measurement points were placed on the VM guest, on the server switch via SPAN, and on the client. The investigation showed that the packets were being dropped between the switch and the server VM host.
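The idea behind such a multi-point measurement can be sketched in a few lines. This is only an illustration of the principle, not how GeNiEnd2End MultiTrace works internally. It assumes each capture has already been reduced to a set of packet identifiers for one direction of one transaction; the capture point names and the packet numbers are made up for the example.

```python
from typing import Dict, List, Set, Tuple

# Capture points in the order a client-to-server packet passes them (hypothetical names).
PATH: List[str] = ["client", "switch_span", "vm_guest"]


def loss_per_segment(
    path: List[str], seen: Dict[str, Set[int]]
) -> List[Tuple[str, str, Set[int]]]:
    """For each pair of adjacent capture points, return the packets that were
    seen at the upstream point but are missing at the downstream point."""
    result = []
    for upstream, downstream in zip(path, path[1:]):
        missing = seen[upstream] - seen[downstream]
        result.append((upstream, downstream, missing))
    return result


# Made-up example data: packets 4 and 7 never reach the VM guest.
captures: Dict[str, Set[int]] = {
    "client": set(range(1, 11)),
    "switch_span": set(range(1, 11)),
    "vm_guest": set(range(1, 11)) - {4, 7},
}

segments = loss_per_segment(PATH, captures)
for upstream, downstream, missing in segments:
    print(f"{upstream} -> {downstream}: lost {sorted(missing)}")
```

With this example data, the segment from client to switch shows no loss, while the segment from switch to VM guest is missing packets 4 and 7, which mirrors the kind of result we saw in the real measurement.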
These losses were also confirmed by the measurement with GeNiEnd2End Network, which showed packet loss between the server switch and the vServer over the whole measurement period.
Several attempts by the server team to isolate the problem further followed. At this point, only GeNiEnd2End Network and GeNiEnd2End Application were used for verification: with these tools we obtained an analysis with minimal effort, in a single-digit number of minutes!
All other changes, e.g. using vMotion to test the other blade, or switching within the MS cluster, were unfortunately unsuccessful.
Finally, we achieved an improvement by replacing the virtual network card inside the guest VM.
The measurements now show that there is no longer any packet loss within the VM.
Thus, the problem was identified with comparatively little effort and ultimately resolved in the right department.