The microburst that killed my network

The IT department at the stock trading firm was under serious pressure. Trade execution times, tracked by sophisticated monitoring systems, were becoming increasingly irregular – latency was creeping into the tens of milliseconds, twice as long as in previous weeks, and certainly slower than competitors’ systems. Even without a glance at the numbers, the brokers knew something was amiss – revenue was dropping, and they suspected their trades were being executed just behind competitors beating them to the markets.

But things didn’t add up – outside their company walls they were running at only 50% of their available access link bandwidth, and network latency measurements showed far less delay than the trade monitoring system reported. The same was true in the LAN and in the data center – each appeared to be working at peak performance. There was something they couldn’t see, and it was costing them millions.
Today almost half of all trades executed globally are initiated and completed by computers, not humans. These algorithmic trading platforms, as they are known, constantly scan international markets for price discrepancies that offer a nearly instantaneous, guaranteed return to financial institutions that can move near the speed of light to buy and sell across global markets. It’s a proven and increasingly important strategy, but you have to be fast.
Although stock trading is an industry severely affected by network performance issues, many other verticals are similarly affected – anything transactional that involves time or money feels the impact of delay, packet loss and capacity issues. Today’s networks don’t just have to keep up with the speed of business; they define the speed limit.
So what was happening back at the broker? What were they missing? Knowing nothing had changed inside their company walls, they turned to their service provider for help.
The Microburst Phenomenon
Their operator, which specialized in serving financial markets, had seen this scenario before, and luckily their networks were well instrumented. They had visibility into per-flow, one-way latency at microsecond resolution, but those measurements also came back clean. That’s when they knew it was time to look deeper. By increasing the granularity of their utilization monitoring to a per-second basis, they discovered that even though bandwidth utilization appeared normal over their standard, five-minute monitoring intervals, there were microbursts of data that went well beyond the provisioned bandwidth – up to 140% – even if only momentarily.
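To make the averaging effect concrete, here is a minimal Python sketch – not the operator’s actual tooling – that buckets packet records into configurable intervals. The records list, the link_bps rate and the example numbers are all illustrative assumptions.

```python
# A minimal sketch of why the averaging interval matters. Hypothetical
# packet records are (timestamp_seconds, frame_bytes) tuples, e.g.
# exported from a capture; link_bps is the provisioned link rate.
from collections import defaultdict

def utilization(records, link_bps, interval_s):
    """Return {bucket_start: utilization_ratio} over the given interval."""
    buckets = defaultdict(int)
    for ts, nbytes in records:
        bucket = int(ts // interval_s) * interval_s
        buckets[bucket] += nbytes * 8          # bits carried in this bucket
    return {b: bits / (link_bps * interval_s) for b, bits in buckets.items()}

# Five-minute averages can look healthy while one-second buckets reveal
# bursts well past 100% of the provisioned rate (values are assumed):
# coarse = utilization(records, link_bps=100e6, interval_s=300)
# fine   = utilization(records, link_bps=100e6, interval_s=1)
# print(max(coarse.values()), max(fine.values()))  # e.g. 0.5 vs. 1.4
```

The same traffic produces both results; only the measurement window changes, which is why a five-minute view can hide a 140% burst entirely.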
They were measuring these traffic statistics just before the trader’s data entered their network – ahead of their network interface devices’ (NIDs’) regulators – so they could see the microbursts that were causing very short-term packet loss: just a few frames dropped as the peaks hit. This small, almost negligible loss was enough to nearly double trading time – the missing packets had to be retransmitted to complete the buy or sell request, adding precious milliseconds to each transaction.
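For context on why those few frames were dropped, below is a simplified single-rate token-bucket policer of the general kind a NID’s ingress regulator implements. The Policer class and its cir_bps and cbs_bytes parameters are illustrative assumptions, not the firm’s actual configuration.

```python
# A minimal token-bucket policer sketch: frames conform while tokens
# remain; a burst that exhausts the committed burst size (CBS) gets
# dropped, forcing the sender to retransmit.
class Policer:
    def __init__(self, cir_bps, cbs_bytes):
        self.rate = cir_bps / 8.0       # token fill rate, bytes/second
        self.depth = cbs_bytes          # bucket depth = committed burst size
        self.tokens = cbs_bytes         # start with a full bucket
        self.last = 0.0                 # timestamp of last frame seen

    def conforms(self, ts, frame_bytes):
        # Refill tokens for elapsed time, capped at the bucket depth.
        self.tokens = min(self.depth, self.tokens + (ts - self.last) * self.rate)
        self.last = ts
        if frame_bytes <= self.tokens:
            self.tokens -= frame_bytes
            return True                 # frame forwarded
        return False                    # frame dropped; TCP must retransmit
```

A burst arriving faster than the fill rate drains the bucket in milliseconds, so only the last few frames of the peak are lost – exactly the small, intermittent loss the operator observed.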
How did they fix the problem? There are a number of ways to react when you know microbursting is affecting application performance. A simple solution would be to increase committed or excess bandwidth or burst size limits; another would be to shape and smooth out any traffic sharing the same link that is not sensitive to latency or jitter. With advanced traffic monitoring, classification and per-flow conditioning at the ingress to the network, either the service provider or the end user can optimize their service flows for bandwidth efficiency, performance, or a combination of both, depending on each application’s requirements.
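As a rough illustration of the shaping option, the sketch below queues and delays frames instead of dropping them; the shape function, its inputs and rate_bps are hypothetical, intended only to show the trade-off.

```python
# A minimal shaping sketch: unlike a policer, a shaper buffers
# non-conforming frames and releases them at the shaped rate, trading
# a small queuing delay for zero loss.
def shape(arrivals, rate_bps):
    """arrivals: time-ordered list of (timestamp_s, frame_bytes).
    Returns the departure timestamp of each frame at the shaped rate."""
    departures, ready = [], 0.0
    for ts, nbytes in arrivals:
        start = max(ts, ready)               # wait if the link is still busy
        ready = start + nbytes * 8 / rate_bps  # serialization at shaped rate
        departures.append(ready)
    return departures
```

Applied only to flows that tolerate a little extra delay, this kind of smoothing keeps the shared link under its committed profile so latency-critical traffic is never the traffic that gets dropped.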
Microbursts are not a new thing – they happen all the time, over all kinds of networks, and affect a wide range of applications. We just don’t notice them because of their highly transient nature. The evolution of Ethernet and IP/MPLS performance monitoring has reached a point where measurements are not only highly precise but also highly granular – we now have the tools we need to detect these short-term events and take action to ensure applications run at peak performance. With these capabilities now available as standard features in cost-efficient NIDs and monitoring platforms, they are easily within reach of both service providers and their customers. When performance puts your business ahead, you now have the ability to set your own speed limit and avoid those million-dollar tickets.