Covering Scientific & Technical AI | Friday, January 24, 2025

So You Want To Be A Financial Services HPC Programmer? 

Financial services companies care about three things: Money, security, and programming on the fastest systems they can afford. You can't be a competitive bank, brokerage, trader, or hedge fund if you don't have all three, and ultimately, these all come down to money, one way or another.

At the recent HPC on Wall Street conference in New York, some high-flying programmers in the financial services sector talked about programming and what they think is important when it comes to squeezing the most performance possible out of infrastructure. A keynote session about HPC code writing was hosted by Jeffrey Birnbaum, the founder and CEO of 60East Technologies, which makes messaging middleware for financial services companies. Before the talk got underway, he polled the crowd, and about half of the people in the hall were programmers.

"If you are a programmer, the landscape of how you do your job has changed in dramatic ways – and will continue to change," Birnbaum said.

No one knows this better than the London Stock Exchange, which prides itself on having the lowest latency among the dozen major exchanges operating around the globe today. Moiz Kohari, vice president of advanced platform engineering at the LSE, hinted that the exchange was working on a new low-latency scheme for connecting clients to the market. While it is important that the LSE provide a low-latency, jitter-free environment, Kohari explained that the market matching engine business represents less than half of the LSE's business and that it also provides clearance, custodial, and deposit services and hooks into many of the central banks in other markets as well.

"On the front end, it is absolutely critical that we end up being ultra-low latency and jitter-free," Kohari said. "When we speak to our clients, it is not really about latency, but the jitter is where it ends up hitting pretty hard. If you look at the entire stack – operating system, applications, and hardware itself – you need to try to figure out where your jitter comes from. What we found is that the TCP/IP stack ends up causing the most jitter for us. We are changing the entire market interface to come up with ways of allowing clients to connect over memory-based interfaces. When you come in using memory-based interfaces, you are coming in over a lossless fabric and you end up providing a very low latency, jitter-free environment. This actually really changes the game."
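To make the distinction between latency and jitter concrete, here is a minimal Python sketch. It is not the LSE's method: the function name `latency_summary`, the choice of "99th percentile minus median" as a jitter figure, and the sample values are all assumptions made for illustration.

```python
import statistics


def latency_summary(samples_ns):
    """Summarize latency samples given in nanoseconds.

    Jitter is reported here as p99 minus median, which is one common
    convention among several (an assumption, not a standard).
    """
    if not samples_ns:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ns)
    n = len(ordered)
    median = statistics.median(ordered)
    # Nearest-rank 99th percentile, computed with integer arithmetic:
    # rank = ceil(0.99 * n), clamped to at least 1.
    rank = max(1, (99 * n + 99) // 100)
    p99 = ordered[rank - 1]
    return {"median": median, "p99": p99, "jitter": p99 - median}


# Two hypothetical links with the same mean latency (125 ns).
steady = [125] * 100                # every message takes 125 ns
spiky = [100] * 98 + [1350] * 2     # usually faster, occasionally far slower

print(statistics.mean(steady), latency_summary(steady))
print(statistics.mean(spiky), latency_summary(spiky))
```

Both hypothetical links average 125 nanoseconds, but the second one has a tail more than ten times slower than its median, which is the kind of behavior a mean latency figure hides.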

Kohari did not get into specifics of this memory-based interface, but after the keynote he told EnterpriseTech that it was indeed direct communication into the server processor memory complex on the systems in the matching engines. He added that the New York Stock Exchange and the NASDAQ exchange, to name two of the LSE's big rivals, had front-end latencies on the order of 250 to 500 nanoseconds, which is fast in terms of the kind of port-to-port hopping you can expect in an Ethernet switch running in a datacenter, but that the LSE was already down to around 125 nanoseconds before its advanced platform group started looking at the memory-based interface that is under development. "Our intent was to improve jitter control, but you also end up reducing latency as a by-product," Kohari said. How much lower the LSE can push it, Kohari was not at liberty to say.

At the LSE, the front-end programs that financial services firms plug into to do their trading are written in Java, C, or C++, and Kohari said that it was critical to match the right hardware to the right job for those applications to maximize performance. If a job really only scales well across six threads, it doesn't make much sense to put it on a machine that has 15, 16, or 18 cores and 30, 32, or 36 threads, as the current top-end Xeon chips from Intel do. Having said that, at some point, an application has to reach outside of the processor and memory complex for data, and the latencies to data outside of this area will matter too.
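One way to see why extra threads can be wasted on a job that scales poorly is Amdahl's law, which the speakers did not cite; it is brought in here only as an illustrative model. The sketch below assumes a hypothetical job in which 80 percent of the work parallelizes perfectly, and the function name `amdahl_speedup` is likewise an assumption.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Ideal speedup under Amdahl's law.

    parallel_fraction: share of the work that can run in parallel (0..1).
    workers: number of threads or cores applied to that share.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Hypothetical job: 80 percent parallel, 20 percent serial.
for threads in (1, 6, 18, 36):
    print(threads, round(amdahl_speedup(0.8, threads), 2))
```

Under these assumptions, going from 6 to 36 threads raises the ideal speedup only from about 3.0 to about 4.5, so a sixfold increase in threads buys roughly 1.5 times the throughput. Real jobs also pay synchronization and memory costs that this model ignores.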

What is obvious from the above statements is something that in an increasingly virtualized computing world is hard to see: hardware. The techies on the financial HPC coding panel said it again and again in many different ways, but they all were saying the same thing, that programmers really need to understand iron to do high performance programming.

"Most of the best programmers that I find are either self-taught or come from an electrical engineering background, not a computer science background," said Birnbaum. And to be a high performance programmer, you have to know the data structures of the program and the cache structures of the underlying chips and make sure they mesh properly.

"Physics. I love physics programmers," said Brian Bulkowski, who is founder and CTO of Aerospike, a NoSQL database that has been optimized for memory and flash. Aerospike can handle 2.5 million transactions per two-socket server node based on Intel's latest top-end "Haswell" Xeon E5-2699 v3 processors without resorting to in-memory processing. In that test, 99 percent of the transactions finish in under 1 millisecond.

Back in 2000, Bulkowski explained, he came to the realization that transistors were not going to get any faster and that cores were not going to get any faster, and it changed the way he thought about programming. Aerospike is coded completely in C, and that is done to get as close to the iron in as simple a fashion as possible in an effort to boost performance.

"I stopped thinking about operations and I started to think about cache and cache locality because the name of the game started being memory and memory speed. So instead of having to think about this number of lines decomposing into this number of operations, I started to think about when is the last time this particular byte was accessed, how frequently was it, and where is it going to be in cache?" A missed branch prediction – which forces the processor to go out to L3 cache to get a bit of data to chew on – is ten times more expensive in Linux than grabbing the data that is already lower in the memory hierarchy.
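The shift from counting operations to thinking about cache locality can be illustrated with a toy model. The Python sketch below does not measure real hardware: it assumes a row-major array of 8-byte elements and 64-byte cache lines, and it counts how often consecutive accesses land on a different cache line. The function name `cache_line_switches` and all of its parameters are hypothetical.

```python
def cache_line_switches(rows, cols, order, elem_size=8, line_size=64):
    """Count cache-line changes between consecutive element accesses.

    The array is assumed to be stored row-major. order="row" walks it in
    storage order; order="col" walks it column by column. This is a toy
    model with no cache capacity, associativity, or prefetching.
    """
    if order == "row":
        indices = ((r, c) for r in range(rows) for c in range(cols))
    elif order == "col":
        indices = ((r, c) for c in range(cols) for r in range(rows))
    else:
        raise ValueError("order must be 'row' or 'col'")

    switches = 0
    previous_line = None
    for r, c in indices:
        address = (r * cols + c) * elem_size  # byte offset in row-major layout
        line = address // line_size           # which cache line holds it
        if previous_line is not None and line != previous_line:
            switches += 1
        previous_line = line
    return switches


# Same 4,096 reads, very different memory behavior.
print(cache_line_switches(64, 64, "row"))  # 511
print(cache_line_switches(64, 64, "col"))  # 4095
```

Both traversals perform the same number of reads, but in this model the column-order walk moves to a new cache line on every access while the row-order walk does so only once every eight elements.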

"Performance and whatever latency component you are working with, is moving closer and closer to the CPU," said Dan Lynn, technical strategist for CodeFutures, a NewSQL database provider. "There is this explosion in cloud computing and cluster computing that works really well for a variety of analytics, where you have vast amounts of data but it is not rapidly changing data and you are trying to analyze it in a short period of time. But you are still subject to the latencies in the networking interfaces in the cluster, and you have all of the additional reliability concerns. So what you see is all of these parallel concepts being moved inside the box as you get closer and closer to the CPU or multiple CPUs. A lot of the strategies you apply to networking you end up throwing inside the box as well. You think less and less about internode communications and where are your variables being stored in RAM. That takes some pretty low-level tuning. It is not that necessarily that everybody in your organization needs to know how to do that."

Lynn added that shifting from single-threaded serial programming to parallel programming across nodes was a "real challenge" for a lot of programmers. While tools have advanced in the supercomputing and enterprise spaces to make parallel programming easier, it is, like painting the ceiling, something that is probably going to be difficult for serial thinkers, as most human beings are. "It is definitely a big mindset shift, and it requires a lot of study for a programmer. You can't take the skills you had from 1999 or 2007 and have them be quite so applicable now."

A big problem that financial HPC programmers have is the lack of tools. Birnbaum pressed the panel for the tools that they use to do their high performance and parallel programming, and Kohari said that the LSE had to create a lot of its own tools to goose the performance of the code. Various dynamic tracing tools for Linux were cited as being useful (KTAP is the obvious one), as was OProfile, a code profiler, and perf-top and NumaTOP, the latter being open sourced by Intel to help with memory locality in NUMA-based tightly clustered systems that have a shared memory space.

"Verification is a big, big deal and we don't have tools for that," explained Birnbaum, and Bulkowski added that while unit testing is important, it is not particularly useful when it comes to performance tuning. "They don't help you get to performance because when you are chasing performance issues based on cache locality and NUMA awareness, there is no gap. So what we have to do is build it up." Birnbaum said a few nice things about Google's ThreadSanitizer, a fast data race detector for C, C++, and Go.

Understanding the hardware is getting more and more important as processors, networks, and systems get more complex and applications are interdependent. But a good programmer needs more than that. Kohari quipped that a good programmer "drinks a lot of coffee, stays up late at night, and has tenacity," while Lynn said the most important thing was the speed at which a programmer learns. "The two best programmers I knew – one that I worked with, and one at my company today – they don't talk very much," said Bulkowski. "They sit down. They think about things. They write a lot of code."
