Sunday, September 13, 2009

A Scala success story: commercial usage of Scala at Capital IQ ClariFI

Update: There are some additional comments and discussion on reddit. I've also made minor edits to the company background below to reflect our recent rebranding.



For a while I've been meaning to put together a post discussing my experience using Scala successfully in a large commercial product.

First, a bit of background: the company I work for, Capital IQ, delivers fundamental and quantitative research and analytics software to the investment management, banking, corporate, and academic communities. Our flagship quant product, ClariFI, supports asset managers at every stage of the quantitative investment process: building and backtesting of factors, portfolio optimization, simulation of trading strategies, performance and risk attribution, overall data management (including organizing huge amounts of time series data pulled from a diverse set of raw sources), and lots more. In addition to ClariFI, though, we have started serving up some of our analytics to the Capital IQ web application.

How did we end up using Scala? About a year and a half ago we had a legitimate business case for rewriting—or at least substantially redesigning and refactoring—the core analytics backing our portfolio attribution (PA) workflow. I was slated to be the person doing this rewriting/redesign, and as my boss Scott was describing to me how the existing ModelStation PA worked, it became clear to me that a functional language was a natural fit for the domain. I'd been pushing for a while for us to try using languages other than Java—at least for some projects that were more isolated from the rest of our codebase—and this project seemed like the perfect opportunity. After doing some additional research we settled on Scala.

The decision to write in Scala implied this would be a true rewrite and that I would be totally unencumbered by the previous architecture. Rewrites have a bad reputation, but they aren't always a bad idea (more on that later). Our new Scala-backed PA exceeded the functionality of the old codebase, using about a third of the code of the old Java implementation, with comparable memory and speed footprints. The new codebase was more generic (components developed to support PA are now being reused elsewhere), more modular (the core PA engine itself is now being accessed by two completely different front ends), and much more testable. I made extensive use of ScalaCheck—an absolutely huge boon that dramatically cut down on the time and effort needed to find and fix bugs. In developing the new PA, Scott and I had conversations in which I would ask him for a property to specify a piece of code I needed to write, and then I would transcribe his description, fairly directly, into a ScalaCheck property. Originally, we were thinking of porting over Scott's old hand-made test cases from the previous PA, but we didn't end up doing this because our properties gave such good coverage. Having such an extensive set of tests also gave me a lot more confidence to tweak the design later, knowing I was not breaking anything in the process.
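To give a flavor of what those transcribed properties looked like, here's a minimal sketch in the style we used. The `normalize` function and the property are hypothetical stand-ins for illustration, not the actual PA code:

```scala
import org.scalacheck.{Gen, Properties}
import org.scalacheck.Prop.forAll

object PortfolioSpec extends Properties("Portfolio") {

  // Hypothetical example: rescale raw position weights so they sum to 1.
  def normalize(weights: List[Double]): List[Double] = {
    val total = weights.sum
    weights.map(_ / total)
  }

  // Generate non-empty lists of strictly positive weights.
  val positiveWeights: Gen[List[Double]] =
    Gen.nonEmptyListOf(Gen.choose(0.01, 100.0))

  // A near-direct transcription of a spoken spec:
  // "normalized weights should always sum to 1".
  property("normalized weights sum to 1") = forAll(positiveWeights) { ws =>
    math.abs(normalize(ws).sum - 1.0) < 1e-9
  }
}
```

ScalaCheck then hammers a property like this with hundreds of generated inputs, which is exactly what made it so effective at flushing out corner cases we would never have hand-written tests for.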

Part of the reason the code was so testable was that close to 100% of the Scala I wrote was purely functional, exceptions being the occasional use of local state whose effects did not escape function boundaries and "benign" uses of state that preserved referential transparency, like memoizing a function. Writing stateless code, especially within the analytics layer of our product, is extremely natural and doesn't require any contortions. And stateless code is very easy to test.
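As a small, contrived illustration of that kind of "benign" state (this is not lifted from our codebase): the mutable cache below is private to the closure, so from the caller's perspective the memoized function behaves exactly like the pure function it wraps.

```scala
import scala.collection.mutable

// Wrap a pure function in a cache. The mutable map never escapes the
// closure, so the observable behavior is unchanged and referential
// transparency is preserved. (Not thread-safe as written.)
def memoize[A, B](f: A => B): A => B = {
  val cache = mutable.Map.empty[A, B]
  a => cache.getOrElseUpdate(a, f(a))
}

val slowSquare: Int => Int = n => { Thread.sleep(100); n * n }
val fastSquare = memoize(slowSquare)

fastSquare(12) // computed once...
fastSquare(12) // ...then answered from the cache, same result either way
```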

Scala is an interesting language. It doesn't really impose any particular view on you of how to write software. Everything from higher-order functional code to low-level imperative programming is well-supported and moving between these paradigms is seamless. On the other hand, there's so much to the language, and the features aren't all totally orthogonal, that it sometimes feels a bit clunky to me. But it is without a doubt an exceedingly practical, powerful language. Let me highlight a couple aspects of Scala I think have been important for us:
  • Integration between Scala and Java is trivial: If you were so inclined you could write "Java in Scala" and have it compile to essentially the same classfiles that Java would compile to. You can set breakpoints in Scala, step into Scala code from Java, call any Scala method as if it were a Java method and vice versa, profile Scala code using the same tools you'd use for profiling Java code, etc. This level of integration was important for us since we were interfacing our Scala code with a large Java application.

  • Basic functional idioms: Without a doubt, there's simply a huge productivity boost that comes with having access to first-class functions, convenient syntax for anonymous functions, and the various list processing functions like map, filter, fold, zip, etc. Scala's Option type is extremely useful as well, as are for-comprehensions.

  • The type system: It's quite good. I'm still exploring its limits. In particular, though, Scala's generics plus implicit parameters enable powerful libraries like ScalaCheck and SBinary, both of which we have used extensively.

  • Optional non-strict evaluation: One of our projects involved running a whole series of memory-hungry calculations whose results were then encoded as XML and streamed across the network. We couldn't construct all the results up front, as that would consume too much memory, and we didn't want to rewrite the producer and/or consumer to have to tightly coordinate with the other. By constructing the results lazily via judicious use of Scala's by-name function parameters, the producer (the calculation) and the consumer (the XML writer) remained ignorant of one another, and memory usage was nearly the same as if the producer code and consumer code were explicitly interleaved. Not that this was news, but laziness really can improve modularity. (A simplified sketch of this pattern follows this list.)
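Here is the promised sketch of that last point, heavily simplified: the Result shape and the calculation are made up for illustration and the real code differed. The producer hands over a list of deferred computations rather than computed values, and the consumer forces them one at a time as it writes, so peak memory stays at roughly one result.

```scala
case class Result(name: String, values: Seq[Double]) // hypothetical result shape

// Stands in for one of the memory-hungry calculations.
def runCalc(name: String): Result =
  Result(name, Seq.fill(1000000)(scala.util.Random.nextDouble()))

// Producer: defer each calculation instead of running it up front.
// (A by-name parameter, `calc: => Result`, expresses the same deferral
// at a call site; explicit thunks are used here for clarity.)
def deferredResults(calcNames: List[String]): List[() => Result] =
  calcNames.map(name => () => runCalc(name))

// Consumer: force one result at a time, serialize it, and let it go.
// Neither side knows how the other is paced, yet memory usage is close
// to what hand-interleaving the two loops would give.
def writeXml(results: List[() => Result], out: java.io.Writer): Unit =
  results.foreach { deferred =>
    val r = deferred()
    out.write(s"<result name='${r.name}' n='${r.values.size}'/>\n")
  }
```

The thunks themselves are tiny (each captures only a name), so the list of deferred results costs almost nothing; the expensive values only ever exist one at a time.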
Some other things weren't as nice:
  • At the time of this first project, the only developed IDE support was the Eclipse plugin which was extremely buggy and limited in functionality. Workable, but not good. Since then the situation has improved considerably.

  • In Scala 2.7.1 (the version we used up until about 6 months ago), Scala's parameterized types did not show up as Java parameterized types, so Java callers saw only raw types. This was actually rather annoying to deal with, especially given that I made frequent use of parameterized classes and methods. It wasn't until Scala 2.7.4 that the generics support was bug-free enough for us to use. Even so, though, Java's lack of both type aliases and type inference makes dealing with generic types extremely painful.

  • Scala's type inference, while better than nothing, is quite limited. Method parameters always require type annotations. Higher-kinded type parameters are not inferred (although I've heard that there might be some form of inference of these parameters in 2.8). The inference algorithm seems unpredictable to me... my usual strategy is to start without type annotations, try compiling, fix errors by adding annotations, repeat (of course, I sometimes annotate types for clarity anyway). To be fair, the situation here is still much better than Java, but not nearly as nice as (say) Haskell.

  • Functions aren't curried by default, and I find working with curried functions kind of cumbersome in Scala. There are also some language warts that make it pretty much impossible to program in a point-free style (namely, the somewhat arbitrary distinction between methods, defined using def, and function values which implement one of the FunctionN traits). This might not be so terrible if you didn't have to fully annotate parameter types in method definitions, but you do, and it can get kind of annoying. Again, we are still much better off than we are in Java, but it's possible to do better (for instance, Haskell).

  • The standard library is kind of uninspiring and is missing some basic stuff. It also has some warts—for example, the function zip is defined for List but not for all sequences, there's no zipWith function, no scanl function, etc. Of course, you can write these functions yourself, and we did (a quick sketch of a couple of them follows this list). The collections library is also getting revamped for 2.8, and a lot of these things are getting cleaned up. There's also an interesting project, Scalaz, that we haven't gotten around to using in any production code, but it's a nicely structured library that fills in a lot of the gaps in Scala's standard lib. In general, I don't want to dwell on this issue too much, since every language's standard library has warts and Scala's is still certainly much saner than Java's where they overlap in functionality (sensing a theme here?).
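For what it's worth, rolling your own versions of those missing combinators is only a few lines. Here is a rough sketch of zipWith and scanl written against today's collections (newer standard libraries have scanLeft and the like built in, and our in-house versions differed in detail):

```scala
object SeqOps {

  // zipWith: combine two sequences element-by-element with f,
  // truncating to the length of the shorter one.
  def zipWith[A, B, C](as: Seq[A], bs: Seq[B])(f: (A, B) => C): Seq[C] =
    as.zip(bs).map { case (a, b) => f(a, b) }

  // scanl: like foldLeft, but keep every intermediate accumulator,
  // starting with the seed value.
  def scanl[A, B](as: Seq[A], z: B)(f: (B, A) => B): Seq[B] =
    as.foldLeft(Vector(z))((acc, a) => acc :+ f(acc.last, a))
}

// SeqOps.zipWith(Seq(1, 2, 3), Seq(10, 20, 30))(_ * _)  // Seq(10, 40, 90)
// SeqOps.scanl(Seq(1, 2, 3), 0)(_ + _)                  // Vector(0, 1, 3, 6)
```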
Overall I found these shortcomings to be more annoyances than showstoppers and I would certainly recommend Scala to anyone targeting the JVM. At our company, I know we will continue to expand our Scala usage: since shipping the Scala-backed PA, we've used Scala for one other major project in production and have several others either in progress now or in the pipeline. So we're pretty much hooked.

Takeaways


Typeful programming is a huge win, but it does take some getting used to. When I started on this first Scala project, I'd used Haskell for some toy projects but otherwise didn't have much experience with statically typed functional languages. There is a certain sort of brain-rewiring that occurs as you become proficient with the type system in a language like Scala—you reach a point where types become ingrained in how you think about and structure code, and the type system becomes something you work with, rather than against. Types are also an extremely concise way of communicating and documenting designs.

Another takeaway is that (no surprise), good design takes a while and is extremely hard, in particular getting the details right. Scott and I had a vague sketch of a design after a few hours of whiteboarding, but nailing down the details took much longer. There are always a lot of small design decisions beyond what you uncover whiteboarding. In general, I've become a lot less trusting of whatever initial ideas pop into my head when working on a design, and I prefer now, where possible, to give myself time to marinate on a problem. Several times I've had the experience of realizing something, months after the fact, which had I realized several months prior would have saved me a ton of work and made the code much simpler. This is very humbling, and it's why I think doing prototyping and giving yourself time to explore the problem domain often yields greater productivity in the long run.

Finally, I'd like to scrutinize the folk wisdom that rewrites are generally a bad idea. The argument goes that old, battle-tested code has accumulated scores of bugfixes, corner-case handling, and so forth, and a rewrite means throwing away all of that hard-won knowledge. Rebuilding this knowledge in the new codebase is often more time-consuming than it's worth.

The experience with rewriting the PA has led me to suspect that this argument is more applicable for code that is underspecified and difficult to test. When code is testable and well-specified, the cost of discovering bugs and handling corner cases drops substantially... to the point where they are no longer such a huge time sink. I think this myth arises because most code is not very testable—such code tends to result in bugs being discovered much later in the development process when the cost of investigating and fixing them is much higher. When considering a rewrite of some codebase that has been this costly and time-consuming to battle-harden, it's easy to assume you will have to incur the same costs when developing the new codebase. But this is only true if you keep using the same old, busted development strategies.

Rewrites aren't always good: if you are going to rewrite code using substantially the same design, using the same language and tools, just to "clean up the code", then you are just code polishing and are not really adding any value. But if you are redesigning the code with an eye for making future work much easier, then you are simply trading off some current throughput for an increase in long term productivity. And this is a completely sensible thing to do.

4 comments:

danielgpratt.com said...

First of all, interesting article, thanks. Of late, I'm very interested in cases of newer, functional languages being used in a commercial setting.

Were you committed to using a language that targets the JVM, or did you consider other platforms? I've been curious, for example, about how Scala stacks up against F#.

Paul Chiusano said...

@danielgpratt -

Running on the JVM was a near-requirement, I'd say, just because the code was going to have to interoperate pretty tightly with a Java app.

I can't really comment on Scala vs. F# - F# does seem like a nice language, but I don't have any experience with it. With us being a Java shop, it was not a realistic option. :-)

Jesper Nordenberg said...

Good article, I agree with pretty much everything you write (although I would classify Scala's type system as excellent and improving each day :) ). Rewriting, especially when you've written the original version yourself, is often a good investment. As you write, good design is hard, and it's not often that you succeed on the first attempt. Maintaining a code base which is well designed is a huge cost saver in the long term.

About your not-so-nice-things-about-Scala, pretty much all of them except currying (which I personally don't miss that much) are/will be resolved.

globe promo said...

Great article!
I was hesitant to learn Scala because, as far as I can tell, companies are not using Scala (yet) since it's such a new language. But this post helped me see how good Scala is, considering the tradeoffs and the gains you get when you port your code. So far, the only large, high-traffic site I know of that uses Scala is Twitter. I guess adoption of Scala by enterprise companies will definitely skyrocket in the future.