## Software complexity

How does the size of software scale with the number of features? Some folks assume that the scaling should be linear:

With his newfound power, he built his operating system on top of his own hardware from scratch in 12K SLOC, with a footprint of 200 kilobytes. For comparison, OSX runs in on ~86M SLOC with a footprint of 3 gigabytes, built by one of the wealthiest companies in the world. Now, perhaps OSX is more feature complete than Oberon, but certainly not by a factor of ~40 000X. Something was lost along the way.

Fredrik Holmqvist, Brooks, Wirth and Go.

This might be true if each feature were modular or independent, like adding a new separate application. In that case, adding a feature only requires adding new code to support that particular feature, without changing other code for preexisting features.

However, there are other ‘features’ which aren’t modular. For instance, consider adding the ‘security’ feature. Security is one of those attributes that can’t realistically be slapped onto an existing system — it may require a total rethinking of many core design choices. Now consider a feature list such as “multi-user, secure, multi-language, multi-processor, performant, …”. These are all attributes that a homebrew OS probably doesn’t have, but a mainstream OS needs.

The majority of your complexity budget is burned on these things

Hillel Wayne Reject Simplicity, Embrace Complexityinteracting.

Adding each of these ‘features’ might *multiply* the difficulty of the problem and the complexity of the solution, because each feature interacts with others. That would be an exponential scaling. Taking the ratio of the logs of the size of the two systems yields 1.78, which would imply that OSX should have 78% more ‘features’ than the homebrew OS. This is super hand-wavy and likely pessimistic. Maybe the scaling should be sub-exponential, because presumably not every feature interacts with every other one. The point is that OSX only looks bloated if you assume linear scaling, which is almost certainly too optimistic. It’s possible that most of the complexity is inherent complexity, not accidental complexity.

## Probability of a software project running late

In an interesting blog post (Why software projects take longer than you think: a statistical model), Erik Bernhardsson shows that to a decent approximation, the time to complete a software project formed a log-normal distribution around the estimated completion time. That is, the completion times behave as if you sampled from a normal distribution and then took the *exponential *of that number, and scaled the actual completion time by that random factor. This is somewhat surprising, for the following reason.

One way to estimate the time it takes for a project: breaking down work into chunks, estimate the time it takes to complete, and then sum up the times for the chunks. If you break the chunks small enough, the central limit theorem says that the error in estimating the overall completion time should be a normal distribution, regardless of the shape of the distribution of the errors for the individual chunks. So, it might come as a surprise that the error distribution is log-normal instead.

Bernhardsson notes the issue I raised in the previous paragraph in a footnote, but doesn’t speculate about the implications. It might be related to the exponential nature of the complexity of software as a function of the number of features. If that is correct, then the normal distribution inside the exponential would represent the number of features that the project needs to deliver.

At a lower level, software projects are more aptly thought of as tree structures of tasks and subtasks than lists of steps. Recall the parable of yak-shaving: there are many seemingly-irrelevant details that become critical as one zooms in on the reality of accomplishing even a mundane task. See also Reality has a surprising amount of detail. My thought is that the estimation error isn’t due to estimating the duration of the steps. Rather, it’s caused by adding or neglecting certain tasks (along with all their subtasks, if any). Examples of neglecting a major task could include an “unknown-unknown” technical problem that crops up during development, or failure to identify all the stakeholders and their requirements.

A project can be modeled as a random recursive tree, where each note is a task or subtask. The estimation task would be modeled as adding or dropping branches at random, to produce the estimated project structure from the actual project. We would like to know the probability distribution for the size of the actual project given the size of the estimated project, which also requires us to know the true distribution of project sizes. It might be easier to work the other way, by postulating a model for the process of going from the estimated project to the real project by adding branches with some probability.

It would be interesting to run some Monte Carlo simulations (for instance, using RandomTree.jl) to see what assumptions about the trees and the pruning/growing probabilities it would take to reproduce a lognormal-like distribution. It would also be interesting to get real-world data on the estimated and actual work breakdowns for comparison. Fields with less uncertainty in the project structure (residential construction?) might have something closer to a normal distribution for completion times.