My first experiment with neural networks

I finally tested the idea I proposed in my last blog post: penalizing the use of the activation function's nonlinearity, to see if a 'deep' neural network could be compacted into a shallower one. I decided to go with Keras because I heard that it was user-friendly. After reinstalling Anaconda to get the latest everything, and deleting the OpenMP dll files that are packaged with Keras b/c they collided with the base version, I was able to reproduce the Keras tutorial for using a convolutional neural network on the MNIST dataset. I first spent a few hours playing around with variations on the CNN and inspecting the weights & activations, to get a sense of things.

It didn’t take me as long as I anticipated to get to the point of testing my idea, because Keras already has the option to provide activity regularization functions — I didn’t have to go spelunking through the plumbing to set this up. I tried a handful of regularizations:

  • ReLU(x)
  • ReLU(-x)
  • ReLU(x+1)
  • ReLU(-(x+1))
  • ReLU(1-abs(x))

The idea here was to push the activations all to one side or the other of zero, to make the layers behave effectively linearly.

I combined these with a few different architectures, all dense & all followed by a softmax:

  • N x N x N x N, for N = 256 and 1024
  • N x N/2 x N/4 x N/8 for N=1024

I tried uniform weights for each regularization, but also some graded weights, generally with stronger weights on the layers deeper into the network.

I also tested some L1 & L2 norms and some of my ReLU-based forms as kernel regularizers.
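
For concreteness, here is a minimal sketch of one such configuration in Keras (illustrative, not the exact code from these runs): the ReLU(x+1) penalty applied as an activity regularizer with weight 1e-4, the Dense and Activation layers split so that the penalty sees the pre-activation values, and only two hidden layers shown for brevity.

```python
import tensorflow as tf
from tensorflow import keras

def relu_shift_penalty(weight=1e-4):
    """Activity-regularizer form of ReLU(x+1): pushes pre-activations below -1."""
    def penalty(x):
        return weight * tf.reduce_sum(tf.nn.relu(x + 1.0))
    return penalty

model = keras.Sequential([
    keras.Input(shape=(28, 28)),
    keras.layers.Flatten(),
    keras.layers.Dense(256, activity_regularizer=relu_shift_penalty()),
    keras.layers.Activation("relu"),
    keras.layers.Dense(256, activity_regularizer=relu_shift_penalty()),
    keras.layers.Activation("relu"),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```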

I found that the activity regularizers were successful in shifting the activation patterns, but the training/testing results were underwhelming. Interestingly, the kernel regularizations tended to promote bimodal activation distributions.

Here’s a plot of the fraction of activations above zero in a batch, taking a histogram of that ratio over the nodes in a layer. This was for a test with the ReLU(x+1) regularization with weight = 1e-4 for all (hidden) layers. As expected, the activations are skewed toward zero.

The validation error is poor:

Here’s the mirrored function ReLU(-(x+1)):

The error is even worse (not shown). At last, with the “wedge” ReLU(1-abs(x)):

Here’s the result from one of the kernel regularizations (L1 with weight = 1e-4):

The error:

Finally, the original CNN:

I didn’t do much in the way of parameter tweaking, but I’m not seeing much promise here, even with the graded-weight versions that I thought might be the most effective. Maybe some kind of scheduling would be worth trying, but I’m doubtful at this point.

Bivariate is all you need

Imagine trying to learn the statistical regularities of video data streams. If we had to search for regularities by considering the probability distribution over every possible subset of pixels & their value histories, learning would be prohibitively expensive. Obviously, one would first consider local subsets (adjacent pixels & short time histories). However, that still leaves an intractable number of variables. It also can't explain how we learn things that play out over extended periods of time or space. We need to assume that all multivariate correlations can be discovered by detecting bivariate correlations alone. This reduces the size of the search space from exponential in the number of variables to merely quadratic, while also reducing the amount of data necessary to accurately sample distributions of interest. Improvements over quadratic scaling could then come from locality biases (without requiring strict locality limitations).

How can we be sure that we will not miss a trivariate (or higher-arity) correlation? Consider the simplest case in which it is theoretically impossible to discover a trivariate correlation from its constituent bivariate correlations: the relation Z = XOR(X,Y), where X,Y are uniformly and independently distributed binary variables (think about a two-way light switch). In this special case, there are no bivariate traces of the trivariate dependence.

This is a very contrived scenario. Any deviation from an independent, uniform distribution for the joint variable (X,Y) results in non-zero bivariate correlations. We can thus detect the trivariate correlation via the presence of two bivariate ones that share a common variable, without directly having to search for trivariate correlations. In practice, the fact that our sampling frequencies won't in general be equal to the ideal probabilities ensures that such contrived situations are totally irrelevant. At a higher level, one can view this as a mechanism to benefit from distributional shifts, which tend to confound contemporary deep neural networks, much as dither noise is useful for improving the performance of analog-to-digital conversion.

For example, suppose that we happened to experience a run of X = [0,0,0,0,0] while Y = [1,0,1,0,1]. In this sampled distribution, Y and Z will be perfectly correlated. This would alert us to the presence of a correlation between them. If we then experienced the same series of values for Y while having a run of 1’s for X, we could see that the correlation between Y and Z was reversed. We might thus hypothesize two cases: (1) when X = 0, Y = Z, (2) when X=1, Y = NOT(Z), which correctly explains the XOR.
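
Here is the same scenario as a few lines of code (only the runs described above, nothing more):

```python
import numpy as np

# While X happens to be stuck at one value, the *sampled* correlation between
# Y and Z = XOR(X, Y) is perfect, even though the ideal (infinite-sample)
# bivariate correlation is zero.
Y = np.array([1, 0, 1, 0, 1])

X = np.zeros(5, dtype=int)          # a run of X = 0
Z = X ^ Y
print(np.corrcoef(Y, Z)[0, 1])      # +1.0: Y and Z appear perfectly correlated

X = np.ones(5, dtype=int)           # same Y values, but a run of X = 1
Z = X ^ Y
print(np.corrcoef(Y, Z)[0, 1])      # -1.0: the apparent correlation flips sign
```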

Well, you also need experiments

If we have a fairly low threshold for detection of correlations, then we’re bound to get lots of false positives. (That is to say, the clustering illusion and apophenia may be natural consequences of this strategy. It’s a good sign when your model has the same failure modes as the human brain.) We don’t necessarily have our entire history of observations available, so we will generally need to take fresh data. Anecdotally, if I’m not able to test out a hypothesis quickly, I tend to forget about it. It’s mentally draining to juggle multiple competing hypotheses. We have limited capacity to remain uncertain about concepts, so we need to settle questions promptly.

This fresh data doesn't necessarily need to come from experiments, for the following reason. Even given the same data, different learners may develop different hypothesized correlations. Therefore, curated data that bears on the hypothesis the learner is currently entertaining will be more valuable than random data. Of course, by allowing experiments, we also gain the ability to draw causal conclusions, beyond merely correlational ones.

Why is causality important? It comes back to distributional shifts, or non-stationary distributions. Causal models have a structure like P(Z|X,Y)*P(X,Y), where it is asserted that the distribution P(X,Y) may be changed, but the conditional P(Z|X,Y) is invariant. If a phenomenological model learns a representation like P(X,Y|Z)*P(Z), any change to P(X,Y) may necessitate re-learning the entire joint distribution P(X,Y,Z), whereas a causal model would keep the invariant conditional & only need to relearn the smaller distribution P(X,Y). When applied to situations with large causal networks, this results in catastrophic forgetting for phenomenological models whenever the independent variables have non-stationary distributions.
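
A quick simulation makes the invariance concrete, reusing the XOR example from above; the specific probabilities are arbitrary choices for illustration.

```python
import numpy as np

# Shift the input distribution P(X, Y) and check which pieces of each
# factorization change.
rng = np.random.default_rng(0)

def summarize(p_x1, p_y1=0.3, n=200_000):
    X = rng.random(n) < p_x1
    Y = rng.random(n) < p_y1
    Z = X ^ Y
    p_z_given_00 = Z[~X & ~Y].mean()   # causal conditional P(Z=1 | X=0, Y=0)
    p_z1 = Z.mean()                    # marginal P(Z=1)
    p_x_given_z1 = X[Z].mean()         # anticausal conditional P(X=1 | Z=1)
    return p_z_given_00, p_z1, p_x_given_z1

for p_x1 in (0.5, 0.9):                # a distributional shift in P(X)
    print(p_x1, [round(v, 2) for v in summarize(p_x1)])
# P(Z|X,Y) stays fixed under the shift, while P(Z) and P(X|Z) both move,
# so the anticausal factorization has to relearn both of its factors.
```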

Acknowledgements

In a very intriguing article, "Why does deep and cheap learning work so well?", Lin & Tegmark suggest that neural networks are effective at learning because their structures mirror the structure of physical laws in multiple ways. In particular, they remark that the highest-arity polynomials in the Standard Model Hamiltonian of particle physics are quartic (N=4). I don't think this part of their argument holds much water, but undoubtedly the existence of structure to detect in the first place arises from physical laws.

Another inspiration was the integrated-information theory (IIT) of consciousness. While I wholeheartedly agree with Scott Aaronson's dismantling of the proponents' claims to have defined consciousness mathematically, I think they are groping towards a way to define non-trivial information processing (computation) as distinct from mere transmission (communication). It comes down to arity: communication is about N=2, but computation is about N>2. Arguably the best paper so far mathematizing this idea is "Quantifying Unique Information" by Bertschinger et al. However, the same idea came up many years ago in the context of probabilistic databases, and also in information geometry.

A third inspiration was Greg Ver Steeg’s discussion of the “grue problem” and articles about explaining total correlations. Ver Steeg’s approach is interesting because it starts with latent variables, rather than directly comparing raw data variables to one another, which is arguably closer to what we know about the brain’s structure than what I’m discussing.

This ties into an interesting remark from Fredrik Linåker’s dissertation: there’s a chicken & egg problem if you think about learning as trying to compress sensory information. The learner has finite memory, so the compression needs to happen first, but without the benefit of hindsight, it’s not possible to know what is relevant for later learning & thus needs to be represented and stored. In particular, Linåker observes that lossy compression amounts to memorizing the common cases and ignoring outliers, while assuming that the data distribution is stationary. However, the objective of lossy compression is to minimize the amount of information it takes to represent the world given a certain degree of acceptable error. This is not the right objective for learning, although reducing the size of the representation is part of both. The difference is that lossy compression is essentially driven to represent the data primarily in terms of the univariate marginal distributions. For learning, this is less important, and the structure (ie, information, correlations) that exists at higher arity comes to the foreground.

I found out after drafting this post that Yoshua Bengio has been thinking along much the same lines. Also, transformer networks seem to embody similar ideas about sparse detection of correlations.

Compacting deep neural networks

TL;DR: it may be important to ‘compact’ deep neural networks by pushing the nontrivial information processing to occur at the lowest possible layers. Perhaps a regularization penalty could implement this during training.

Motivation

An obvious next step for AI development is to combine various narrow AI modules (eg, image classifiers, large language models, …) and other tools (programming language interpreters, databases, etc) into something resembling the “global neuronal workspace.” Allowing division of labor would result in powerful synergies. Each component can contribute its strengths while the others compensate for its weaknesses. The success of transformers is something of a step in this direction. (Cf. Yoshua Bengio’s recent talk.)

One aspect of the brain that won't be possible to mimic initially: the ability to offload conscious processing to pre-conscious processing. This ability allows our conscious attention to rise to a higher level, because the pre-conscious processing provides more powerful primitives that reduce the complexity of the conscious portion. However, this process (you might call it cognition-supervised learning) requires on-line learning, which is presently an unsolved problem.

Compaction to the rescue?

If an on-line learning network needs to learn new, higher-level functionality, the functionality it is currently implementing may need to be compacted. I can imagine this happening either laterally or vertically. I’m focusing on the idea of vertical compression — perhaps not relevant to the brain, but arguably more relevant to current DNNs that have a large number of layers.

I propose an approach using a “push-down” or compaction effect. Suppose a handful of signals are passing effectively unprocessed from the input layer to some higher layer, where they are then combined. Short-circuiting this processing by forcing it down to a lower layer & simply sending up a direct signal representing the result to the higher layers would free up the higher layers for more abstract processing. Think about it as a way to refactor a DNN: changing the way the computation is achieved without changing the function that is being computed. The presence of the combined result at a lower layer could also enable it to be combined further with other signals at that layer, whereas this would not have been possible (at least in a pure feed-forward network) if the result appeared only at the highest layer.

As for how to encourage compaction: imagine the inbound connections converging on a node in the network as having ‘tension’ proportional to the amount of information flowing through them. Then the ‘energy’ of the node would depend on the number and length of the inbound connections, weighted by importance. A configuration where the same computation was carried out lower in the network would have reduced energy. If the network were allowed to ‘relax’ (minimize energy), the tension would pull the computation down to lower layers. There must be a countervailing invariant that prevents the network from minimizing the energy in a trivial way (dropping all connections, for instance). Presumably this would be performance on the training data. I’m imagining a penalty function based on the energy that is added to the training loss prior to back-propagation. Success might require balancing the strength of the penalty carefully. Successful application would result in the higher layers of a DNN being reducible to identity or linear maps (under the assumption that the task was sufficiently simple and the DNN more than sufficiently deep).

The penalty must distinguish trivial vs nontrivial dependence. This is a hard problem in general, but for a DNN layer it is easier. A stack of multiple layers that each behave as a simple linear transformation can be 'squashed' down to a single linear transformation layer. Thus, we want to measure the extent to which the nonlinearity is exploited by a node, in an information-theoretic sense. Consider a ReLU whose input is always above, or always below, zero. Such a node is clearly not making use of the nonlinearity. We would like a function that is zero in those extremes, and maximized somewhere in between. For simplicity we might choose Q = P(X>0)*P(X<0) where X is the input to the nonlinearity. (We don't want the function to be scale-sensitive, so it should only be a function of the probabilities, not involving any moments of X itself.)

The overall penalty function could look like Sum_N( N^g * Sum_M(Q_M)) where N runs over layers and M runs over nodes. Minimizing this function would result in:

  • Moving the nonlinearity exploitation (ie, nontrivial computation) downward, due to the weighting by N^g.
  • Reducing the overall use of the nonlinearity.

Using large values of the exponent g might help maintain good performance, because it would reduce the tendency for the penalty function to eliminate useful computations outright, instead encouraging them to migrate to lower layers.
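
As a rough sketch of how this penalty might be computed (untested; batch-frequency estimates stand in for the true probabilities, and `preactivations` is assumed to be a list of per-layer pre-activation tensors):

```python
import tensorflow as tf

# Sum_N( N^g * Sum_M(Q_M) ) with Q = P(X>0)*P(X<0) estimated from the batch.
# `preactivations` is assumed to be a list of (batch, nodes) tensors, one per
# hidden layer, ordered from the lowest layer up.
def compaction_penalty(preactivations, g=2.0):
    total = tf.constant(0.0)
    for n, x in enumerate(preactivations, start=1):
        p_pos = tf.reduce_mean(tf.cast(x > 0.0, tf.float32), axis=0)  # per node
        p_neg = tf.reduce_mean(tf.cast(x < 0.0, tf.float32), axis=0)
        q = p_pos * p_neg          # zero if a node's input never changes sign
        total += (float(n) ** g) * tf.reduce_sum(q)
    return total

# Note: the hard indicator (cast of a comparison) has no gradient, so a real
# implementation would need a smooth surrogate for the probabilities,
# e.g. sigmoid(x / temperature).
```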

I'm not sure how well this will work in practice. The penalty will have to use the sample estimate of those probabilities from whatever batches are used for training, which might cause problems when the probabilities are small but nonzero. I've been considering individual nodes, but neural networks generally involve distributed representations. The 'tension' might disrupt these representations. On the other hand, it may nudge networks toward disentangled representations. Speaking of tangles: there may be topological issues as information streams attempt to cross through each other. On the other hand, maybe this 3D intuition doesn't apply in the high-dimensional space of neural network configurations.

An interesting place to start would be to look at the distribution of nonlinearity-exploitation in existing trained DNNs, and perhaps the evolution of the penalty function value during training. Retraining an existing model with varying weight applied to the penalty function & looking at the impact on the learning rate & penalty evolution would be the obvious next step. Application of the penalty may increase performance. It may be interesting to gradually reduce the weight of the penalty function, to encourage formation of higher-level representations over training time.

PS: “Explaining and harnessing adversarial examples” by Goodfellow et al argues that the susceptibility of NNs to adversarial examples is caused by their linearity. Perhaps the regularization will make this worse. It might be necessary to reward nonlinearity instead of penalizing it, if robustness is the goal.

Revisiting the theta-pinch fusion reactor concept

When it comes to concepts for nuclear fusion reactors, it seems like all the good ideas were had decades ago. Many concepts dating back to the 1950s-1970s are still being pursued: Z-pinch, mirror, tokamak, spheromak, FRC, stellarator. I've been wondering if there are any others that were overlooked but might be worth revisiting in light of modern-day technology, especially REBCO superconductors that allow for astonishingly high magnetic fields.

Here’s one that hasn’t been revived yet: the (linear) theta-pinch. It’s about the simplest magnetic confinement fusion (MCF) concept you can imagine: a solenoid. The reactor consists of a cylindrical magnetic coil, with a uniform magnetic field that runs along the axis. That’s it. The only other MCF concept this simple is the Z-pinch, which is probably why these were among the first concepts investigated. (The ‘pinch’ appellation comes from the idea that these would be pulsed devices, heating the plasma by squeezing it with rapidly-rising magnetic fields.)

The main attraction of the linear theta pinch is the geometric simplicity of the design. There are several reasons that this matters for the reactor economics.

  • It eases the design of the coils compared to toroidal devices such as the tokamak or stellarator.
  • The peak field on the coils can be closer to the field experienced by the plasma, which enables better utilization of the magnets.
  • The plasma-facing components are not topologically trapped by the coils: they can be removed axially, without disassembling the magnetic coils. This is critical for enabling rapid replacement of the plasma-facing components.

To amplify the final bullet point: fusion reactors are very hard on the plasma-facing components, due to the combination of large steady-state heat loads, even larger transient heat loads, and large fluxes of high-energy neutrons. It will be necessary to replace the plasma-facing components frequently, perhaps multiple times per year:

Bob Mumgaard, CEO of Commonwealth Fusion Systems, regards neutron flux as part of a fusion power plant’s wear-and-tear—a routine aspect of its servicing cycle, happening perhaps once or twice a year.

https://nautil.us/issue/86/energy/the-road-less-traveled-to-fusion-energy

The trick to making a fusion reactor economical is to minimize the time it takes to replace the plasma-facing components. As with fission plants, the primary driver of the levelized cost of electricity will be financing charges on the capital expense of the plant. This implies that a high capacity factor is favored. In addition, due to radioactive decay of tritium, it may be necessary to keep a fusion reactor in operation >90% of the time (because the reactor must create fresh tritium from lithium during operation to balance decay losses). Up-time is therefore an existential problem for fusion reactors. If the plasma-facing components are to be replaced twice per year, each operation must be accomplished in about 18 days.

Toroidal devices (ie, tokamaks and stellarators) are at a disadvantage because their plasma-facing components are topologically trapped inside the rib-cage formed by the toroidal field coils. In contrast, the entire vacuum vessel and first wall of a linear theta pinch could be free-standing inside the solenoid, accessible through the ends of the reactor. There would be two necessary penetrations through the magnet cryostat. First, the blanket coolant circulation would need to enter/exit between magnet coils. Second, any fueling (probably pellet injectors) would also require penetrations. Initial heating to ignition might also be done through radial penetrations, but probably axial injection would be more appropriate.

There are two other nice features of the linear theta-pinch, in addition to engineering simplicity. First, in principle, it should be MHD stable. The magnetic field lines are either straight (in the central section) or curved outward (at the ends), which means that the system shouldn't suffer the MHD instabilities that plague the mirror or Z-pinch concepts. In a sense, it's a fairly low-risk concept, because it should be easy to simulate accurately, so there shouldn't be too many surprises on the performance side. Second, there's no requirement to drive plasma current, so in principle it's possible to reach true ignition, unlike the tokamak, but similar to the stellarator.

There's one major problem with the linear theta-pinch: the axial confinement is atrocious. While the magnetic field is effective at preventing the plasma from spreading out radially to the walls, it can't prevent motion along the length of the cylinder at all. That places the axial confinement squarely in the inertial regime — the only thing stopping the plasma from escaping is the plasma itself taking a finite amount of time to accelerate & travel out the ends. The trick with inertial confinement is to make the plasma density very high, so that the plasma is tripping over itself as it tries to escape. The other option is to make the plasma very long. The relation in this case is that the length of the plasma goes like the temperature squared over the density: L ~ T²/n.

The plasma needs to reach a temperature of about 5 keV at a bare minimum to ignite. The product of the density and the temperature is the pressure, and the pressure cannot be higher than the energy density of the confining magnetic field. So at fixed temperature, the length of the plasma goes like L ~ 1/B², because the pressure P ~ nT ~ B², so n ~ B². What's the constant of proportionality? Fortunately, this has been calculated for us assuming 10 keV temperature and 100% utilization of the magnetic pressure: L = (2×10⁶ [T²·m])/B² ("Scaling laws for the linear theta pinch: a comparison of magnetic and laser heating", Ellis & Sawyer, LANL Tech. Report, 1973). That is, if the magnetic field is 100 T, then the reactor 'only' needs to be 200 meters long, but if you use a more reasonable 20 T field, then the reactor would be 5 km. Yikes. According to Huub Weijers, who designed the world-record superconducting magnet at 32 T, "I don't know what that limit is, but it's beyond 100 teslas. The required materials [REBCO] exist. It's just technology and dollars that are between us and 100 teslas." So let's imagine we can go to 100 T.
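
Plugging numbers into that scaling:

```python
# L = (2e6 [T^2*m]) / B^2
for B in (100, 20):
    print(f"B = {B:>3} T  ->  L = {2e6 / B**2:,.0f} m")
# 100 T -> 200 m; 20 T -> 5,000 m
```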

How does the cost work out? The basic scaling for the amount of tape needed is C ~ R·L·B (neglecting the reduction in critical current density with field, which is fairly weak for REBCO), whereas the fusion power scales like W ~ r²·L·B⁴ — but then remember that L ~ 1/B², so this boils down to W ~ r²·B². (Lower-case 'r' is the plasma radius, uppercase 'R' is the coil radius, which is larger to accommodate the 'blanket' that absorbs the neutrons to generate heat and produce fresh tritium fuel from fissioning lithium.) So, the total power dependence on B is quadratic, while the cost of the tape goes like C ~ 1/B, so going to higher field is beneficial for the economics. To put some numbers on it, suppose that Bob Mumgaard's optimistic forecast works out, and HTS tape comes down to $10/(kA·m). Suppose the coil radius is R = 2 m (it can't be any less than 1 m b/c of the blanket constraints). The REBCO tape costs alone would be $2B for a 100 T reactor. That's pretty steep, and it's about the minimum, because the only way to reduce the cost is to go to higher field. Also, a 200 m reactor length is pretty unwieldy.
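
A back-of-envelope check of that $2B figure, treating the coil as an ideal long solenoid (B = μ0·n·I) and ignoring winding-pack structure, critical-current derating, and everything else:

```python
import math

MU0 = 4 * math.pi * 1e-7           # T*m/A
DOLLARS_PER_KA_M = 10.0            # optimistic HTS tape price, $/(kA*m)

def tape_cost(B, coil_radius, length):
    """Conductor cost for an ideal solenoid: B = mu0 * (amp-turns per meter)."""
    amp_turns_per_m = B / MU0                          # A per meter of length
    ka_m_per_m = (amp_turns_per_m / 1e3) * 2 * math.pi * coil_radius
    return ka_m_per_m * length * DOLLARS_PER_KA_M      # dollars

print(f"${tape_cost(B=100, coil_radius=2.0, length=200):.2e}")   # ~ $2e9
```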

A test device could be built for much cheaper, by leaving out the blanket to reduce R, and starting with a shorter, lower-temperature experiment. If the minimum R is 0.25 m, and L is 20 m, the tape cost would be around $25 million. Because of the L ~ T²/n scaling, and with the same magnetic pressure (hence the same P ~ n·T product), L ~ T³ for achieving the same triple product (albeit not at the correct temperature to actually get fusion): the temperature would need to be around 4.5 keV, scaled down from 10 keV. That would be close enough to have a pretty good idea whether the physics would work at the full-scale reactor. The next phase would be to extend the test device incrementally until the equivalent ignition condition was met. Developing the initial 100 T coils would come first — that would involve beating the current world record for superconducting magnets at 32 T by 3x, the record for steady-state magnets at 45 T by more than 2x, and rivaling the highest non-destructive pulsed magnetic field at 100 T. (See Maglab's world records.) Quite an undertaking, given that the magnetic pressure is about 4 GPa, which is higher than the ultimate tensile strength of Kevlar (3.6 GPa) and high-strength steels.

How much fusion power can you buy with $2B in REBCO tape using a linear theta-pinch? The fusion power density is roughly 0.34 MW/m³ per bar², and the magnetic pressure is P ~ B²·(4 [bar/T²]), so the fusion power density at 100 T is about 14 GW/m³ at max. Even if you sandbag that to account for running below the optimal temperature, it's clear that there's an upper limit on the radius of the plasma column, determined by getting too much power per unit length. Let's suppose that between profile effects, fuel dilution, and running at lower temperature, the power density averaged over the plasma cross-section is "only" 1.4 GW/m³. A 10 cm radius plasma column would then crank out 43 MW per meter of length. Let's say that the vacuum vessel is 50 cm in radius; then the total fusion power wall loading is 13 MW/m², which isn't too bad. The moral of the story is that you can get as much power as you can manage. If the utilization factor in the axial direction is 25%, then the total fusion power would be 2 GW. That's definitely not enough, because I'm trying to get the cost of the reactor down to $1/W. If you raise the plasma radius to 30 cm, you'd get 18 GW, and assuming 30% net efficiency, that's 6 GW electric power, which is better, because maybe 1/3 of the cost of the reactor would be in the magnets while still reaching $1/W. Unfortunately, the total wall loading is now over 100 MW/m², which is terrifying.
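
The per-meter power and wall-loading arithmetic, taking the assumed 1.4 GW/m³ average power density as given:

```python
import math

POWER_DENSITY = 1.4e9        # W/m^3, the sandbagged cross-section average above

def linear_power_and_wall_load(plasma_radius, wall_radius):
    power_per_m = math.pi * plasma_radius**2 * POWER_DENSITY   # W per meter
    wall_load = power_per_m / (2 * math.pi * wall_radius)      # W/m^2
    return power_per_m, wall_load

for r in (0.10, 0.30):
    p, w = linear_power_and_wall_load(r, wall_radius=0.5)
    print(f"r = {r:.2f} m: {p/1e6:.0f} MW/m, wall loading {w/1e6:.0f} MW/m^2")
# ~44 MW/m and ~14 MW/m^2 at 10 cm; ~400 MW/m and ~125 MW/m^2 at 30 cm,
# in line with the figures quoted above.
```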

Speaking of heating: the likely methods would be to heat the plasma initially using electron beams (ex: GOL-3 experiment) or laser beams fired in from the ends of the machine. The necessary power/energy requirements would be quite large. The stored energy in the plasma is E = π·r²·L·P, so for r = 0.3 m, L = 200 m, and P = 4×10⁴ bar, that works out to over 200 GJ. That much pulsed-power energy storage would doom the economics. Perhaps it would be possible to ignite a smaller-radius core plasma and allow the burning region to expand to fill the larger radius. According to this article ("Particle and Heat Transport in a Dense Wall-Confined MTF Plasma (Theory and Simulations)" by D.D. Ryutov et al, Nuclear Fusion, 2003), the minimum product r·B > 1 T·m is needed for strong self-heating by alpha particles, which would imply 1 cm is the lower limit. Starting with a 1 cm core would reduce the pulsed-power energy requirements by a factor of 900, to something like 250 MJ. This is still excessive, though. This energy needs to be delivered on the scale of the plasma confinement time, and probably the coupling efficiency will be ~30%. So it's still ~1 GJ, which might cost something like $2/J (just for the capacitors), so that would be comparable to the cost of the magnets (around $2B).
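
The stored-energy arithmetic:

```python
import math

P = 4e4 * 1e5                       # 4e4 bar -> Pa (100 T magnetic pressure)
E = math.pi * 0.3**2 * 200 * P      # E = pi * r^2 * L * P, r = 0.3 m, L = 200 m
print(f"{E/1e9:.0f} GJ")            # ~226 GJ: "over 200 GJ"
print(f"{E/900/1e6:.0f} MJ")        # a 1 cm core: (1/30)^2 = 1/900 -> ~250 MJ
```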

There are a couple tweaks that you might imagine to improve the confinement & reduce the length of the reactor. Running at reduced temperature helps, up to a point, because the L ~ T²/n scaling plus the magnetic pressure balance P ~ nT conspire to reduce the length as T³, at least near 15 keV where the Lawson product is approximately constant. At lower temperatures, the required Lawson product starts to go up dramatically, so there's an optimal temperature somewhere between around 5 keV (the minimum ignition temperature due to x-ray radiation losses) and 15 keV. It's not clear from my reading if 10 keV is the optimum — perhaps a small improvement could be found by going lower, to 7 keV or so.

Separately, it might be beneficial to introduce cold gas along the length of the device, to reduce the temperature. That is, instead of balancing the plasma pressure with acceleration, it could be balanced by collisions, with the temperature progressing from 10 keV to essentially zero by the ends of the machine. In that case, it's the electron thermal conductivity that sets the energy confinement time. One study found an increase of 3x in the energy confinement by introducing solid end-plugs ("Energy- and Particle-Confinement Properties of an End-Plugged, Linear, Theta Pinch", R. J. Commisso et al, Physical Review Letters, 1979). Another noted that the electron energy conduction also decreases with increasing charge of the ion species, so injecting xenon, for instance, might reduce the conduction losses and result in a 3-5x shorter reactor ("End Pluggings of a Linear Theta Pinch by Hot High Z Plasmas of High Mass Number", Hirano & Wakatani, 1976). I have my doubts about whether this would actually help, especially for steady-state, because the xenon will diffuse into the burning plasma region & result in increased x-ray losses.

Hirano & Wakatani project that a breakeven experiment for a pulsed reactor could be done at 30 m long with 30 T fields. That would be significantly more attractive. A 32 T all-superconducting magnet has already been built. Still, if the magnet cost per unit length for the system only scales down linearly in B, then the required linear power density for good economics also scales the same way. However, because the magnetic field is now 3x lower, the power density is 81x lower, so the plasma radius would need to grow from 30 cm to about 1.5 m. We'd then also need to increase the first-wall radius from 50 cm to 1.5 m or more, driving up the magnet coil radius to at least 2.5 m and cutting into the economics slightly. OTOH, the larger first-wall radius helps with the wall loading, which would be roughly 10x lower, a reasonable 10 MW/m². Interestingly, an equivalent toroidal device would have R = 4.7 m, aspect ratio ~ 3. The field would be higher than required by a tokamak for breakeven, and the power density might be comparable. Perhaps my estimation of the power density is too pessimistic; I'd expect the power density to be higher due to the improved plasma pressure (beta) and higher field.

At the end of the day, the linear theta pinch is probably not worth resurrecting. It’s not likely to be any less expensive than a tokamak, and the physics is less established. In particular, the theoretical minimum pulsed-power requirements are excessive, and realistic requirements might be completely prohibitive. Finally, the geometric simplicity advantages for replacement of first-wall components may not pan out, due to the necessity of coolant plumbing. It’s been fun to think about, but not so practical.

Software scalings

Software complexity

How does the size of software scale with the number of features? Some folks assume that the scaling should be linear:

With his newfound power, he built his operating system on top of his own hardware from scratch in 12K SLOC, with a footprint of 200 kilobytes. For comparison, OSX runs in on ~86M SLOC with a footprint of 3 gigabytes, built by one of the wealthiest companies in the world. Now, perhaps OSX is more feature complete than Oberon, but certainly not by a factor of ~40 000X. Something was lost along the way.

Fredrik Holmqvist, Brooks, Wirth and Go.

This might be true if each feature were modular or independent, like adding a new separate application. In that case, adding a feature only requires adding new code to support that particular feature, without changing other code for preexisting features.

However, there are other ‘features’ which aren’t modular. For instance, consider adding the ‘security’ feature. Security is one of those attributes that can’t realistically be slapped onto an existing system — it may require a total rethinking of many core design choices. Now consider a feature list such as “multi-user, secure, multi-language, multi-processor, performant, …”. These are all attributes that a homebrew OS probably doesn’t have, but a mainstream OS needs.

The majority of your complexity budget is burned on these things interacting.

Hillel Wayne, Reject Simplicity, Embrace Complexity

Adding each of these ‘features’ might multiply the difficulty of the problem and the complexity of the solution, because each feature interacts with others. That would be an exponential scaling. Taking the ratio of the logs of the size of the two systems yields 1.78, which would imply that OSX should have 78% more ‘features’ than the homebrew OS. This is super hand-wavy and likely pessimistic. Maybe the scaling should be sub-exponential, because presumably not every feature interacts with every other one. The point is that OSX only looks bloated if you assume linear scaling, which is almost certainly too optimistic. It’s possible that most of the complexity is inherent complexity, not accidental complexity.
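
Assuming the "sizes" compared are the memory footprints (3 GB vs. 200 KB), the arithmetic works out like this:

```python
import math

osx_footprint = 3e9       # bytes (~3 GB)
oberon_footprint = 200e3  # bytes (~200 KB)

# If complexity ~ exp(k * features), then features ~ log(size)/k,
# so the ratio of feature counts is the ratio of the logs.
print(round(math.log(osx_footprint) / math.log(oberon_footprint), 2))  # ~1.79
```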

Probability of a software project running late

In an interesting blog post (Why software projects take longer than you think: a statistical model), Erik Bernhardsson shows that, to a decent approximation, the time to complete a software project follows a log-normal distribution around the estimated completion time. That is, the completion times behave as if you sampled a number from a normal distribution, took the exponential of that number, and scaled the estimated completion time by that random factor. This is somewhat surprising, for the following reason.

One way to estimate the time it takes for a project: break the work down into chunks, estimate the time each chunk takes to complete, and then sum up the times for the chunks. If you break the chunks down small enough, the central limit theorem says that the error in estimating the overall completion time should be approximately normally distributed, regardless of the shape of the distribution of the errors for the individual chunks. So, it might come as a surprise that the error distribution is log-normal instead.

Bernhardsson notes the issue I raised in the previous paragraph in a footnote, but doesn’t speculate about the implications. It might be related to the exponential nature of the complexity of software as a function of the number of features. If that is correct, then the normal distribution inside the exponential would represent the number of features that the project needs to deliver.

At a lower level, software projects are more aptly thought of as tree structures of tasks and subtasks than lists of steps. Recall the parable of yak-shaving: there are many seemingly-irrelevant details that become critical as one zooms in on the reality of accomplishing even a mundane task. See also Reality has a surprising amount of detail. My thought is that the estimation error isn't primarily due to misestimating the durations of the individual steps. Rather, it's caused by adding or neglecting whole tasks (along with all their subtasks, if any). Examples of neglecting a major task could include an "unknown-unknown" technical problem that crops up during development, or failure to identify all the stakeholders and their requirements.

A project can be modeled as a random recursive tree, where each node is a task or subtask. The estimation process would be modeled as adding or dropping branches at random, to produce the estimated project structure from the actual project. We would like to know the probability distribution for the size of the actual project given the size of the estimated project, which also requires us to know the true distribution of project sizes. It might be easier to work the other way, by postulating a model for the process of going from the estimated project to the real project by adding branches with some probability.

It would be interesting to run some Monte Carlo simulations (for instance, using RandomTree.jl) to see what assumptions about the trees and the pruning/growing probabilities it would take to reproduce a lognormal-like distribution. It would also be interesting to get real-world data on the estimated and actual work breakdowns for comparison. Fields with less uncertainty in the project structure (residential construction?) might have something closer to a normal distribution for completion times.
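
A bare-bones version of that simulation (in Python rather than RandomTree.jl; the branching factor, depth, and miss probability are all arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(42)

def project_size(depth, branch_mean=2.0, p_miss=0.3):
    """Return (actual, estimated) task counts for one random task tree.

    The 'estimate' overlooks each subtree independently with probability
    p_miss, which is the simplest version of the pruning idea above.
    """
    actual, estimated = 1, 1
    if depth == 0:
        return actual, estimated
    for _ in range(rng.poisson(branch_mean)):
        a, e = project_size(depth - 1, branch_mean, p_miss)
        actual += a
        if rng.random() > p_miss:      # the estimator noticed this subtree
            estimated += e
    return actual, estimated

blowups = np.array([np.divide(*project_size(depth=5)) for _ in range(10_000)])
log_blowups = np.log(blowups)
print(f"log blow-up: mean {log_blowups.mean():.2f}, std {log_blowups.std():.2f}")
# Histogramming log_blowups shows how close this simple pruning model gets
# to the lognormal blow-up factor observed by Bernhardsson.
```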

Is public or private funding better for research and innovation?

I’ve collected a pile of quotes on risk-tolerance and how best to fund breakthroughs. There’s a dominant theme: big-government sponsorship is inimical to the truly-novel research that is necessary to generate breakthroughs. The argument goes like this:

Progress in research entails doing truly novel things, which means there are risks. Conversely, this implies that low-risk activities won't yield major breakthroughs. Structural incentives for federal bureaucrats, for grant-approving committees, and hence for individual scientists ensure that low-risk, short-sighted activities are favored. Therefore, the progress of publicly-funded research is hobbled. In the private sector, the inefficient are winnowed out by natural selection. In contrast, the government has an assured stream of funding, so ineffective processes are not culled.

The counterpoints I’ve found are mostly to the effect that private funding isn’t necessarily immune to perverse incentives, nor does it necessarily have better ways of selecting winners (in fact, private-sector due-diligence is probably scientifically less rigorous). The upshot seems to be that private funding is the worst kind — except for all the others.

The following two sections are my thoughts; skip them if you want to go straight to the quotes.

Should we give up on public funding?

Writing off public-sector funding of science & research is a poor strategy for the long term. The government is the natural vehicle for socializing the inherent riskiness of scientific research, because:

  • It can most effectively “average out” the fluctuations in success (ie, it can afford a larger portfolio & longer time-scale than private firms)
  • The benefits of scientific research accrue to the whole of society, so it makes sense for society as a whole to sponsor it, rather than relying on the enlightened self-interest of corporations, which may result in a suboptimal expenditure (although perhaps over-eager investors are subsidizing the common good)

I question whether risk-aversion is inherent within the government. There are counterexamples (the Manhattan & Apollo projects, DARPA, …), but interestingly they arise primarily within the context of wartime (WW2 and the Cold War). My hypothesis is that a lack of existential competition on the international scale is a precondition for the stagnation of federally-sponsored research. It's not that natural selection actually operated at the national scale, but that the possibility of it was internalized as a willingness to take big risks and swing for the fences. However, there are interesting alternate hypotheses:

  • The financial sector infected the government with a risk-management philosophy starting around the 1980’s. This is particularly interesting because it ties into society-wide/global trends. This would implicate the private sector as well.
  • Every organization ossifies with old age. State-sponsored science was young and effective once upon a time, but it has inevitably aged.

What to do?

Aside from a war or power-cycling the entire enterprise, is there any way to improve the system? Perhaps, although I don't claim to have a solution; I'm just trying to stimulate discussion. It's a hard problem, no doubt. Without the feedback mechanism of the market, there is no natural selection pressure on grantmakers. All that remains are subjective evaluations, or artificial quantitative measures (like publication count) that are subject to being gamed. Any feedback loop is also limited in effectiveness by the timescale it operates on. As the timescales approach the duration of a career, the chance for feedback goes to zero. Unfortunately, the issues are too complex to rely on anything less than human intelligence (ie, on algorithms that could be back-tested).

The DARPA model and the COTS program are two interesting ideas for how to change the incentive structures of government-sponsored R&D. In both, a key insight is to reduce the amount of control exercised by the funding agency over how the funding recipient spends the money. However, there's a tendency for reversion to the mean: unless the upstream incentives or selection pressures for the bureaucrats are modified, controls will creep back in after the inevitable screw-ups (ex: Solyndra). DARPA is supposed to have addressed this with term limits, such that there is no pressure to justify the decisions down the road b/c there's no career path for the project managers to maintain. It's not clear that this has been successful long-term: there's no pressure to avoid failures, but correspondingly little reward for success. With COTS, the fixed-price contract absolutely changes the incentives compared to the cost-plus contracts typical of defense. While the program has been successful so far (SpaceX delivering cargo & astronauts to the ISS), the real question is whether it will survive its first major failure.

On to the quotes:

Government-funded research is doomed to stagnation

It is time for corporations and entrepreneurs to recognize that they can no longer rely on governments to fund and on academia to conduct fundamental research. Instead of doing translational research and simply bringing academia’s fruits to market, they have to become bolder and take responsibility for what the future will look like, fund fundamental research, and bring to life a new vision for philanthropy.


Why?

Firstly, since academia lacks the mechanism for competitive displacement, bloat accumulating over time and inevitably rising risk-aversion can grow without bound [11]. If a firm becomes inefficient, it collapses and is replaced by another one. If science becomes inefficient… it continues to take in money and people and, well, scientific progress slows down. Secondly, as I pointed out above, science is severely afflicted by the problem of misaligned timescales. Grantmakers’ planning horizons (note that I’m talking about specific individuals who make specific decisions, not abstract institutions that theoretically care about the long-term) are severely limited by their own career planning horizons and by their understanding of what it takes to work on fundamental problems with little short-term payoff.

Alexey Guzey, Reviving Patronage and Revolutionary Industrial Research

[S]oftware […] has a convex performance-to-value curve, as with creative fields. Most fields have concave curves, in that not screwing up is more important than hitting the top bars. This convexity means two things. First, it means that managing to the middle decreases total performance. Second, it means that risk (measured by the increasing first derivative of a convex function) is positively rather than negatively correlated with expected return.

michaelochurch on reddit

Now, the Valley tech scene does just fine with a 90% failure rate, because the win associated with backing successful execution of a really disruptive idea is much more than a 10x return on investment. The good ones return 50x, and the home runs deliver 100x or more.

Jocelyn Goldfein, The Innovation Dead End


“So why am I not an academic? There are many factors, […] but most of them can be summarized as “academia is a lousy place to do novel research”. [….] My supervisor cautioned me of the risks of doing work which was overly novel as a young academic: Committees don’t know what to make of you, and they don’t have any reputational prior to fall back upon. […] In many ways, starting my own company has given me the sort of freedom which academics aspire to. […A]cademic institutions systemically promote exactly the sort of short-term optimization of which, ironically, the private sector is often accused. Is entrepreneurship a trap? No; right now, it’s one of the only ways to avoid being trapped.

Colin Percival, On the Use of a Life

Why not just have the government, or some large almost-government organization like Fannie Mae, do the venture investing instead of private funds?
I’ll tell you why that wouldn’t work. Because then you’re asking government or almost-government employees to do the one thing they are least able to do: take risks.
As anyone who has worked for the government knows, the important thing is not to make the right choices, but to make choices that can be justified later if they fail.

Paul Graham, Inequality and Risk

In today’s university funding system, you need grants (well, maybe you don’t truly need them once you have tenure, but they’re very nice to have). So who decides which people get the grants? It’s their peers, who are all working on exactly the same things that everybody is working on. And if you submit a proposal that says “I’m going to go off and work on this crazy idea, and maybe there’s a one in a thousand chance that I’ll discover some of the secrets of the universe, and a 99.9% chance that I’ll come up with bubkes,” you get turned down. But if a thousand really smart people did this, maybe we’d actually have a chance of making some progress. […] This is roughly how I discovered the quantum factoring algorithm.

Peter Shor comment on Peter Woit’s blog

Shor concludes with an appeal for individual scientists to sneak off and do 'bootleg research' in their spare time. I think this is misguided — we should fix the system, not rely on individuals to buck the incentive structure. Individual researchers working in their spare time are limited in the scope of what they can address. While quantum information theory requires little more than pencil and paper, fusion energy research, for example, is not amenable to this approach.

Well-meaning but disastrous government initiatives to support the startup ecosystem relentlessly pump money into the startup scene. This money is advertised as “free” or “non-dilutive”, but in reality it’s the most expensive kind of money you can imagine: it’s distracting, it begs justification, it kills creativity, and it turns your startup into a government work program.

Alex Danco, Why the Canadian Tech Scene Doesn’t Work

You cannot simply add money and create a tech scene. If you do, then either that money will be too freely available and attract the wrong kind of opportunists, or it’ll be like grant money that takes up so much of the founder’s time and energy that it distracts them from actually starting and running the business in the first place.

Alex Danco, The social subsidy of angel investing

There’s a tremendous bias against taking risks. Everyone is trying to optimize their ass-covering.

Elon Musk interview, in the context of major aerospace government contractors

There are some very interesting tangents here as well: Alex Danco’s remark about “money will be too freely available and attract the wrong kind of opportunists” raises an interesting question about the moral hazard of easy VC money and the Pareto front for funding effectiveness vs selectiveness.

Private-sector funding isn’t perfect

The private sector can also suffer from the same perverse incentives. Style is just as important as substance in raising capital. Investors don’t have the technical expertise for due diligence on highly technical topics. Individual researchers have inherent incentives to be risk-averse, regardless of the funding source.

Because, you know, when it comes down to it, the pointy-haired boss doesn’t mind if his company gets their ass kicked, so long as no one can prove it’s his fault.

Paul Graham, Revenge of the Nerds

As a friend of mine said, “Most VCs can’t do anything that would sound bad to the kind of doofuses who run pension funds.” Angels can take greater risks because they don’t have to answer to anyone.

Paul Graham, The Hacker’s Guide to Investors

Private money is in some sense philanthropic, in that most VCs are not able to effectively pick winners, and angel investors are motivated less by expected financial reward than by social credit. This harkens back to Renaissance patronage of the sciences (see also Alexey Guzey’s article).

[T]he median VC loses money. That’s one of the most surprising things I’ve learned about VC while working on Y Combinator. Only a fraction of VCs even have positive returns. The rest exist to satisfy demand among fund managers for venture capital as an asset class.

Paul Graham, Angel Investing

… angel investing fulfils [sic] a completely different purpose in Silicon Valley than it does elsewhere. It’s not just a financial activity; it’s a social status exercise. [….] From the outside, angel investing may look like it’s motivated simply by money. But there’s more to it than that. To insiders, it’s more about your role and reputation within the community than it is about the money. The real motivator isn’t greed, it’s social standing.

Alex Danco, The Social Subsidy of Angel Investing

Much of what counts for successful funding in the private sector is personality, not substance:

A lot of what startup founders do is just posturing. It works. VCs themselves have no idea of the extent to which the startups they like are the ones that are best at selling themselves to VCs. [6] It’s exactly the same phenomenon we saw a step earlier. VCs get money by seeming confident to LPs, and founders get money by seeming confident to VCs.

Paul Graham, What Startups Are Really Like

The Silicon Valley tech ecosystem is a world of pattern matching amidst uncertainty. We pattern match ideas, we pattern match companies, but most of all we pattern match people. […] If you’re too different, you won’t fit the pattern at all, so people will ignore you.

Alex Danco, Social Capital in Silicon Valley

This could be related to the fact that venture capital doesn’t have a great substitute for peer review:

First-rate technical people do not generally hire themselves out to do due diligence for VCs. So the most difficult part for startup founders is often responding politely to the inane questions of the “expert” they send to look you over.

Paul Graham, How to Fund a Start-up

As an aside, I suspect that the issue of hiring experts for due diligence is probably more on the demand side. First, part of the prestige of being an investor lies in the appearance of discernment. Outsourcing that would reduce the social/emotional returns. Second, it creates a principal-agent problem. Third, there's probably some value to NOT having technical experts involved, because they would decrease the diversity of what gets funded.

There’s no silver bullet

Innovative people gravitate to startups, so it’s hard to tell cause from effect:

There’s no magic incentive structure that fixes this problem. If you think startups encourage innovation because early employees are rewarded disproportionately for success, think again. Startups benefit from selection bias — people looking to play it safe don’t take a job with a startup in the first place.

Jocelyn Goldfein, The Innovation Dead End

To the extent that you only live once, you can’t socialize the risk of taking a big gamble on your project:

As individuals, we have no portfolio strategy […] When we fail, most rational people respond by trying to avoid dumb ideas and pick smart bets with clear impact the next time. […T]he self-same employees who are innovating and taking risks with you today are going to become risk averse and careful once they start failing — and sooner or later, they (and you) will fail.

Jocelyn Goldfein, The Innovation Dead End

Stagnation is a function of the age of an organization. (Cf. Collapse: How Societies Choose to Fail or Succeed – complex societies respond to challenges by increasing in complexity, until diminishing returns set in.)

I have a hypothesis that a lot of the useless bureaucracy isn’t a function of government/private, or size of organization, it’s time. As an organization gets older, it’s gotten burned more times, and created more rules to deal with it. For example, all the “normal” space launch companies and NASA […] killed astronauts, and had a huge number of rockets explode […] This resulted in the creation of rules to attempt to reduce risk. […] SpaceX last year had a failure where they destroyed a customer’s satellite during an on-pad test where the satellite didn’t really need to be mounted. […] I’ll bet SpaceX is going to be a lot more cagey in the future about test firing rockets with payload aboard.

Eventually, a private company will go out of business, but the Government won’t declare bankruptcy. So there’s less limit on governments getting bloated with stuff like this.

CatCube on SlateStarCodex

In practice, there’s no surefire algorithm to tell the difference between truly novel good ideas and truly novel bad ideas.

[I]nnovative ideas are roughly indistinguishable from dumb ideas. If something seems like a clearly good idea, it is also an obvious idea. So everyone is doing it! The big ideas that fundamentally change the status quo usually have fatal-seeming flaws that keep (almost) everyone from pursuing them….

OK, OK, so picking good ideas is hard, but that’s what defines great innovators, right? The ability to tell the “crazy” idea from the “crazy awesome” idea? It would be nice to believe that — but empirically speaking, there’s no evidence to support it. The best track record of repeated technical innovation in high tech is the collective Silicon Valley startup scene. And guess what — 9 out of 10 VC-backed startups fail. The absolute best-of-the-best, most profitable VCs? Maybe a 25% success rate.

Jocelyn Goldfein, The Innovation Dead End

Maybe this mistaken attitude toward risk actually came from the business world and infected the government (which would explain why the government used to be able to do innovative things).

[T]he mantra of this technocratic system of management is the word “risk”, which if you do a word analysis, didn’t really exist in political coverage until the mid 80s. It comes from finance, but as economics colonised the whole of politics, that word spread everywhere, and everything becomes about risk-analysis and how to stop bad things happening in the future.

Politics gave up saying that it could change the world for the better and became a wing of management, saying instead that it could stop bad things from happening.

Adam Curtis, quoted in an interview The antidote to civilisational collapse

Intuitionism, probability, and quantum

This is a bit speculative. If you have expertise in this area, I’d love to hear your thoughts. The basic idea is that one can get something like intuitionist logic as a meta-logic derived by relaxing from a functional truth-valuation to a relational one. This logic, when extended to probability, might form a useful basis for modeling common-sense reasoning. A slight reinterpretation might form the basis for a novel perspective on quantum probabilities.

Intuitionism by ‘lifting’ classical logic

Classical logic considers truth-valuation to be a function from statements to a Boolean (denoted as 0,1 here). Relaxing the valuation from a function to a relation leads to a new valuation that is a function from statements to the power-set of the Booleans: { {}, {1}, {0}, {0,1} }. Let's say that the relation is "there exists a proof that." We can have a proof-valuation that says "statement A is related to 0 and 1", which is read as saying that there exists a proof that A is true and also a proof that A is false — which means we have derived a contradiction. If the proof-valuation is {}, then we say that there does not exist a proof about the truth value of A — which means that A is undecidable. Relating a statement to either {0} or {1} means that we can prove that A is true (and not false) or vice versa, which is what is meant by constructivist T/F values. Denote the proof-valuation {0,1} as C for 'contradiction', and {} as N for 'undecidable'. Let's denote the proof-valuation function as V.

We can also talk about membership of 0,1 in the return value of V: “1 in V(A) and not 0 in V(A)” is equivalent to V(A) = {1} = T, which means that A is proven to be true and not false. This is stronger than in classical logic, where we might derive a contradiction without knowing it. That is, proving A “is” true is not incompatible with proving that it is also false in classical logic — the axioms are simply inconsistent in that case. The proof-valuation thus seems a bit too strong: V(A) = T is equivalent to “A is proved true and the system is consistent.”

We might wish to make weaker statements that do not assert consistency of the system. In that case, we could state that “1 is in V(A)”, which means that it is possible to prove that A is true, but without making any claim about whether A can also be proved to be false. This corresponds to the classical-logic proof that a statement is true, at least in a system subject to the incompleteness theorem. Another way to state this: V(A) is in {{0,1},{1}}.

We can make other ‘vague’ statements like this as well. Let’s denote the set { {1}, {0} } as R for a ‘real’ statement. All statements in classical logic fall into this category, so that the law of the excluded middle holds. Denote { {}, {0,1} } as U for ‘unreal.’ Denote { {}, {0}, {0,1} } as D for ‘not true’ and { {}, {1}, {0,1} } as G for ‘not false’, and denote { {1}, {0,1} } as M and { {0}, {0,1} } as L for ‘possibly true/false (but also possibly inconsistent)’. It’s important to note that saying V(A) is in D (‘not true’) is not inconsistent with saying that V(A) is in G (‘not false’): if V(A) is in their intersection, then V(A) is in { {}, {0,1} }, which means that either A is undecidable or the system is inconsistent.
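
Continuing the sketch one level up (again, the names are just mine), the ‘vague’ justification-values are sets of proof-values, and the claim that ‘not true’ and ‘not false’ overlap exactly on the ‘unreal’ cases can be checked directly:

    # Proof-values, as before.
    T, F = frozenset({1}), frozenset({0})
    N, C = frozenset(), frozenset({0, 1})

    # Justification-values: sets of proof-values (sets of subsets of {0, 1}).
    R = frozenset({T, F})     # 'real': the excluded middle holds
    U = frozenset({N, C})     # 'unreal'
    D = frozenset({N, F, C})  # 'not true'
    G = frozenset({N, T, C})  # 'not false'
    M = frozenset({T, C})     # 'possibly true (but possibly inconsistent)'
    L = frozenset({F, C})     # 'possibly false (but possibly inconsistent)'

    # 'Not true' and 'not false' are compatible: their intersection is 'unreal',
    # i.e. either the statement is undecidable or the system is inconsistent.
    assert D & G == U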

These values live in the powerset of the powerset of the Booleans: we’ve now climbed up two meta-levels above classical logic. We might want a third valuation, distinct from the truth- and proof-valuations, to discuss this level. Let’s call this the justification-valuation, J. Thus, J(A) = R is read as “there is justification that the proof-value of A lies in R.” One can continue up this meta-ladder indefinitely, but for now this is as high as I want to go. At this level, we can already discuss many interesting topics.

For instance, we can see that classical logic, as noted, is tantamount to the assumption that all statements have J(A) = R: all are well-formed AND the system of axioms is consistent. Proof-by-contradiction is reified: the proof-value of A is exactly C = {0,1}. Arriving at a contradiction means that the system is not closed in R. We also have a more natural way to express statements that are undecidable, or might be undecidable. Providing a constructive proof takes the justification-value of a statement from a more-vague to a less-vague state. Framing intuitionist logic as simply dropping the law of the excluded middle prevents discussing the reasons why a proved truth-value may not be forthcoming (ie, distinguishing undecidability from contradictions).

Intuitionism and probability

In this section, I’m going to slightly reinterpret the meaning of the valuation V: rather than proof, I’d like to talk about V as “having empirical evidence (+ reasoning) that demonstrates.”

Suppose Q(A) is a ‘metaprobability’ of A: the probability distribution over the possible values of V(A). The ‘regular’ probability P(A) for a statement to be either true or false is then the conditional probability given that V(A) is in R: that is, assuming the question is ‘real,’ what are the relative probabilities of the truth values? This corresponds to the common notion of probability, but it’s not the whole story, because intuitionist logic seems to be somewhat more natural for folks. Constructing a probability theory for intuitionist logic might resolve some unintuitive effects of using standard probability theory for everyday reasoning.

Let L be the indicator function that denotes whether the question implied by A is a real question or not, ie L(A) = {1 if V(A) in R, 0 if V(A) in U}. Q(L(A)=1) denotes the probability of the statement being decidable at all, and the total probability factors as Q(A) = Q(A | L(A)) * Q(L(A)). The conditional metaprobability Q(A | L(A)) is all that exists in classical probability theory. This relates to Dempster-Shafer theory, in which probabilities need not sum to 1: here, the metaprobability does sum to 1, but there are additional states besides T/F that the probability can ‘leak’ into.
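
Here’s a minimal numerical sketch of that decomposition (the numbers and helper names are made up; this is not a worked-out theory):

    # Metaprobability Q(A): a distribution over the four proof-values of A.
    Q = {"T": 0.30, "F": 0.10, "N": 0.55, "C": 0.05}
    assert abs(sum(Q.values()) - 1.0) < 1e-12

    # L(A) = 1 iff V(A) lands in R = {T, F}, i.e. the question is 'real'.
    Q_real = Q["T"] + Q["F"]            # Q(L(A) = 1)

    # The ordinary probability is the conditional, given a 'real' question.
    P_true = Q["T"] / Q_real            # P(A) in the classical sense

    # Law of total probability at the meta-level: Q(T) = Q(T | real) * Q(real).
    assert abs(P_true * Q_real - Q["T"]) < 1e-12

    print(f"P(A) = {P_true:.2f}, but only Q(real) = {Q_real:.2f} of the "
          f"metaprobability sits on classical truth values")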

If Q is high for V(A) = C, this indicates a strong belief that a contradiction exists — we’ve assumed contradictory information. This should prompt us to revise our model and question our assumptions. By operating at this meta-level, our reasoning system can cope with contradictions without breaking down.

If Q is high for V(A) = N, this means we don’t have enough information to trust our answer. This kind of situation drives us to gather more data (observations or experiments).

If Q = 0.5 for both V(A) = T and V(A) = F, then we are very confident that our reasoning and evidence are correct, but that we cannot answer the question better than chance. This lets us distinguish that case from the situation where we have no evidence — where Q is large for V(A) = N. In the latter situation, we might still assign a small but nonzero metaprobability to each of V(A) = T and V(A) = F, according to Laplace’s principle of insufficient reason. But more importantly, now we can also encode that we have very little confidence in our determination that these two probabilities are equal! This is the key motivation for Dempster-Shafer theory. That theory gets hung up on how to interpret the missing probability, much the same as the difficulties with the use of ‘Null’ in programming & databases. Here we have two distinct possibilities (for under- and over-constrained truth values, respectively N and C) with very clear meanings.

As far as betting goes, if an agent finds that the meta-probabilities for T and F are both small, the agent might prefer not to bet at all, even if the odds ratio between T and F is large. Alternatively, perhaps one could split the remaining meta-probability equally between T and F for the purposes of taking the expectation. In that case, the effective log-odds would be roughly zero, even though the log-odds of the T/F split is quite large, and the risk/reward tradeoff looks very different. This needs to be fleshed out.
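
As a toy illustration of that redistribution rule (numbers made up), compare the naive log-odds of the T/F split to the effective log-odds after spreading the non-classical mass equally over T and F:

    import math

    Q = {"T": 0.09, "F": 0.01, "N": 0.88, "C": 0.02}

    # Naive log-odds from the T/F split alone: looks like a strong bet.
    naive_log_odds = math.log(Q["T"] / Q["F"])            # log(9) ~ 2.2

    # Spread the N and C mass equally over T and F before taking the expectation.
    spread = (Q["N"] + Q["C"]) / 2
    effective_log_odds = math.log((Q["T"] + spread) / (Q["F"] + spread))

    print(f"naive log-odds: {naive_log_odds:.2f}, "
          f"effective log-odds: {effective_log_odds:.2f}")
    # The effective log-odds comes out near zero, so the bet looks far less attractive.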

Intuitionistic extensions to Bayesian reasoning

Bayesian probability theory is an extremely powerful tool, but it has limitations. The following quote speaks to a key limitation, one I’ve experienced first-hand:

It is very important to understand the following point. Probability theory always gives us the estimates that are justified by the information that was actually used in the calculation. Generally, a person who has more relevant information will be able to do a different (more complicated) calculation, leading to better estimates. But of course, this presupposes that the extra information is actually true. If one puts false information into a probability calculation, then the probability theory will give optimal estimates based on false information: these could be very misleading. The onus is always on the user to tell the truth and nothing but the truth; probability theory has no safety device to detect falsehoods.

G. L. Bretthorst, “Bayesian Spectrum Analysis and Parameter Estimation,” Springer-Verlag series ‘Lecture Notes in Statistics’ #48, 1988, pp. 30-31. [Emphasis mine]

I should disclaim this by noting that the log of the evidence (that is, the Bayesian generalization of the ‘chi-squared’ value) does indicate when an event is extremely improbable according to a model. Nevertheless, there are pitfalls here, which intuitionist probability theory could avoid, by providing a way to explicitly represent contradictory and insufficient information in a direction orthogonal to the odds ratio between the probabilities for truth and falsity. It seems to me that probability is overloaded, but not because probability theory needs to be modified. Rather, common-sense reasoning is more akin to intuitionist logic, while probability theory is the natural extension of classical logic. Thus, creating a probability theory appropriate for intuitionism might prove valuable for modeling common-sense reasoning.

As a concrete example, consider the following scenario posed by N. Taleb: you are told that a coin is fair, but then you watch as it is flipped and comes up ‘heads’ 100 times in a row. You are offered a bet on the next toss. How would you bet? A naive application of probability theory would result in still believing the initial assertion that the coin was fair, while a more sophisticated application would conclude that the coin toss had been rigged. We would be less surprised to have been lied to than to see an event that is 30 orders of magnitude less probable. However, if we had rounded our initial probability of being lied to down to zero (suppose we had reason to trust the person tossing the coin), then we would never be able to update our probability to a nonzero value, because Bayesian updating is always multiplicative.
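
A quick numerical version of that last point (the priors are made up, and ‘rigged’ is simplified to ‘always comes up heads’): any nonzero prior on being lied to is overwhelmed by 100 heads, while a prior of exactly zero can never be updated away.

    # Likelihood of 100 heads under each hypothesis.
    like_fair = 0.5 ** 100      # ~ 8e-31
    like_rigged = 1.0           # 'rigged' = always heads, for simplicity

    for prior_rigged in (1e-6, 0.0):
        prior_fair = 1.0 - prior_rigged
        # Bayesian updating is multiplicative: a zero prior stays zero forever.
        post_rigged = (prior_rigged * like_rigged) / (
            prior_rigged * like_rigged + prior_fair * like_fair)
        print(f"prior P(rigged) = {prior_rigged:g} -> "
              f"posterior P(rigged) = {post_rigged:.6g}")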

We have some freedom to determine the updating rules for intuitionist probabilities. In particular, one might want probability to be transferred away from the ‘classical’ T/F values and toward C when we have contradictory evidence/assumptions. Also, by reserving probability for N when we have little or no information, we could set T and/or F to zero and still allow them to become nonzero later. I haven’t worked this all out yet.

Intuitionism and quantum probability?

I have a hunch that one could interpret the complex probability amplitudes in quantum mechanics by a similar process of ‘lifting’ the state space from classical truth values to some other meta-valuation, possibly the ‘observation’ valuation. That is, one would say “the probability that statement A is observed to be true is 50%”, rather than saying “the probability that the statement is true is 50%.” This is in the spirit of the Copenhagen interpretation of QM. Note that, before an event has occurred, the probability of its outcome having been observed to be anything is zero! That is, the N-value (not enough information) comes into play naturally. This also takes care of the counterfactual difficulties: since the measurement was not conducted, the probability of the outcome will always be 100% on the N-value.

On the other hand, we can rule out the possibility of contradictory observations: the C-value always has zero probability, because objective reality is single-valued. I suspect that this, together with the usual normalization condition, suffices to determine the Born rule. Together they reduce the degrees of freedom of the meta-probability from 4 to 2. This allows mapping the meta-probability to a complex number, which also has two degrees of freedom. I haven’t figured out how to do this mapping yet.

Possible interesting consequences: wavefunction collapse might be viewed as the transition from an indeterminate state (N) to a definite state (T/F). There could also be a change in the reason that the marginal probabilities are 50/50 for the outcome of one of the spin measurements in a Bell entanglement experiment when the polarizers are perpendicular: initially, it would be due to lack of evidence (ie, all metaprobability is concentrated on the N-value), but once the measurement on the other side has been made and the result transmitted, the probability switches to being 50% true and 50% false, because the outcome is now a matter of classical probability. Strange quantum-information effects such as negative conditional entropy and superdense coding might be easier to comprehend in this interpretation.

This idea was partly inspired by the two-state-vector formalism, which holds that the wavefunction is not a complete determination of the quantum state. Rather, two wavefunctions are necessary, one from the past and one from the future. Representing the incompleteness of the state via constructive/intuitionistic logic seemed natural: the ‘truth’ of a theorem is contingent upon the actual historical progress of mathematics. Rather than being a Platonic ideal that is true or false for all time, a theorem only assumes a ‘truth’ value once a proof (or disproof) exists. This same contingency (or contextuality, if you will) naturally applies to statements deriving their truth from the observed outcomes of experiments.

[Edit: This was inspired in part by Ronnie Hermens’ thesis, “Quantum Mechanics: From Realism to Intuitionism – A Mathematical and Philosophical Investigation.”]

IDL gotchas and misfeatures

IDL is a popular language in the plasma physics & astrophysics world. It has a number of poor design choices that lead to nasty pitfalls and gotchas. Here’s a list of the ones I’ve encountered. Maybe this will save some poor grad student a headache.

https://www.harrisgeospatial.com/docs/automatic_compilation_an.html
“Versions of IDL prior to version 5.0 used parentheses to indicate array subscripts. Because function calls use parentheses as well, the IDL compiler is not able to distinguish between arrays and functions by examining the statement syntax.”
This causes all kinds of problems. One nasty one: a call to a function that IDL can’t find, if the call uses keyword arguments, produces a non-descriptive ‘syntax error,’ because it is interpreted as an attempt to index an array using a keyword argument! This is a very common problem when dealing with a new codebase: if your bash config file doesn’t export the correct directories in the IDL_PATH environment variable, IDL can’t find the libraries where utility functions are defined, and you’ll get dozens of errors like this. It took me forever to realize what was happening! 😡

https://www.harrisgeospatial.com/docs/catch.html
There is no ‘try.’ Error catching works by jumping backward in the file to the place where the error-handler is located, executing whatever is in that block, and then trying the failed statement again. There is no way to go around a statement if it doesn’t work.

You can ‘declare’ variables by passing them to a routine! They come into existence as essentially null variables, and then, instead of the routine being written as a function with a return value, the output is stuffed into the passed-in variable as a ‘side effect.’ This is extremely confusing when reading someone else’s code, if you are coming from another language that makes any kind of sense. Because a function can only return a single value, some people prefer this backward method of getting data out of routines.

The ‘help’ command only helps you if you already know whether an argument is a function, procedure, or data structure. Because you can index an array using parentheses rather than brackets, sometimes you can’t tell if an object is a function or an array by looking at the context!

Common blocks: if you ‘reference’ a common block (or load a save file for that matter) it pulls in the variables as named in the block/file and dumps them straight into the local namespace! This can overwrite other variables of the same name. It’s like python’s “from numpy import *” but for data structures. You can optionally supply a list of aliases for each of the variables you want to ‘import’ from the common block, but if you want the n-th one only, you have to assign junk names to the other n-1 variables that come before it! You can’t just index in, or even address it by name!

When an error occurs, IDL automatically leaves you at whatever place in your code the error happened. Good for debugging, but terribly confusing when you aren’t used to it and expect to be back at global scope.

IDL imports modules on the fly when you call a function, unlike Python, where you must import the module first (although you can be sloppy in Python and import * from a package, which has a similar effect).

In conjunction with the problem of being able to declare a variable by passing it, you can pass an undefined variable, and the traceback indicates that the problem is occurring inside the called function, not the calling one! This one will gaslight you hardcore, because you assume the incoming variable must exist; otherwise the call itself would have failed, rather than the evaluation inside the function.

The .compile command doesn’t work inside a .pro file, unlike an import statement in python. You have to use .compile at the interpreter or something @’ed into the interpreter. It’s not clear if there is a way to get around this restriction and specify which file you want to compile a procedure from, other than messing with the IDL path. So the scripting environment and the programming environment don’t behave the same.

If you slice an array down to length one in some dimension, that dimension only goes away if it is on the end. For example, if x has dimensions [6,9,7], then x[*,i,*] gives back a [6,1,7] array, but x[*,*,i] gives [6,9] and x[i,*,*] gives [1,9,7].
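
For contrast, here’s the NumPy behavior (Python, not IDL, just as a sanity check): integer indexing drops the dimension no matter where it sits.

    import numpy as np

    x = np.zeros((6, 9, 7))
    i = 3

    # NumPy removes the indexed dimension regardless of its position...
    print(x[:, i, :].shape)   # (6, 7)
    print(x[:, :, i].shape)   # (6, 9)
    print(x[i, :, :].shape)   # (9, 7)

    # ...whereas IDL only drops a length-1 dimension when it is trailing.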

If you make a mistake when broadcasting arrays, you won’t know it, because array combination just truncates to the smallest length along each dimension.
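
Again for contrast (Python, not IDL): NumPy refuses to combine mismatched shapes instead of silently truncating.

    import numpy as np

    a = np.arange(5)
    b = np.arange(3)

    try:
        c = a + b                      # IDL would quietly give a length-3 result
    except ValueError as err:
        print("NumPy refuses:", err)   # shapes (5,) and (3,) don't broadcast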

If your plots all show up red instead of whatever color they were supposed to be, make sure to issue “device, decomposed=0”. This allows color numbers to be interpreted as indices into a color table, rather than as hex representations of RGB colors. (You’re probably using “loadct, 39” to load the jet-like color table.) Set “!p.color=0” to make the text/axis color black, “!p.background=255” to make the plot background white, and “!p.multi=0” to get rid of subplots.

Suppose you make an anonymous structure s that contains an array x[5,10] among other fields, and then you make an array of those structures: ss[i]=s for i=0 to 20. If you look at ss.x, you see a [5,10,20] array! However, you can’t index this array properly: the last dimension is an illusion. If you select any one element from the first two dimensions (ie, ss.x[4,5]), you get back an array that is 20 long. But you can’t do ss.x[*,*,0], because that last dimension is a phantom; you have to write ss[0].x[*,*] to access the array that way. So confusing!

Also, I just got an integer overflow from computing 201*201: the default type for an integer literal is a 16-bit short!

If you do a contour overplot, you have to specify the levels for the second contour explicitly (you can’t use nlevels!), or else it will default to the set of level values from the previous contour. That usually means you don’t see the second set of contours at all and have no idea why (it could just as easily have been the x/y range, or a color issue, or…).

Summary of fusion critiques

This is the third post in a series on nuclear fusion. The goal of the series is to address assumptions made in critiques of fusion energy, with an eye toward solutions. Lidsky’s and Jassby’s critiques overlap a bit, so here’s a combined list of the main arguments, broken into three categories. In this series, I’ll focus on the issues that impact the economics. The technical & societal/safety issues do feed into the economics, so I’ll comment on those connections as they come up.

Summary of critiques

  • Economic issues:
    1. Fusion reactors will have lower power density while being more complex, compared to fission reactors. Therefore, they will be larger, more expensive, and slower to construct, hence uneconomical.
    2. Radiation damage will require frequent replacement of core parts of the reactor. This replacement, being complicated & robotic/remote, will be slow. Therefore, the down-time fraction of the plant will be large, and the operations & maintenance cost will be high, making it uneconomical.
    3. The complexity of a fusion reactor makes accidents more likely. An accident that releases radioactivity would lead to decommissioning of the plant because repair would be impossible. Therefore fusion plants are economically risky.
  • Safety/societal issues:
    1. Although fusion reactors will (with appropriate material choices) have less high-level radioactive waste, they will produce much more low-level waste than fission reactors.
    2. Fusion reactors can be used to breed plutonium (weapons material) from natural uranium. Tritium is also a weapons material.
    3. Accidents & routine leakage will release radioactive tritium into the environment.
  • Primarily technical issues:
    1. Tritium breeding has narrow margins & probably won’t be successful.
    2. Fusion reactors require siphoning off some of the electricity output to sustain/control the reaction. This ‘drag’ will consume most/all of the output electricity (unless the plant is very large).
    3. Materials that can withstand large heat fluxes and large neutron fluxes and don’t become highly radioactive are hard to find.

Assumptions used in critiques

There are a number of assumptions made in the critiques, some of which are unstated. The most consequential ones are:

  1. The deuterium-tritium reaction will be used.
  2. The reaction will be thermonuclear (ie, no cold fusion, muon catalysis, highly non-Maxwellian distributions, etc)
  3. Reactors will be steady-state magnetic confinement devices.
  4. Specifically, the magnetic confinement device will be a tokamak.
  5. Magnetic field coils will be limited to about 5 Tesla field strength.
  6. The reactor first wall will be solid material.

Not all the critiques depend on all the assumptions. I’ll indicate which assumptions are involved in each critique item. Notably, 7 of the 9 critiques involve the radioactivity & tritium usage of fusion. This motivates considering ‘aneutronic’ fusion reactions. However, aneutronic reactions produce orders of magnitude lower power density (all else being equal) versus deuterium-tritium. Thus, most fusion efforts focus on deuterium-tritium despite the radioactivity & tritium concerns. My next post will discuss the fuel cycle trade-off as part of the power density discussion.

Anti-perfectionism advice

I used to subscribe to the Mark Twain philosophy:

It is better to keep your mouth closed and let people think you are a fool than to open it and remove all doubt.

Mark Twain

It was an excuse I used to justify my perfectionism. Here are some antidotes that have helped me overcome this tendency & put my thoughts out here.

A book is never finished. It is only abandoned.

Honoré De Balzac

Ideas get developed in the process of explaining them to the right kind of person. You need that resistance, just as a carver needs the resistance of the wood.

Paul Graham, “Ideas”

See also: rubber duck debugging

Writing is nature’s way of letting you know how sloppy your thinking is.

Leslie Lamport

Constantly seek criticism. A well thought out critique of whatever you’re doing is as valuable as gold.

Elon Musk

The best way to get the right answer on the internet is not to ask a question; it’s to post the wrong answer.

Cunningham’s law

It is better to be interesting and wrong than boring and right.

Fred Hoyle

Don’t worry about people stealing your ideas. If your ideas are any good, you’ll have to ram them down people’s throats.

Howard Aiken, quoted by Paul Graham in “Googles”

Don’t worry about being perfect. Make it bad, then make it better.

Robert Glazer

If you want to be certain then you are apt to be obsolete.

Richard Hamming, You and Your Research (1986) p187

We have a habit in writing articles published in scientific journals to make the work as finished as possible, to cover up all the tracks, to not worry about the blind alleys or describe how you had the wrong idea first, and so on. So there isn’t any place to publish, in a dignified manner, what you actually did in order to get to do the work.

Richard P. Feynman, Nobel Lecture, 1965