Imagine trying to learn the statistical regularities of video data streams. If we had to search for regularities by considering the probability distribution over every possible subset of pixels & their value histories, learning would be prohibitively expensive. Obviously, one would first consider local subsets (adjacent pixels & short time histories). However, that still leaves an intractable number of variables, and it can’t explain learning about regularities that span extended stretches of time or space. Instead, we need to assume that all multivariate correlations can be discovered by detecting bivariate correlations alone. This reduces the size of the search space from exponential in the number of variables to merely quadratic, while also reducing the amount of data necessary to accurately sample the distributions of interest. Improvements over quadratic scaling could then come from locality biases (without requiring strict locality limitations).
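To make the scaling concrete, here is a minimal sketch (the choice of N, the sample count, and the detection threshold are arbitrary, purely for illustration): enumerating all variable subsets blows up exponentially, while a pairwise correlation scan is just a quadratic-size matrix.

```python
import numpy as np

N = 32                        # number of variables (e.g. a small patch of pixels)
num_subsets = 2 ** N          # every possible subset of variables
num_pairs = N * (N - 1) // 2  # bivariate candidates only

print(f"subsets: {num_subsets:,}   pairs: {num_pairs:,}")
# subsets: 4,294,967,296   pairs: 496

# A pairwise scan over T samples costs only O(N^2 * T):
T = 1_000
data = np.random.randn(T, N)
corr = np.corrcoef(data, rowvar=False)                      # N x N correlation matrix
candidates = np.argwhere(np.abs(np.triu(corr, k=1)) > 0.1)  # arbitrary threshold
print(f"candidate correlated pairs: {len(candidates)}")
```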
How can we be sure that we will not miss a trivariate (or higher-arity) correlation? Consider the simplest case in which it is theoretically impossible to discover a trivariate correlation from its constituent bivariate correlations: the relation Z = XOR(X,Y), where X and Y are uniformly and independently distributed binary variables (think of a two-way light switch). In this special case, there are no bivariate traces of the trivariate dependence.
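A quick way to see this is to enumerate the four equally likely (X, Y) outcomes and check that every pairwise correlation vanishes, even though Z is a deterministic function of the pair (a throwaway check, not part of any larger implementation):

```python
import itertools
import numpy as np

# The four equally likely outcomes of (X, Y), with Z = XOR(X, Y).
rows = np.array([(x, y, x ^ y) for x, y in itertools.product([0, 1], repeat=2)])
X, Y, Z = rows[:, 0], rows[:, 1], rows[:, 2]

print(np.corrcoef([X, Y, Z]))
# Off-diagonal entries are all exactly 0: no pairwise trace,
# even though Z is completely determined by (X, Y).
```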
This is a very contrived scenario. Any deviation from an independent, uniform distribution for the joint variable (X,Y) results in non-zero bivariate correlations. We can thus detect the trivariate correlation via the presence of two bivariate ones that share a common variable, without directly having to search for trivariate correlations. In practice, the fact that our sampling frequencies won’t in general be equal to the ideal probabilities ensures that such contrived situations are irrelevant. At a higher level, one can view this as a mechanism for benefiting from distributional shifts, which tend to confound contemporary deep neural networks, much as dither noise is useful for improving the performance of analog-to-digital conversion.
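As a sanity check (the skew parameters below are chosen arbitrarily for illustration), skewing the marginals of X and Y away from 1/2 makes both corr(X, Z) and corr(Y, Z) non-zero, and the two bivariate correlations share the common variable Z:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
X = (rng.random(n) < 0.7).astype(int)   # skewed marginal: P(X=1) = 0.7
Y = (rng.random(n) < 0.7).astype(int)   # skewed marginal: P(Y=1) = 0.7
Z = X ^ Y

c = np.corrcoef([X, Y, Z])
print(f"corr(X,Z) = {c[0, 2]:+.2f}   corr(Y,Z) = {c[1, 2]:+.2f}")
# Both come out around -0.37. Analytically, cov(Y, Z) = q(1-q)(1-2p)
# for P(X=1) = p and P(Y=1) = q, which vanishes only at p = 1/2.
```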
For example, suppose that we happened to experience a run of X = [0,0,0,0,0] while Y = [1,0,1,0,1]. In this sampled distribution, Y and Z will be perfectly correlated. This would alert us to the presence of a correlation between them. If we then experienced the same series of values for Y while having a run of 1’s for X, we could see that the correlation between Y and Z was reversed. We might thus hypothesize two cases: (1) when X = 0, Y = Z; (2) when X = 1, Y = NOT(Z). This correctly explains the XOR.
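The same toy example in code: holding X fixed within a run, the Y–Z correlation is perfect, and its sign flips between the two runs, which is exactly the two-case hypothesis above:

```python
import numpy as np

Y = np.array([1, 0, 1, 0, 1])

for x in (0, 1):
    X = np.full(5, x)
    Z = X ^ Y
    print(f"run with X = {x}: corr(Y, Z) = {np.corrcoef(Y, Z)[0, 1]:+.1f}")
# run with X = 0: corr(Y, Z) = +1.0
# run with X = 1: corr(Y, Z) = -1.0
```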
Well, you also need experiments
If we have a fairly low threshold for detection of correlations, then we’re bound to get lots of false positives. (That is to say, the clustering illusion and apophenia may be natural consequences of this strategy. It’s a good sign when your model has the same failure modes as the human brain.) We don’t necessarily have our entire history of observations available, so we will generally need to take fresh data. Anecdotally, if I’m not able to test out a hypothesis quickly, I tend to forget about it. It’s mentally draining to juggle multiple competing hypotheses. We have limited capacity to remain uncertain about concepts, so we need to settle questions promptly.
These don’t necessarily need to be experiments, for the following reason. Even given the same data, different learners may develop different hypothesized correlations. Therefore, curated data that is relevant to the hypothesis the learner is currently entertaining will be more valuable than random data. Of course, by allowing experiments, we also gain the ability to draw causal conclusions, beyond merely correlational ones.
Why is causality important? It comes back to distributional shifts, or non-stationary distributions. Causal models have a structure like P(Z|X,Y)*P(X,Y), where it is asserted that the distribution P(X,Y) may change, but the conditional P(Z|X,Y) is invariant. If a phenomenological model learns a representation like P(X,Y|Z)*P(Z), any change to P(X,Y) may necessitate re-learning the entire joint distribution P(X,Y,Z), whereas a causal model retains the invariant conditional & only needs to relearn the smaller distribution P(X,Y). When applied to situations with large causal networks, this results in catastrophic forgetting for phenomenological models whenever the independent variables have non-stationary distributions.
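Here is a small sketch of that invariance argument (the noisy-XOR mechanism and the shifted input probabilities are toy choices of mine, not anything canonical): when the input distribution P(X,Y) shifts, the re-estimated causal conditional P(Z|X,Y) barely moves, while P(Z), and hence the anti-causal factor P(X,Y|Z), must be re-learned:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(p_x, p_y, n=200_000):
    """Fixed mechanism: Z is a noisy XOR of (X, Y); only P(X, Y) varies."""
    X = (rng.random(n) < p_x).astype(int)
    Y = (rng.random(n) < p_y).astype(int)
    Z = np.where(rng.random(n) < 0.9, X ^ Y, 1 - (X ^ Y))
    return X, Y, Z

def fit(X, Y, Z):
    """Estimate the causal piece P(Z=1|X,Y) and the marginal P(Z=1)."""
    p_z_given_xy = np.zeros((2, 2))
    for x in (0, 1):
        for y in (0, 1):
            p_z_given_xy[x, y] = Z[(X == x) & (Y == y)].mean()
    return p_z_given_xy, Z.mean()

before = fit(*sample(0.5, 0.5))
after = fit(*sample(0.9, 0.8))   # distributional shift on the inputs

print("P(Z=1|X,Y) before:\n", before[0].round(2))   # ~[[0.1, 0.9], [0.9, 0.1]]
print("P(Z=1|X,Y) after:\n", after[0].round(2))     # ~unchanged
print("P(Z=1) before/after:", round(before[1], 2), round(after[1], 2))  # 0.5 -> ~0.31
# Since P(Z) moves, the anti-causal factor P(X,Y|Z) = P(Z|X,Y)P(X,Y)/P(Z)
# moves too, so the phenomenological factorization has to be re-learned.
```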
Acknowledgements
In a very intriguing article, “Why does deep and cheap learning work so well?”, Lin & Tegmark suggest that neural networks are effective at learning because their structures mirror the structure of physical laws in multiple ways. In particular, they remark that the highest-arity polynomials in the Standard Model Hamiltonian of particle physics are quartic (N=4). I don’t think this part of their argument holds much water, but undoubtedly the existence of structure to detect in the first place arises from physical laws.
Another inspiration was the integrated-information theory (IIT) of consciousness. While I wholeheartedly agree with Scott Aaronson’s dismantling of its proponents’ claims to have defined consciousness mathematically, I think they are groping towards a way to define non-trivial information processing (computation) as distinct from mere transmission (communication). It comes down to arity: communication is about N=2, but computation is about N>2. Arguably the best paper so far mathematizing this idea is “Quantifying Unique Information” by Bertschinger et al. However, the same idea came up many years ago in the context of probabilistic databases, and also in information geometry.
A third inspiration was Greg Ver Steeg’s discussion of the “grue problem” and articles about explaining total correlations. Ver Steeg’s approach is interesting because it starts with latent variables rather than directly comparing raw data variables to one another; that is arguably closer to what we know about the brain’s structure than the approach I’m discussing here.
This ties into an interesting remark from Fredrik Linåker’s dissertation: there’s a chicken & egg problem if you think about learning as trying to compress sensory information. The learner has finite memory, so the compression needs to happen first, but without the benefit of hindsight, it’s not possible to know what is relevant for later learning & thus needs to be represented and stored. In particular, Linåker observes that lossy compression amounts to memorizing the common cases and ignoring outliers, while assuming that the data distribution is stationary. However, the objective of lossy compression is to minimize the amount of information it takes to represent the world given a certain degree of acceptable error. This is not the right objective for learning, although reducing the size of the representation is part of both. The difference is that lossy compression is essentially driven to represent the data primarily in terms of the univariate marginal distributions. For learning, this is less important, and the structure (i.e., information, correlations) that exists at higher arity comes to the foreground.
I found out after drafting this post that Yoshua Bengio has been thinking along much the same lines. Also, transformer networks seem to embody similar ideas about sparse detection of correlations.