The universal approximation puzzle
Every student of machine learning encounters the universal approximation theorem (UAT) early when studying neural networks. The version most people remember is informal: a sufficiently wide neural network can approximate any reasonable function. The formal version is sharper, but says the same thing in technical language. Formal statement. For any continuous on a compact and any , there exists a single-hidden-layer feedforward network with a non-polynomial activation that approximates uniformly within on . Hornik, Stinchcombe, and White (Hornik et al., 1989) proved it for arbitrary squashing activations; Cybenko (Cybenko, 1989) gave the sigmoid case the same year; Leshno et al. (Leshno et al., 1993) extended it to any non-polynomial activation.
UAT speaks to expressivity: whether the function class even contains a good approximator of the target function . It does not speak to learnability, whether stochastic gradient descent (SGD) on a finite training set recovers a good from that class — and it does not speak to generalization, whether the recovered performs on unseen data drawn from the same distribution. Expressivity: the class contains a good approximator. Learnability: a finite-sample procedure finds it. Generalization: the found model performs on unseen data. UAT addresses expressivity only. The other two are where most of the engineering happens.
We ended the previous chapter by asking ourselves why deep tabular models didn’t overtake tree-based models by the end of the 2010s when the same thing happened in other domains like vision and text. Our goal in this chapter is to build a good understanding of why — what the MLP’s implicit prior actually looks like, and where it goes wrong on tabular structure. We will explore three mismatches between the MLP prior and tabular data: rotational invariance, spectral bias, and irrelevant-feature sensitivity. Each one is supported by both a theoretical argument and a controlled empirical test.
The MLP priors that fail on tables
Deep learning methods leverage representation learning The canonical reference on representation learning’s goals is Bengio, Courville, and Vincent’s 2013 review (Bengio et al., 2013). to learn from raw data. The representations learned are heavily influenced by the architecture of the model used, and are the principal explanation for why deep learners work so well in certain domains. Convolutions exploit translation invariance and transformers exploit positional order, both build representations on top of these core abstractions. But in tabular data there is no geometric relation between columns and there is no implicit ordering to exploit; the architecture has to commit to a prior of its own rather than borrow one from the substrate.
We will begin by inspecting the implicit prior of the MLP itself. Three aspects of that prior matter for our discussion here. The first is about the hypothesis space (what functions the model is biased toward), where an asymmetry separates axis-aligned (trees) from rotation-symmetric (MLPs) models. The second is about the optimization dynamics on that hypothesis space — which functions get learned quickly vs. slowly with SGD. The third is a corollary on the smoothness imposed by MLPs, which tends to round “cliffs” into ramps and also struggles to drive irrelevant-column weights to zero.
Rotational invariance
Primarily, we are interested in rotational invariance as a property of our learning procedure We will be abusing some mathematical notation here as is not deterministic, this same argument can be extended to input/output distributions to accommodate for that.. More concretely we take a training set on , pre-multiply every input by an orthogonal matrix An orthogonal matrix satisfies : its columns are mutually orthogonal unit vectors. Geometrically is a rotation, a reflection, or a composition of the two, i.e. it preserves every distance and angle in the input space. The inner-product geometry of the features is unchanged but the coordinate axes are scrambled into linear combinations of the originals. to obtain a rotated dataset , retrain the algorithm on the rotated data, and evaluate the new classifier at the rotated test point . If for every the rotated-and-retrained classifier evaluated at agrees with the original classifier evaluated at , the algorithm is rotationally invariant (Ng, 2004):
To emphasize again, this is not a property of the model but instead a property of the learning algorithm, clearly for a fixed trained network. If a learning procedure is rotationally invariant in this sense, then rotating the input space and retraining is, from the algorithm’s perspective, the same as not rotating at all.
MLPs are rotationally invariant
A generic MLP variant with L2-weight-decay using SGD is a member of this rotationally-invariant class of learners:
(i) Symmetric initialisation. Standard initialization schemes (Xavier, He) sample weights from distributions that depend only on layer width, not on which input dimension a given weight connects to. The init has no view on which axis is which, so it commits no bias toward one rather than another.
(ii) L2 weight decay. The penalty depends on only through its Euclidean norm, and orthogonal rotations preserve Euclidean norms:
thus, weight decay imposes the same penalty on as on any rotated copy.
(iii) Rotation-equivariant gradients. A gradient step on rotated inputs is the rotated version of the same step on the original inputs:
Combined with (i) and (ii), this preserves the property along the entire training trajectory. To put it plainly, if we rotate the inputs and retrain, we recover a rotated copy of the original network with identical predictions on the rotated test point.
This is a contrast to tree-based learners. A split asks one question about one column at a time (x_5 ≤ 0.7), so individual split decisions are always axis aligned. A rotation very clearly messes up the input space and the same decision tree cannot be recovered given this new input space, as the axes are completely different!
The formal cost
Rotational invariance has a sample-complexity cost, which is nicely illustrated by Ng (2004). Ng analyses logistic regression on bounded inputs . Out of features, only are genuinely informative, each carrying a weight bounded the remaining features are pure noise. The question Ng asks: how does sample complexity scale as the total dimension grows while stays fixed?
A clear contrast emrges when we compare L1 and L2 regularisers under rotation for the same logistic regression task. L2 keeps weights small; it is the regulariser implicit in most MLP training. L1 keeps weights small and pushes them exactly to zero where possible, so it doubles as feature selection (this is the LASSO formulation in linear regression). The resulting weight vectors are sparse and axis-aligned, the same shape of solution a tree’s splits produce one column at a time. L2 leaves every coordinate active in the prediction, much like an MLP. While both regularize the weights, their behavior differs under rotation. For any orthogonal :
The L1 and L2 unit balls in 2D. The L1 ball (left) has corners at the coordinate axes; rotating it 30° (dashed) moves those corners off the axes. The L2 ball (right) is a circle, and any rotation maps it back to itself. L1 regularisation pulls solutions toward the corners of its ball, which are exactly the axis-aligned sparse weight vectors. Rotation breaks that bias for L1 but leaves L2 unchanged.
The L2 ball is round; the L1 “ball” has corners at the axes. The L2 unit ball is a sphere, symmetric under any rotation. The L1 unit ball is an octahedron whose corners point along the coordinate axes. L2 regularisation just says “keep small”; L1 says “keep small and sparse.” Pulling solutions toward the L1 corners is exactly what makes L1 select features, and what makes it depend on which axis is which. Rotation leaves L2 regularisation indifferent. Rotation moves L1’s corners off the axes and the bias toward axis-aligned sparsity disappears. Only a regulariser that is asymmetric in the input axes can select features along the axes, and rotational invariance precludes that. Ng proves both an upper and a lower bound:
L1-regularised logistic regression (Theorem 3.1) is dimension-tolerant. With , , , fixed,
training examples suffice for excess logloss at most with probability . Logloss controls 0/1 misclassification too: Ng shows , so the same training-set bound applies to misclassification. The dependence on the total dimension is only logarithmic. Adding noise features inflates the requirement by roughly examples, so doubling the feature count costs a constant handful of extra examples instead of doubling the requirement. A hundred-example training set can already cope with a thousand-dimensional input where only one coordinate is informative, the regime shown empirically in the figure below.
Any rotationally invariant learner (Theorem 4.3) is not. There exists at least one problem of the form , where labels are a deterministic threshold on a single coordinate, on which the learner needs
training examples to attain misclassification error. Worst-case bound. Theorem 4.3 proves there exists a problem of this form requiring samples; it does not say every problem in this regime is this hard. Real datasets may sit inside the worst case, but no rotationally invariant algorithm can escape the linear-in- scaling across the full distribution class. Sample complexity grows linearly in the total dimension. The gap between and is exponential, and Ng’s proof shows it is structural rather than a tuning artefact.
Schematic redraw of Ng (2004) Figure 1a. With training examples and only one of features genuinely relevant (, all other , inputs multivariate normal), L1-regularised logistic regression (solid) stays near zero error as the total feature count grows toward 1000. L2-regularised logistic regression (dashed), like every other rotationally invariant learner, collapses toward chance accuracy (0.5) as the feature count grows. The curves are stylised; the direction and magnitude of the gap are Ng’s.
Intuitively a rotationally invariant algorithm cannot tell which input axis is which. To ignore noise features it must first work out which combinations of inputs correspond to the original columns, then select among them. That orientation-finding step is what the bound charges for. Trees skip the bill, since the column index gives them the orientation directly from the schema. MLPs pay it in full. An MLP trained by backprop with spherically symmetric weight init (Xavier, He, all qualify), rotation-symmetric loss, and L2 weight decay satisfies the lower bound’s preconditions.
Rotation is symmetric for the MLP and asymmetric for the tree. The 0° axis-aligned boundary on the left is a single split for the tree (solid black) and a smooth two-unit boundary for the MLP (dashed). Rotate the same data 45° (right) and the tree has to approximate the diagonal with a staircase of axis-aligned splits whose depth grows with the desired resolution; the MLP boundary is unchanged. Tabular features tend to come pre-aligned to semantic axes, so the left panel is the regime trees are built for and the right panel is the one MLPs are indifferent to.
Grinsztajn et al. (Grinsztajn et al., 2022) confirm the asymmetry empirically. On their 45-dataset benchmark they apply random orthogonal rotations to the feature space and retrain everything: under rotation, “NNs are now above tree-based models and ResNets above FT-Transformers.” Exactly what Ng’s bound predicts. The reason it matters in practice is that real tabular columns like age, income, or has_prior_claim each carry standalone semantic meaning, and rotating them away produces combinations like that have no schema-level interpretation at all. Trees exploit the schema; MLPs cannot.
Spectral bias
Rotational-invariance failure was about how vast the hypothesis space the MLP search is, spectral bias is about the order in which different components of the target get learned by the network. Gradient descent on an MLP fits the smooth components of a target first, and the sharp components much later or not at all. Tabular targets are piecewise-constant in the column basis which makes the targets in tabular prediction. heavily biased towards high frequency components
To talk about an MLP’s spectral bias on a tabular target, we take the Fourier Transform:
Low frequencies describe gradual change across a feature and high frequencies describe sharp transitions and oscillation. The cliffs tabular targets actually have (a step at age > 65, a flip at has_prior_claim ∈ {0,1}) sit at the high end of the spectrum.
(Rahaman et al., 2019) makes a precise statement for how quickly MLPs learn such cliffs. For a ReLU MLP under gradient flow on the squared loss, the residual amplitude at frequency contracts at rate with : low-frequency components collapse fast, high-frequency components take much longer to resolve! The formal statement is for ReLU networks under continuous gradient flow on full-batch squared loss with synthetic sinusoidal targets (Rahaman et al., 2019). SGD with batch noise is not formally covered, though the the paper provides experiments on MNIST and CIFAR-10 which are consistent with formal bound. Intuitively, a low-frequency target gives every parameter a coherent direction to step in, so gradients add up across the input; a fast-oscillating target pulls neighbouring inputs the opposite way per parameter, so gradients cancel. This work has wonderful connections to the Neural Tangent Kernel (NTK), which is a whole topic onto itself and is unfortunately out of scope for us here. The NTK is (Jacot et al., 2018). In the wide-network limit, gradient flow on the squared loss is equivalent to kernel regression with , so each eigendirection of is fit at a rate proportional to its eigenvalue. That is what gives Rahaman’s bound.
Basri et al. (Basri et al., 2020) extends the picture for non-uniform input. To fit a sinusoid of frequency on a depth-2 ReLU network in the NTK regime, the iteration count is , where is the minimum density of the input distribution over the region the decision boundary cuts through. The bound is for full-batch GD; the higher-dimensional generalisation is conjectured with empirical support rather than rigorously proven (Basri et al., 2020). Tabular columns are rarely uniform, and the cliffs that matter (if age > 65) may sit in the sparse tails where shrinks and the bound explodes.
What this means for tabular data: Real tabular targets are often piecewise-constant in the columns:
- credit cliffs at score thresholds,
- tier boundaries at price tiers,
- age cutoffs,
- threshold-coded categoricals like
has_prior_claim.
The Fourier expansion of an isolated step has slow-decaying coefficients (), so the spectral tail is always there. Trees match this natively, because every split is a piecewise-constant decision. MLPs land smooth approximations and pay the cost in proportion to how sharp each cliff is.
A piecewise-constant target on a single tabular feature (age, four schematic levels at the cutoffs typical of cliff-coded categoricals). The tree fit traces the cliffs; the MLP fit smooths them into ramps and misses each one in proportion to its sharpness (given limited training data). The amplitude spectrum on the right: the target’s Fourier coefficients decay slowly (the long tail of many step functions), the tree spectrum can matches the tail, and the MLP spectrum drops off at higher frequencies as those are not learned (yet).
It’s easy to find approaches in literature which implicitly try to tackle this spectral bias. Tancik et al. (Tancik et al., 2020) inject random Fourier features in front of a coordinate MLP, shows lift on image regression (not tabular but similar problems with cliffs). Gorishniy et al. (Gorishniy et al., 2022) adapt the same idea to tabular by replacing scalar numerical inputs with periodic or piecewise-linear embeddings, shows lift on 11 tabular benchmarks Gorishniy et al. explicitly do not attribute the mechanism to spectral bias. Beyazit et al. (Beyazit et al., 2023) take the opposite direction, smoothing the target representation rather than enriching the input, and report gains across 14 benchmarks. Both directions of spectral remediation help, which speaks to the bias we discussed above.
Irrelevant features
Axis-aligned methods pay nothing to ignore a noise column, because the column is never chosen as a split. Smooth methods pay for every weight they cannot drive cleanly to zero, and the cost compounds across columns.
Trees do not have to learn to ignore a noise column. The split-finding routine walks each candidate column at each node and picks the one with highest information gain; a column carrying only noise scores zero on that criterion and is never entered. An MLP cannot make that choice. It has a weight on every input dimension and updates each one with the empirical gradient, and finite-sample noise gives even an irrelevant column a non-zero gradient on . L2 weight decay pulls the residual toward zero, but the L2 gradient vanishes at , so the optimiser settles at a small-but-nonzero value. This is the same axis-asymmetry that drove §2.1’s rotation argument, surfacing here as an optimisation-time consequence. At the optimum, the L2 penalty gradient and the empirical-loss gradient must cancel: , so , non-zero whenever the noise gradient is. L1’s penalty has subdifferential at zero, which can cancel any noise gradient of magnitude at most while holding the weight exactly at zero.
Grinsztajn et al. (Grinsztajn et al., 2022) add Gaussian-noise columns to their 45-dataset benchmark and the MLP-tree gap widens; remove the least-important columns and the gap narrows. The degradation is worse for “smoother” methods: MLP is the works, then ResNet, then FT-Transformer. Cherepanova et al. (Cherepanova et al., 2023) provides a dedicated benchmark on 12 datasets and three corruption modes (Gaussian-noise columns, partial-information corruption , and pairwise-product distractors) at 50% and 75% extraneous-feature ratios. The 12 datasets are a smaller sample than Grinsztajn’s 45, and the noise-feature construction is synthetic. Real enterprise tables more often contain informative-but-weak columns rather than pure noise, which is a different stress test (Cherepanova et al., 2023). MLPs degrade markedly under all three. Cherepanova et al. report that “the FT-Transformer model is roughly as robust to noisy features as the XGBoost model.” The failure is not a uniform deep-tabular problem; it is the MLP’s smooth-function prior, and attention is enough to fix most of it.
Schematic test-set degradation as the share of irrelevant-feature contamination grows, summarising the empirical pattern in Grinsztajn et al. (45 datasets, uninformative-features experiment) and Cherepanova et al. (12 datasets, three corruption modes). XGBoost and FT-Transformer stay close together; the MLP drops sharply with each added column. The -axis is qualitative.
The irrelevant-features critique is not a critique of all deep tabular methods. It is a critique of the smooth-function prior for tabular data. Architectures that softly bypass the prior close most of this gap: attention’s column-level softmax acts as an implicit per-instance feature selector, For each token, attention computes a softmax-weighted average over other tokens (in tabular: other columns), so for any given row the layer can sharply suppress an irrelevant column’s contribution while keeping informative ones routed through. The selection is data-dependent and column-level: a soft analog of the tree’s hard split-time choice with no axis-alignment hard-coded into the architecture. selecting which columns weigh on each prediction without committing to a fixed subset. Moreover, explicit-structure attempts bake axis-alignment into a differentiable model (Net-DNF (Katzir et al., 2021), DNF-Net (Abutbul et al., 2021), GRANDE (Marton et al., 2024)) and do narrow the gap, but they pay for the explicit structure in optimisation difficulty. Attention closes more of the gap with less structural commitment. The pattern recurs through much of the deep tabular methodology literature.
Where to go next
Trees encode the target as a hierarchical axis-aligned piecewise-constant function over heterogeneous features, with sparse dependence on the informative subset. MLPs encode the target as a smooth function of Euclidean distance in a learned feature space. Both are honest assumptions a learner has to make, and we have shown both theoretical and empirical evidence that trees have the edge in this setting.
On images, audio, and text, the MLP prior is wrong on raw inputs and right on intermediate learned representations (given that we tweak the network to get there). CNNs and transformers exist to transform raw inputs into representations where Euclidean distance is meaningful, and once that transformation is done, the MLP head on top works well! On tables it’s not as easy — columns are already semantic (age, income, has_prior_claim), and they sit on the axes the tree prior cares about.
A broad range of literature was developed to tackle the problems discussed at length in this chapter, we can understand the sequence of deep tabular-architectures developed as a sequence of bets on which parts of the bias are the most important to resolve: bake the prior in directly, borrow attention to bypass rotation invariance, fix the input representation, fix the training discipline, replace fitting with retrieval, or pretrain the prior from data. All of these approaches cumulated in the development of tabular/relational foundation models. We will take a closer look in Chapter 3.