Home

Revisiting the MLP prior

Taking a closer look at the vanilla MLP's prior, which turns out to be poorly suited for tabular data. Rotational invariance, spectral bias, and (non)-resilience to irrelevant-feature are the culprits.

The universal approximation puzzle

Every student of machine learning encounters the universal approximation theorem (UAT) early when studying neural networks. The version most people remember is informal: a sufficiently wide neural network can approximate any reasonable function. The formal version is sharper, but says the same thing in technical language. Formal statement. For any continuous f:KRf: K \to \mathbb{R} on a compact KRdK \subset \mathbb{R}^d and any ϵ>0\epsilon > 0, there exists a single-hidden-layer feedforward network with a non-polynomial activation that approximates ff uniformly within ϵ\epsilon on KK. Hornik, Stinchcombe, and White (Hornik et al., 1989) proved it for arbitrary squashing activations; Cybenko (Cybenko, 1989) gave the sigmoid case the same year; Leshno et al. (Leshno et al., 1993) extended it to any non-polynomial activation.

UAT speaks to expressivity: whether the function class Fθ={fθ:RdR}\mathcal{F}_\theta = \{f_\theta : \mathbb{R}^d \to \mathbb{R}\} even contains a good approximator of the target function ff^\star. It does not speak to learnability, whether stochastic gradient descent (SGD) on a finite training set {(xi,yi)}i=1n\{(x_i, y_i)\}_{i=1}^n recovers a good fθf_\theta from that class — and it does not speak to generalization, whether the recovered f^\hat{f} performs on unseen data drawn from the same distribution. Expressivity: the class contains a good approximator. Learnability: a finite-sample procedure finds it. Generalization: the found model performs on unseen data. UAT addresses expressivity only. The other two are where most of the engineering happens.

We ended the previous chapter by asking ourselves why deep tabular models didn’t overtake tree-based models by the end of the 2010s when the same thing happened in other domains like vision and text. Our goal in this chapter is to build a good understanding of why — what the MLP’s implicit prior actually looks like, and where it goes wrong on tabular structure. We will explore three mismatches between the MLP prior and tabular data: rotational invariance, spectral bias, and irrelevant-feature sensitivity. Each one is supported by both a theoretical argument and a controlled empirical test.

The MLP priors that fail on tables

Deep learning methods leverage representation learning The canonical reference on representation learning’s goals is Bengio, Courville, and Vincent’s 2013 review (Bengio et al., 2013). to learn from raw data. The representations learned are heavily influenced by the architecture of the model used, and are the principal explanation for why deep learners work so well in certain domains. Convolutions exploit translation invariance and transformers exploit positional order, both build representations on top of these core abstractions. But in tabular data there is no geometric relation between columns and there is no implicit ordering to exploit; the architecture has to commit to a prior of its own rather than borrow one from the substrate.

We will begin by inspecting the implicit prior of the MLP itself. Three aspects of that prior matter for our discussion here. The first is about the hypothesis space (what functions the model is biased toward), where an asymmetry separates axis-aligned (trees) from rotation-symmetric (MLPs) models. The second is about the optimization dynamics on that hypothesis space — which functions get learned quickly vs. slowly with SGD. The third is a corollary on the smoothness imposed by MLPs, which tends to round “cliffs” into ramps and also struggles to drive irrelevant-column weights to zero.

Rotational invariance

Primarily, we are interested in rotational invariance as a property of our learning procedure A\mathcal A We will be abusing some mathematical notation here as A\mathcal A is not deterministic, this same argument can be extended to input/output distributions to accommodate for that.. More concretely we take a training set {(xi,yi)}i=1n\{(x_i, y_i)\}_{i=1}^n on Rd\mathbb{R}^d, pre-multiply every input by an orthogonal matrix RR An orthogonal matrix RR satisfies RR=IR^\top R = I: its columns are mutually orthogonal unit vectors. Geometrically RR is a rotation, a reflection, or a composition of the two, i.e. it preserves every distance and angle in the input space. The inner-product geometry of the features is unchanged but the coordinate axes are scrambled into linear combinations of the originals. to obtain a rotated dataset {(Rxi,yi)}\{(Rx_i, y_i)\}, retrain the algorithm on the rotated data, and evaluate the new classifier at the rotated test point RxRx. If for every RR the rotated-and-retrained classifier evaluated at RxRx agrees with the original classifier evaluated at xx, the algorithm A\mathcal{A} is rotationally invariant (Ng, 2004):

A({(Rxi,yi)})(Rx)  =  A({(xi,yi)})(x)for every orthogonal R.\mathcal{A}\big(\{(Rx_i, y_i)\}\big)(Rx) \;=\; \mathcal{A}\big(\{(x_i, y_i)\}\big)(x) \quad \text{for every orthogonal } R.

To emphasize again, this is not a property of the model but instead a property of the learning algorithm, clearly f(Rx)f(x)f(Rx) \neq f(x) for a fixed trained network. If a learning procedure A\mathcal A is rotationally invariant in this sense, then rotating the input space and retraining is, from the algorithm’s perspective, the same as not rotating at all.

MLPs are rotationally invariant

A generic MLP variant with L2-weight-decay using SGD is a member of this rotationally-invariant class of learners:

(i) Symmetric initialisation. Standard initialization schemes (Xavier, He) sample weights from distributions that depend only on layer width, not on which input dimension a given weight connects to. The init has no view on which axis is which, so it commits no bias toward one rather than another.

(ii) L2 weight decay. The penalty depends on θ\theta only through its Euclidean norm, and orthogonal rotations preserve Euclidean norms:

θ2  =  Rθ2for any orthogonal R,\|\theta\|_2 \;=\; \|R\theta\|_2 \qquad \text{for any orthogonal } R,

thus, weight decay imposes the same penalty on θ\theta as on any rotated copy.

(iii) Rotation-equivariant gradients. A gradient step on rotated inputs is the rotated version of the same step on the original inputs:

θL(Rθ)  =  RθL(θ).\nabla_\theta \mathcal{L}(R\theta) \;=\; R\, \nabla_\theta \mathcal{L}(\theta).

Combined with (i) and (ii), this preserves the property along the entire training trajectory. To put it plainly, if we rotate the inputs and retrain, we recover a rotated copy of the original network with identical predictions on the rotated test point.

This is a contrast to tree-based learners. A split asks one question about one column at a time (x_5 ≤ 0.7), so individual split decisions are always axis aligned. A rotation very clearly messes up the input space and the same decision tree cannot be recovered given this new input space, as the axes are completely different!

The formal cost

Rotational invariance has a sample-complexity cost, which is nicely illustrated by Ng (2004). Ng analyses logistic regression on bounded inputs x[1,1]nx \in [-1, 1]^n. Out of nn features, only rr are genuinely informative, each carrying a weight bounded θijKNg|\theta_{i_j}| \le K_{\text{Ng}} the remaining nrn - r features are pure noise. The question Ng asks: how does sample complexity scale as the total dimension nn grows while rr stays fixed?

A clear contrast emrges when we compare L1 and L2 regularisers under rotation for the same logistic regression task. L2 keeps weights small; it is the regulariser implicit in most MLP training. L1 keeps weights small and pushes them exactly to zero where possible, so it doubles as feature selection (this is the LASSO formulation in linear regression). The resulting weight vectors are sparse and axis-aligned, the same shape of solution a tree’s splits produce one column at a time. L2 leaves every coordinate active in the prediction, much like an MLP. While both regularize the weights, their behavior differs under rotation. For any orthogonal RR:

θ2  =  Rθ2,\|\theta\|_2 \;=\; \|R\theta\|_2, θ1    Rθ1in general.\|\theta\|_1 \;\ne\; \|R\theta\|_1 \quad \text{in general}.
θ₁θ₂L1 ballcorners move off axes under rotationθ₁θ₂L2 ballrotation gives back the same ball

The L1 and L2 unit balls in 2D. The L1 ball (left) has corners at the coordinate axes; rotating it 30° (dashed) moves those corners off the axes. The L2 ball (right) is a circle, and any rotation maps it back to itself. L1 regularisation pulls solutions toward the corners of its ball, which are exactly the axis-aligned sparse weight vectors. Rotation breaks that bias for L1 but leaves L2 unchanged.

The L2 ball is round; the L1 “ball” has corners at the axes. The L2 unit ball {θ:θ21}\{\theta : \|\theta\|_2 \le 1\} is a sphere, symmetric under any rotation. The L1 unit ball is an octahedron whose corners point along the coordinate axes. L2 regularisation just says “keep θ\theta small”; L1 says “keep θ\theta small and sparse.” Pulling solutions toward the L1 corners is exactly what makes L1 select features, and what makes it depend on which axis is which. Rotation leaves L2 regularisation indifferent. Rotation moves L1’s corners off the axes and the bias toward axis-aligned sparsity disappears. Only a regulariser that is asymmetric in the input axes can select features along the axes, and rotational invariance precludes that. Ng proves both an upper and a lower bound:

L1-regularised logistic regression (Theorem 3.1) is dimension-tolerant. With rr, KNgK_{\text{Ng}}, ε\varepsilon, δ\delta fixed,

m  =  Ω ⁣((logn)poly ⁣(r,KNg,1ε,log1δ))m \;=\; \Omega\!\left( (\log n)\cdot \mathrm{poly}\!\left(r,\, K_{\text{Ng}},\, \tfrac{1}{\varepsilon},\, \log\tfrac{1}{\delta}\right) \right)

training examples suffice for excess logloss at most ε\varepsilon with probability 1δ1 - \delta. Logloss controls 0/1 misclassification too: Ng shows εl(θ)(log2)εm(θ)\varepsilon_l(\theta) \ge (\log 2) \cdot \varepsilon_m(\theta), so the same training-set bound applies to misclassification. The dependence on the total dimension nn is only logarithmic. Adding KK noise features inflates the requirement by roughly logK\log K examples, so doubling the feature count costs a constant handful of extra examples instead of doubling the requirement. A hundred-example training set can already cope with a thousand-dimensional input where only one coordinate is informative, the regime shown empirically in the figure below.

Any rotationally invariant learner (Theorem 4.3) is not. There exists at least one problem of the form y=1[x1t]y = \mathbf{1}[x_1 \ge t], where labels are a deterministic threshold on a single coordinate, on which the learner needs

m  =  Ω ⁣(nε)m \;=\; \Omega\!\left( \tfrac{n}{\varepsilon} \right)

training examples to attain ε\varepsilon misclassification error. Worst-case bound. Theorem 4.3 proves there exists a problem of this form requiring Ω(n/ε)\Omega(n/\varepsilon) samples; it does not say every problem in this regime is this hard. Real datasets may sit inside the worst case, but no rotationally invariant algorithm can escape the linear-in-nn scaling across the full distribution class. Sample complexity grows linearly in the total dimension. The gap between logn\log n and nn is exponential, and Ng’s proof shows it is structural rather than a tuning artefact.

chance (0.5)L1-LRL2-LR1101001000total feature count n (log scale)00.250.5misclassification error →L1 vs L2 sample complexity (one relevant feature, m=100)

Schematic redraw of Ng (2004) Figure 1a. With m=100m = 100 training examples and only one of nn features genuinely relevant (θ1=10\theta_1 = 10, all other θi=0\theta_i = 0, inputs multivariate normal), L1-regularised logistic regression (solid) stays near zero error as the total feature count nn grows toward 1000. L2-regularised logistic regression (dashed), like every other rotationally invariant learner, collapses toward chance accuracy (0.5) as the feature count grows. The curves are stylised; the direction and magnitude of the gap are Ng’s.

Intuitively a rotationally invariant algorithm cannot tell which input axis is which. To ignore nrn - r noise features it must first work out which combinations of inputs correspond to the original columns, then select among them. That orientation-finding step is what the Ω(n/ε)\Omega(n/\varepsilon) bound charges for. Trees skip the bill, since the column index gives them the orientation directly from the schema. MLPs pay it in full. An MLP trained by backprop with spherically symmetric weight init (Xavier, He, N(0,σ2)\mathcal{N}(0, \sigma^2) all qualify), rotation-symmetric loss, and L2 weight decay satisfies the lower bound’s preconditions.

x₁ ≤ cx₁x₂0° axis-alignedrotate 45°x₁x₂45°-rotated

Rotation is symmetric for the MLP and asymmetric for the tree. The 0° axis-aligned boundary on the left is a single split for the tree (solid black) and a smooth two-unit boundary for the MLP (dashed). Rotate the same data 45° (right) and the tree has to approximate the diagonal with a staircase of axis-aligned splits whose depth grows with the desired resolution; the MLP boundary is unchanged. Tabular features tend to come pre-aligned to semantic axes, so the left panel is the regime trees are built for and the right panel is the one MLPs are indifferent to.

Grinsztajn et al. (Grinsztajn et al., 2022) confirm the asymmetry empirically. On their 45-dataset benchmark they apply random orthogonal rotations to the feature space and retrain everything: under rotation, “NNs are now above tree-based models and ResNets above FT-Transformers.” Exactly what Ng’s bound predicts. The reason it matters in practice is that real tabular columns like age, income, or has_prior_claim each carry standalone semantic meaning, and rotating them away produces combinations like 0.7age+0.3income0.7 \cdot \text{age} + 0.3 \cdot \text{income} that have no schema-level interpretation at all. Trees exploit the schema; MLPs cannot.

Spectral bias

Rotational-invariance failure was about how vast the hypothesis space the MLP search is, spectral bias is about the order in which different components of the target get learned by the network. Gradient descent on an MLP fits the smooth components of a target first, and the sharp components much later or not at all. Tabular targets are piecewise-constant in the column basis which makes the targets in tabular prediction. heavily biased towards high frequency components

To talk about an MLP’s spectral bias on a tabular target, we take the Fourier Transform:

f(x)=kf^(k)e2πikx.f(x) = \sum_{\mathbf{k}} \hat{f}(\mathbf{k})\, e^{2\pi i \mathbf{k} \cdot x}.

Low frequencies k\mathbf{k} describe gradual change across a feature and high frequencies describe sharp transitions and oscillation. The cliffs tabular targets actually have (a step at age > 65, a flip at has_prior_claim ∈ {0,1}) sit at the high end of the spectrum.

(Rahaman et al., 2019) makes a precise statement for how quickly MLPs learn such cliffs. For a ReLU MLP under gradient flow on the squared loss, the residual amplitude at frequency k\mathbf{k} contracts at rate O(kΔ)\mathcal{O}(k^{-\Delta}) with 1Δd1 \le \Delta \le d: low-frequency components collapse fast, high-frequency components take much longer to resolve! The formal statement is for ReLU networks under continuous gradient flow on full-batch squared loss with synthetic sinusoidal targets (Rahaman et al., 2019). SGD with batch noise is not formally covered, though the the paper provides experiments on MNIST and CIFAR-10 which are consistent with formal bound. Intuitively, a low-frequency target gives every parameter a coherent direction to step in, so gradients add up across the input; a fast-oscillating target pulls neighbouring inputs the opposite way per parameter, so gradients cancel. This work has wonderful connections to the Neural Tangent Kernel (NTK), which is a whole topic onto itself and is unfortunately out of scope for us here. The NTK is Θ(x,x)=θf(x;θ0),θf(x;θ0)\Theta(x, x') = \langle \nabla_\theta f(x; \theta_0), \nabla_\theta f(x'; \theta_0) \rangle (Jacot et al., 2018). In the wide-network limit, gradient flow on the squared loss is equivalent to kernel regression with Θ\Theta, so each eigendirection of Θ\Theta is fit at a rate proportional to its eigenvalue. That is what gives Rahaman’s bound.

Basri et al. (Basri et al., 2020) extends the picture for non-uniform input. To fit a sinusoid of frequency κ\kappa on a depth-2 ReLU network in the NTK regime, the iteration count is O~(κ2/p)\tilde{O}(\kappa^2 / p_*), where pp_* is the minimum density of the input distribution over the region the decision boundary cuts through. The bound is for full-batch GD; the higher-dimensional generalisation O(kd/p)O(k^d / p_*) is conjectured with empirical support rather than rigorously proven (Basri et al., 2020). Tabular columns are rarely uniform, and the cliffs that matter (if age > 65) may sit in the sparse tails where pp_* shrinks and the bound explodes.

What this means for tabular data: Real tabular targets are often piecewise-constant in the columns:

  • credit cliffs at score thresholds,
  • tier boundaries at price tiers,
  • age cutoffs,
  • threshold-coded categoricals like has_prior_claim.

The Fourier expansion of an isolated step has slow-decaying coefficients (f^(k)1/k|\hat f(k)| \sim 1/k), so the spectral tail is always there. Trees match this natively, because every split is a piecewise-constant decision. MLPs land smooth approximations and pay the cost in proportion to how sharp each cliff is.

ageyfits in feature space1416k|f(k)|amplitude spectrumtargettree fitMLP fit

A piecewise-constant target on a single tabular feature (age, four schematic levels at the cutoffs typical of cliff-coded categoricals). The tree fit traces the cliffs; the MLP fit smooths them into ramps and misses each one in proportion to its sharpness (given limited training data). The amplitude spectrum on the right: the target’s Fourier coefficients decay slowly (the long tail of many step functions), the tree spectrum can matches the tail, and the MLP spectrum drops off at higher frequencies as those are not learned (yet).

It’s easy to find approaches in literature which implicitly try to tackle this spectral bias. Tancik et al. (Tancik et al., 2020) inject random Fourier features in front of a coordinate MLP, shows lift on image regression (not tabular but similar problems with cliffs). Gorishniy et al. (Gorishniy et al., 2022) adapt the same idea to tabular by replacing scalar numerical inputs with periodic or piecewise-linear embeddings, shows lift on 11 tabular benchmarks Gorishniy et al. explicitly do not attribute the mechanism to spectral bias. Beyazit et al. (Beyazit et al., 2023) take the opposite direction, smoothing the target representation rather than enriching the input, and report gains across 14 benchmarks. Both directions of spectral remediation help, which speaks to the bias we discussed above.

Irrelevant features

Axis-aligned methods pay nothing to ignore a noise column, because the column is never chosen as a split. Smooth methods pay for every weight they cannot drive cleanly to zero, and the cost compounds across columns.

Trees do not have to learn to ignore a noise column. The split-finding routine walks each candidate column at each node and picks the one with highest information gain; a column carrying only noise scores zero on that criterion and is never entered. An MLP cannot make that choice. It has a weight on every input dimension and updates each one with the empirical gradient, and finite-sample noise gives even an irrelevant column a non-zero gradient on θj\theta_j. L2 weight decay pulls the residual toward zero, but the L2 gradient 2λθj2\lambda \theta_j vanishes at θj=0\theta_j = 0, so the optimiser settles at a small-but-nonzero value. This is the same axis-asymmetry that drove §2.1’s rotation argument, surfacing here as an optimisation-time consequence. At the optimum, the L2 penalty gradient and the empirical-loss gradient must cancel: 2λθj+gj=02\lambda \theta_j^\star + g_j = 0, so θj=gj/(2λ)\theta_j^\star = -g_j / (2\lambda), non-zero whenever the noise gradient gjg_j is. L1’s penalty has subdifferential [λ,λ][-\lambda, \lambda] at zero, which can cancel any noise gradient of magnitude at most λ\lambda while holding the weight exactly at zero.

Grinsztajn et al. (Grinsztajn et al., 2022) add Gaussian-noise columns to their 45-dataset benchmark and the MLP-tree gap widens; remove the least-important columns and the gap narrows. The degradation is worse for “smoother” methods: MLP is the works, then ResNet, then FT-Transformer. Cherepanova et al. (Cherepanova et al., 2023) provides a dedicated benchmark on 12 datasets and three corruption modes (Gaussian-noise columns, partial-information corruption xc=0.5x+0.5εx_c = 0.5 x + 0.5 \varepsilon, and pairwise-product distractors) at 50% and 75% extraneous-feature ratios. The 12 datasets are a smaller sample than Grinsztajn’s 45, and the noise-feature construction is synthetic. Real enterprise tables more often contain informative-but-weak columns rather than pure noise, which is a different stress test (Cherepanova et al., 2023). MLPs degrade markedly under all three. Cherepanova et al. report that “the FT-Transformer model is roughly as robust to noisy features as the XGBoost model.” The failure is not a uniform deep-tabular problem; it is the MLP’s smooth-function prior, and attention is enough to fix most of it.

XGBoostFT-TransformerMLP0%25%50%75%share of irrelevant featurestest accuracy →irrelevant-features stress test

Schematic test-set degradation as the share of irrelevant-feature contamination grows, summarising the empirical pattern in Grinsztajn et al. (45 datasets, uninformative-features experiment) and Cherepanova et al. (12 datasets, three corruption modes). XGBoost and FT-Transformer stay close together; the MLP drops sharply with each added column. The yy-axis is qualitative.

The irrelevant-features critique is not a critique of all deep tabular methods. It is a critique of the smooth-function prior for tabular data. Architectures that softly bypass the prior close most of this gap: attention’s column-level softmax acts as an implicit per-instance feature selector, For each token, attention computes a softmax-weighted average over other tokens (in tabular: other columns), so for any given row the layer can sharply suppress an irrelevant column’s contribution while keeping informative ones routed through. The selection is data-dependent and column-level: a soft analog of the tree’s hard split-time choice with no axis-alignment hard-coded into the architecture. selecting which columns weigh on each prediction without committing to a fixed subset. Moreover, explicit-structure attempts bake axis-alignment into a differentiable model (Net-DNF (Katzir et al., 2021), DNF-Net (Abutbul et al., 2021), GRANDE (Marton et al., 2024)) and do narrow the gap, but they pay for the explicit structure in optimisation difficulty. Attention closes more of the gap with less structural commitment. The pattern recurs through much of the deep tabular methodology literature.

Where to go next

Trees encode the target as a hierarchical axis-aligned piecewise-constant function over heterogeneous features, with sparse dependence on the informative subset. MLPs encode the target as a smooth function of Euclidean distance in a learned feature space. Both are honest assumptions a learner has to make, and we have shown both theoretical and empirical evidence that trees have the edge in this setting.

On images, audio, and text, the MLP prior is wrong on raw inputs and right on intermediate learned representations (given that we tweak the network to get there). CNNs and transformers exist to transform raw inputs into representations where Euclidean distance is meaningful, and once that transformation is done, the MLP head on top works well! On tables it’s not as easy — columns are already semantic (age, income, has_prior_claim), and they sit on the axes the tree prior cares about.

A broad range of literature was developed to tackle the problems discussed at length in this chapter, we can understand the sequence of deep tabular-architectures developed as a sequence of bets on which parts of the bias are the most important to resolve: bake the prior in directly, borrow attention to bypass rotation invariance, fix the input representation, fix the training discipline, replace fitting with retrieval, or pretrain the prior from data. All of these approaches cumulated in the development of tabular/relational foundation models. We will take a closer look in Chapter 3.

Abutbul, A., Elidan, G., Katzir, L., & El-Yaniv, R. (2021). DNF-Net: A Neural Architecture for Tabular Data.
Basri, R., Galun, M., Geifman, A., Jacobs, D., Kasten, Y., & Kritchman, S. (2020). Frequency Bias in Neural Networks for Input of Non-Uniform Density. Proceedings of the 37th International Conference on Machine Learning (ICML). ↩₁ ↩₂
Bengio, Y., Courville, A., & Vincent, P. (2013). Representation Learning: A Review and New Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 1798–1828.
Beyazit, E., Kozaczuk, J., Li, B., Wallace, V., & Fadlallah, B. (2023). An Inductive Bias for Tabular Deep Learning. Advances in Neural Information Processing Systems 36 (NeurIPS). https://openreview.net/forum?id=XEUc1JegGt
Cherepanova, V., Levin, R., Somepalli, G., Geiping, J., Bruss, C. B., Wilson, A. G., Goldstein, T., & Goldblum, M. (2023). A Performance-Driven Benchmark for Feature Selection in Tabular Deep Learning. Advances in Neural Information Processing Systems 36 (NeurIPS) Datasets and Benchmarks Track. ↩₁ ↩₂
Cybenko, G. (1989). Approximation by Superpositions of a Sigmoidal Function. Mathematics of Control, Signals, and Systems, 2(4), 303–314. https://doi.org/10.1007/BF02551274
Gorishniy, Y., Rubachev, I., & Babenko, A. (2022). On Embeddings for Numerical Features in Tabular Deep Learning. Advances in Neural Information Processing Systems 35 (NeurIPS).
Grinsztajn, L., Oyallon, E., & Varoquaux, G. (2022). Why do tree-based models still outperform deep learning on tabular data? Advances in Neural Information Processing Systems 35 (NeurIPS) Datasets and Benchmarks Track. ↩₁ ↩₂
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer Feedforward Networks are Universal Approximators. Neural Networks, 2(5), 359–366. https://doi.org/10.1016/0893-6080(89)90020-8
Jacot, A., Gabriel, F., & Hongler, C. (2018). Neural Tangent Kernel: Convergence and Generalization in Neural Networks. Advances in Neural Information Processing Systems 31 (NeurIPS).
Katzir, L., Elidan, G., & El-Yaniv, R. (2021). Net-DNF: Effective Deep Modeling of Tabular Data. International Conference on Learning Representations (ICLR).
Leshno, M., Lin, V. Ya., Pinkus, A., & Schocken, S. (1993). Multilayer Feedforward Networks With a Nonpolynomial Activation Function Can Approximate Any Function. Neural Networks, 6(6), 861–867. https://doi.org/10.1016/S0893-6080(05)80131-5
Marton, S., Lüdtke, S., Bartelt, C., & Stuckenschmidt, H. (2024). GRANDE: Gradient-Based Decision Tree Ensembles for Tabular Data. International Conference on Learning Representations (ICLR).
Ng, A. Y. (2004). Feature Selection, L1 vs. L2 Regularization, and Rotational Invariance. Proceedings of the 21st International Conference on Machine Learning (ICML). https://doi.org/10.1145/1015330.1015435
Rahaman, N., Baratin, A., Arpit, D., Draxler, F., Lin, M., Hamprecht, F. A., Bengio, Y., & Courville, A. (2019). On the Spectral Bias of Neural Networks. Proceedings of the 36th International Conference on Machine Learning (ICML). ↩₁ ↩₂
Tancik, M., Srinivasan, P. P., Mildenhall, B., Fridovich-Keil, S., Raghavan, N., Singhal, U., Ramamoorthi, R., Barron, J. T., & Ng, R. (2020). Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains. Advances in Neural Information Processing Systems 33 (NeurIPS).

Cite this post

APA
Stojanovič, B. (2026, May 20). Revisiting the MLP prior. Blaž Stojanovič. https://blazstojanovic.com/blog/revisiting-the-mlp-prior/
BibTeX
@misc{stojanovic2026revisiting,
  author       = {Stojanovi\v{c}, Bla\v{z}},
  title        = {{Revisiting the MLP prior}},
  year         = {2026},
  month        = may,
  howpublished = {\url{https://blazstojanovic.com/blog/revisiting-the-mlp-prior/}},
  note         = {Blog post}
}

References