
In the 1989 paper "Approximation by Superpositions of a Sigmoidal Function", Cybenko proved that neural networks are universal function approximators. I have a related question.

Assume the neural network's input and output vectors have the same dimension $n$. Consider the set of **binary-valued** functions from $\{ 0,1 \}^n$ to $\{ 0,1 \}^n$; there are $(2^n)^{(2^n)}$ such functions. The number of parameters in a (deep) neural network is **much smaller** than this. If the network has $L$ layers, each an $n \times n$ fully-connected layer, then the total number of weights is $L \cdot n^2$.
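As a sanity check on this count, a brute-force enumeration (a small Python sketch written for this question, not taken from any reference) confirms the formula $(2^n)^{(2^n)}$ for tiny $n$:

```python
from itertools import product

def count_functions(n):
    """Brute-force count of all functions from {0,1}^n to {0,1}^n:
    choose one of the 2^n output vectors for each of the 2^n inputs."""
    outputs = list(product([0, 1], repeat=n))       # the 2^n possible vectors
    tables = product(outputs, repeat=len(outputs))  # one full lookup table per function
    return sum(1 for _ in tables)

for n in (1, 2):
    assert count_functions(n) == (2 ** n) ** (2 ** n)

print(count_functions(2))   # 256 = (2^2)^(2^2)
```

(The enumeration is only feasible for $n \le 2$; already at $n = 3$ there are $8^8 \approx 1.7 \times 10^7$ functions.)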

If the number of weights is **not** allowed to grow exponentially in $n$, can a deep neural network approximate **all** binary-valued functions of size $n$?

Cybenko's proof seems to be based on the **denseness** of the space of neural-network functions. But this denseness does not seem to guarantee that a suitable network exists when the number of weights is polynomially bounded.

I have a theory. If we replace the activation function of an ANN with a polynomial, say a cubic one, then after $L$ layers the composite polynomial has degree $3^L$. In other words, the degree of the whole network grows exponentially, and so does its "complexity" as measured by the number of zero-crossings. This seems to remain true when the activation function is a sigmoid, but showing it involves calculating the "topological degree" (a.k.a. mapping degree theory), which I have not had time to do yet.
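For the polynomial case, the degree growth can be checked concretely. The sketch below is my own illustration: the cubic $p(x) = x^3 + x$ is an arbitrary choice, and the affine layers are omitted since they cannot raise the degree. It composes the cubic $L$ times and reports the resulting degree:

```python
def polymul(p, q):
    """Multiply two polynomials given as coefficient lists (lowest degree first)."""
    r = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            r[i + j] += a * b
    return r

def compose(p, q):
    """Compute p(q(x)) via Horner's rule on coefficient lists."""
    r = [0]
    for c in reversed(p):
        r = polymul(r, q)
        r[0] += c
    return r

def degree(p):
    """Degree of a coefficient list, ignoring trailing zero coefficients."""
    while len(p) > 1 and p[-1] == 0:
        p = p[:-1]
    return len(p) - 1

p = [0, 1, 0, 1]      # cubic "activation" p(x) = x + x^3
net = [0, 1]          # start from the identity x
L = 3
for _ in range(L):
    net = compose(p, net)

print(degree(net))    # 27 = 3^3
```

As expected, each composition multiplies the degree by 3, giving $3^L$ after $L$ layers.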

According to this theory, the VC dimension (roughly analogous to the number of zero-crossings) grows exponentially as we add layers to the ANN, but it cannot catch up with the *doubly* exponential growth of the number of Boolean functions. So the ANN can represent only a fraction of all possible Boolean functions, and this fraction diminishes exponentially. That's my current conjecture.

Thanks for your answer, still trying to digest it. From my naive understanding, the set of functions from D -> D is the same as D^D. For n = 2, my counting is 4^4 = 256 but your counting is 2^4 = 16. There's a huge difference. For each function, we need to define f(0), f(1), f(2), f(3), i.e. 4 numbers. Each number can be one of {0,1,2,3}. So we have 4^4 = 256 combinations. – Yan King Yin – 2018-08-09T07:00:13.333

Thanks, but I still think you counted wrong. What you have is the number of Boolean functions on {0,1}^n, which is 2^(2^n). I found this answer on the web. But now we have n such outputs!! That makes the count equal to the above answer^n = [2^(2^n)]^n = 2^(n 2^n) = (2^n)^(2^n) which agrees with my original count. – Yan King Yin – 2018-08-16T11:31:55.417
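The identity used in this comment is easy to verify mechanically for small n (a throwaway check of my own, not from the linked answer). Each of the n output bits is an independent Boolean function {0,1}^n -> {0,1}, of which there are 2^(2^n), and raising that to the n-th power matches the direct count:

```python
# Verify [2^(2^n)]^n == (2^n)^(2^n) for small n, using Python's big integers.
for n in range(1, 8):
    per_output_bit = 2 ** (2 ** n)   # Boolean functions {0,1}^n -> {0,1}
    assert per_output_bit ** n == (2 ** n) ** (2 ** n)

print("counts agree for n = 1..7")
```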

PS: the link to that answer is: https://math.stackexchange.com/questions/1895498/possible-boolean-functions-in-n-dimensional-space – Yan King Yin – 2018-08-16T11:38:23.780

Yes, but you're just asking the same question as that question, and the well-established answer is 2^(2^n) :) – Yan King Yin – 2018-08-16T11:53:07.133

The catch is here: if you have a truth table, you have to fill out EVERY row of the table to get ONE function. For each row there are 2 ways, i.e. {0,1}, to fill it. So there are 2^(2^n) different ways to fill those truth tables. – Yan King Yin – 2018-08-16T11:57:48.310

OK, I can understand your derivation now: it assumes finite-precision floating-point, and you use an argument based on counting the number of bits of information. Which is clever, but what if I assume infinite precision? My argument based on "degree theory" also suggests the number of layers would have to grow exponentially, which at least agrees with your conclusion :) – Yan King Yin – 2018-08-16T17:01:54.633
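A rough numeric version of that bit-counting argument (the layer count L, the weight precision, and the specific values below are all illustrative assumptions of mine, not from the answer being discussed) compares the information capacity of the weights with the number of target functions on a log scale:

```python
def log2_network_count(n, L, bits):
    """log2 of the number of distinct weight settings:
    each of the L * n^2 weights takes one of 2^bits values."""
    return bits * L * n * n

def log2_function_count(n):
    """log2 of (2^n)^(2^n) = n * 2^n, the number of functions
    from {0,1}^n to {0,1}^n."""
    return n * 2 ** n

# Hypothetical settings: L and the precision stay fixed while n grows,
# so the polynomial left-hand side is eventually dwarfed by n * 2^n.
n, L, bits = 20, 100, 64
print(log2_network_count(n, L, bits))  # 2560000
print(log2_function_count(n))          # 20971520
```

With these (arbitrary) settings, the weights can encode at most 2^2,560,000 distinct networks, against 2^20,971,520 target functions, so most functions are unreachable.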