Cuchulainn
Topic Author
Posts: 22926
Joined: July 16th, 2004, 7:38 am

Re: If you are bored with Deep Networks

November 17th, 2017, 10:16 am

I read the 3rd article + reviews.
In general, it feels like "sometimes it works and sometimes not, but we don't know why" (bad local minima? intrinsic characteristics? etc.). The figures are unreadable.
Stupid question: they use ReLU as the rectifier, but AFAIK it is not very smooth, and GD needs smooth functions. It's like making a silk purse from a pig's ear?

To T4A:
This article was written with no concern for the issues that pernickety mathematicians worry about. Much wasted energy. This is why mathematics is needed: less trial and error.

That is not only not right; it is not even wrong!
Wolfgang Pauli
 
outrun
Posts: 4573
Joined: January 1st, 1970, 12:00 am

Re: If you are bored with Deep Networks

November 17th, 2017, 11:15 am

The nice thing about ReLU is that it has either a constant or a zero gradient. This makes gradient descent work really well. At the same time it is non-linear, which is a prerequisite.

Compare that with the gradient of tanh (or sigmoid): those have "saturation issues" with "vanishing gradients". The loss surface has many flat regions with those.

Smoothness is not very important. ReLU is used at the intermediate layers and you have lots of neurons (you can fold the pig's ear many times with little folds). The last layer typically has a smooth activation function, and the choice there depends on the type of problem and the loss function it implies. Also, the smoothness of the loss function is a function of all the (nested) activation functions in all layers. Considering that the loss is based on a finite set of samples, smoothness is typically only required up to some resolution. Two purses shaped with millimetre-sized folds look identical to the average user, who is mainly interested in the overall shape: can I hide a can of beer in it? :-)

A recent alternative to ReLU is SELU, which aims to keep the moments going into and out of a layer at the same scale (no damping or explosions).
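To make the gradient comparison concrete, here is a minimal numpy sketch (my own toy illustration, not from any of the papers; the 20-layer product is only there to show the scaling):

[code]
import numpy as np

# Gradients of ReLU and tanh on a grid of pre-activations.
z = np.linspace(-6.0, 6.0, 13)
relu_grad = (z > 0).astype(float)      # ReLU'(z): exactly 0 or 1
tanh_grad = 1.0 - np.tanh(z) ** 2      # tanh'(z): close to 0 once |z| > ~3 (saturation)

print("z        :", z)
print("ReLU grad:", relu_grad)
print("tanh grad:", np.round(tanh_grad, 4))

# Backprop multiplies one such factor per layer, so with tanh the signal
# shrinks geometrically, while an active ReLU path passes it through unchanged.
depth = 20
print("tanh, 20 layers at z=2:", (1.0 - np.tanh(2.0) ** 2) ** depth)   # ~1e-23
print("ReLU, 20 active layers:", 1.0 ** depth)                          # 1.0
[/code]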
Last edited by outrun on November 17th, 2017, 11:19 am, edited 2 times in total.
 
Cuchulainn
Topic Author
Posts: 22926
Joined: July 16th, 2004, 7:38 am

Re: If you are bored with Deep Networks

November 17th, 2017, 11:18 am

The nice thing about ReLU is that it has either a constant or a zero gradient. This makes gradient descent work really well. At the same time it is non-linear, which is a prerequisite.

Compare that with the gradient of tanh (or sigmoid): those have "saturation issues" with "vanishing gradients".

Smoothness is not very important. ReLU is used at the intermediate layers and you have lots of neurons (you can fold the pig's ear many times with little folds). The last layer typically has a smooth activation function, and the choice there depends on the type of problem and the loss function it implies. Also, the smoothness of the loss function is a function of all the (nested) activation functions in all layers. Considering that the loss is based on a finite set of samples, smoothness is typically only required up to some resolution.

A recent alternative to ReLU is SELU, which aims to keep the moments going into and out of a layer at the same scale (no damping or explosions).
Nice and clear.
 
outrun
Posts: 4573
Joined: January 1st, 1970, 12:00 am

Re: If you are bored with Deep Networks

November 17th, 2017, 11:27 am

The nice thing about ReLU is that it has either a constant or a zero gradient. This makes gradient descent work really well. At the same time it is non-linear, which is a prerequisite.

Compare that with the gradient of tanh (or sigmoid): those have "saturation issues" with "vanishing gradients".

Smoothness is not very important. ReLU is used at the intermediate layers and you have lots of neurons (you can fold the pig's ear many times with little folds). The last layer typically has a smooth activation function, and the choice there depends on the type of problem and the loss function it implies. Also, the smoothness of the loss function is a function of all the (nested) activation functions in all layers. Considering that the loss is based on a finite set of samples, smoothness is typically only required up to some resolution.

A recent alternative to ReLU is SELU, which aims to keep the moments going into and out of a layer at the same scale (no damping or explosions).
Nice and clear.
Thanks
 
Traden4Alpha
Posts: 3300
Joined: September 20th, 2002, 8:30 pm

Re: If you are bored with Deep Networks

November 17th, 2017, 2:16 pm

I read the 3rd article + reviews.
In general, it feels like "sometimes it works and sometimes not, but we don't know why" (bad local minima? intrinsic characteristics? etc.). The figures are unreadable.
Stupid question: they use ReLU as the rectifier, but AFAIK it is not very smooth, and GD needs smooth functions. It's like making a silk purse from a pig's ear?

To T4A:
This article was written with no concern for the issues that pernickety mathematicians worry about. Much wasted energy. This is why mathematics is needed: less trial and error.

That is not only not right; it is not even wrong!
Wolfgang Pauli
I think we can agree that trial-and-error is a very slow process, especially in extremely high-dimensional systems, and that a single elegant mathematical proof could eliminate the need for untold quadrillions of CPU-years of numerical mucking about.

Where we seem to disagree is whether such math exists. NNs are an example of emergent systems in which the output dynamics are not obvious from the input structures. There are some systems where the best model of the system is the system itself, and turning the crank is the best strategy even if it seems inefficient. Whitehead & Russell thought that everything could be neatly proved, and Gödel and Turing showed it couldn't.

BTW, trial-and-error may be less inefficient than it first appears. Darwinian evolution has found millions of impressive optima in a 4^150,000,000,000 search space of genomes, despite a trial-and-error sample size that is an infinitesimal fraction of the search space.
 
Cuchulainn
Topic Author
Posts: 22926
Joined: July 16th, 2004, 7:38 am

Re: If you are bored with Deep Networks

November 17th, 2017, 3:13 pm

I read the 3rd article + reviews.
In general, it feels like "sometimes it works and sometimes not, but we don't know why" (bad local minima? intrinsic characteristics? etc.). The figures are unreadable.
Stupid question: they use ReLU as the rectifier, but AFAIK it is not very smooth, and GD needs smooth functions. It's like making a silk purse from a pig's ear?

To T4A:
This article was written with no concern for the issues that pernickety mathematicians worry about. Much wasted energy. This is why mathematics is needed: less trial and error.

That is not only not right; it is not even wrong!
Wolfgang Pauli
I think we can agree that trial-and-error is a very slow process, especially in extremely high-dimensional systems, and that a single elegant mathematical proof could eliminate the need for untold quadrillions of CPU-years of numerical mucking about.

Where we seem to disagree is whether such math exists. NNs are an example of emergent systems in which the output dynamics are not obvious from the input structures. There are some systems where the best model of the system is the system itself, and turning the crank is the best strategy even if it seems inefficient. Whitehead & Russell thought that everything could be neatly proved, and Gödel and Turing showed it couldn't.

BTW, trial-and-error may be less inefficient than it first appears. Darwinian evolution has found millions of impressive optima in a 4^150,000,000,000 search space of genomes, despite a trial-and-error sample size that is an infinitesimal fraction of the search space.
That's too highfalutin. Most mathematicians ignore both Russell and Turing, so let's leave them in peace. It's a completely different issue here, and not relevant.

The issue is more fundamental. It's so hard to explain, as I know it will be incorrectly interpreted. Basically, we want to know the conditions under which a solution exists before proceeding. This is what we learn as undergrads, yes?

Example/Quiz
Steepest descent: what are the assumptions for it to work?
 
Traden4Alpha
Posts: 3300
Joined: September 20th, 2002, 8:30 pm

Re: If you are bored with Deep Networks

November 17th, 2017, 3:59 pm

That's too highfalutin. Most mathematicians ignore both Russell and Turing, so let's leave them in peace. It's a completely different issue here, and not relevant.
They ignore Turing and Gödel at their own peril if the system is a complex adaptive one. NAND gates are sufficient to construct a universal computer, and a neuron can implement a NAND gate.
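A toy sketch of that last claim (my own illustration, a single threshold neuron with hand-picked weights):

[code]
# One threshold neuron computing NAND; NAND gates alone suffice for universal computation.
def neuron(x1, x2, w1=-2.0, w2=-2.0, b=3.0):
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", neuron(x1, x2))   # 1, 1, 1, 0 -- the NAND truth table
[/code]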
The issue is more fundamental. It's so hard to explain, as I know it will be incorrectly interpreted. Basically, we want to know the conditions under which a solution exists before proceeding. This is what we learn as undergrads, yes?
Agreed! But let's take your logic one step further: we want to know the conditions under which a proof of the existence of a solution exists before proceeding. Clearly, there are some mathematical systems that have such proofs, and there are some (see Turing and Gödel) that don't. Why beat one's head against a mathematical wall trying to find a proof of a solution if said proof does not exist?
Example/Quiz
Steepest descent: what are the assumptions for it to work? 
I was going to say it's the existence of at least one global minimum and a path to said minimum. But then I realized that is wrong, because these systems are not trying to find the perfect answer (the true global minimum) but an acceptable answer (any local minimum that does not suck).

Thus we need to understand the conditions under which these systems become irreversibly trapped in bad local minima.
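For reference, the textbook sufficient conditions behind my "path to the minimum" intuition (a sketch of the smooth setting, which NN losses generally are not): gradient descent [$]x_{k+1} = x_k - \eta \nabla f(x_k)[$] drives [$]\|\nabla f(x_k)\|[$] to zero provided [$]f[$] is bounded below, [$]\nabla f[$] is Lipschitz with constant [$]L[$], and the step size satisfies [$]0 < \eta \le 1/L[$]; add convexity and you also get [$]f(x_k) - f^* = O(1/k)[$] to the global minimum. None of that tells you whether the stationary point a NN lands in is a local minimum that "does not suck".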
 
Cuchulainn
Topic Author
Posts: 22926
Joined: July 16th, 2004, 7:38 am

Re: If you are bored with Deep Networks

November 17th, 2017, 4:18 pm

Let's try another example: exponentially fitted methods for the Black-Scholes PDE and other linear convection-diffusion-reaction PDEs are stable for any values of drift and diffusion. Standard FDM fails when convection dominance kicks in. We can prove this without having to write a single line of code.
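To be concrete, a sketch of the fitting factor in the simplest 1d constant-coefficient case [$]\sigma u'' + \mu u' = f[$] (the Black-Scholes version applies this per mesh point): on a mesh of width [$]h[$], replace the diffusion coefficient [$]\sigma[$] in the centred scheme by [$]\rho = \frac{\mu h}{2}\coth\left(\frac{\mu h}{2\sigma}\right)[$]. The resulting scheme is free of spurious oscillations for all [$]\sigma > 0[$] and [$]\mu[$], and [$]\rho \to \sigma[$] as [$]h \to 0[$], so nothing is lost when diffusion dominates.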

Now, has AI classified input types/categories so that, having done that, you know what to expect and which methods work? The bespoke arXiv papers seem to suggest not. DL is only a few years old, so miracles take longer. Be careful with hype.

BTW here is a great example of constructivist mathematics

https://en.wikipedia.org/wiki/Banach_fi ... nt_theorem

Maybe this is an eye-opener
https://en.wikipedia.org/wiki/Construct ... thematics)

Why beat one's head against a mathematical wall trying to find a proof of a solution if said proof does not exist?
It's the future of humanity that's at stake here, Jim: Newton did it, Stokes did it, so AI should do it.
 
katastrofa
Posts: 7929
Joined: August 16th, 2007, 5:36 am
Location: Event Horizon

Re: If you are bored with Deep Networks

November 17th, 2017, 5:12 pm

Dr Phil would say "all behavior, no matter how strange, becomes rational once you know the value system that drives someone"
Phil is so nineties.
What if one's value system is irrational? People can be complex and have their transcendental side, you know... Or what would Dr Phil say about my value system, which goes [$]2^e[$], [$]e^\pi[$], [$]\sqrt{\pi}^\sqrt{2}[$], ...?
 
Cuchulainn
Topic Author
Posts: 22926
Joined: July 16th, 2004, 7:38 am

Re: If you are bored with Deep Networks

November 17th, 2017, 5:16 pm

Dr Phil would say "all behavior, no matter how strange, becomes rational once you know the value system that drives someone"
Phil is so nineties.
What if one's value system is irrational? People can be complex and have their transcendental side, you know... Or what would Dr Phil say about my value system, which goes [$]2^e[$], [$]e^\pi[$], [$]\sqrt{\pi}^\sqrt{2}[$], ...?
We have a lot in common. I like [$]e^5[$], Collector is partial to [$]\pi^5[$].
 
outrun
Posts: 4573
Joined: January 1st, 1970, 12:00 am

Re: If you are bored with Deep Networks

November 17th, 2017, 5:31 pm

I read the 3rd article + reviews.
In general, it feels like "sometimes it works and sometimes not, but we don't know why" (bad local minima? intrinsic characteristics? etc.). The figures are unreadable.
Stupid question: they use ReLU as the rectifier, but AFAIK it is not very smooth, and GD needs smooth functions. It's like making a silk purse from a pig's ear?

To T4A:
This article was written with no concern for the issues that pernickety mathematicians worry about. Much wasted energy. This is why mathematics is needed: less trial and error.

That is not only not right; it is not even wrong!
Wolfgang Pauli
I think we can agree that trial-and-error is a very slow process, especially in extremely high-dimensional systems, and that a single elegant mathematical proof could eliminate the need for untold quadrillions of CPU-years of numerical mucking about.

Where we seem to disagree is whether such math exists. NNs are an example of emergent systems in which the output dynamics are not obvious from the input structures. There are some systems where the best model of the system is the system itself, and turning the crank is the best strategy even if it seems inefficient. Whitehead & Russell thought that everything could be neatly proved, and Gödel and Turing showed it couldn't.

BTW, trial-and-error may be less inefficient than it first appears. Darwinian evolution has found millions of impressive optima in a 4^150,000,000,000 search space of genomes, despite a trial-and-error sample size that is an infinitesimal fraction of the search space.
That's too highfalutin. Most mathematicians ignore both Russell and Turing, so let's leave them in peace. It's a completely different issue here, and not relevant.

The issue is more fundamental. It's so hard to explain, as I know it will be incorrectly interpreted. Basically, we want to know the conditions under which a solution exists before proceeding. This is what we learn as undergrads, yes?

Example/Quiz
Steepest descent: what are the assumptions for it to work?
The loss of a NN has no "solution", and you don't need one exactly either. You can tune a NN to give a low loss, which could translate to high accuracy in finding tumors in medical images.

Quiz: what things does the loss depend on? That's the first thing you need to know. When a tumor-detecting NN has a loss of 0.03, what can that number be/mean, what ingredients does it have, and how would you compute it?
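To make that concrete, a minimal sketch assuming the usual binary cross-entropy (the numbers are made up, and a real tumor network may well use a different loss):

[code]
import numpy as np

# Binary cross-entropy for a toy batch of 5 images.
y_true = np.array([1, 0, 0, 1, 0])                  # labels: tumor / no tumor
y_prob = np.array([0.98, 0.02, 0.05, 0.96, 0.01])   # network's predicted probabilities

bce = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
print(bce)   # ~0.03: an average over a finite sample, so the number depends on the
             # data, the labels, the network weights AND the choice of loss function
[/code]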
 
Traden4Alpha
Posts: 3300
Joined: September 20th, 2002, 8:30 pm

Re: If you are bored with Deep Networks

November 17th, 2017, 5:36 pm

Let's try another example: exponentially fitted methods for the Black-Scholes PDE and other linear convection-diffusion-reaction PDEs are stable for any values of drift and diffusion. Standard FDM fails when convection dominance kicks in. We can prove this without having to write a single line of code.

Now, has AI classified input types/categories so that, having done that, you know what to expect and which methods work? The bespoke arXiv papers seem to suggest not. DL is only a few years old, so miracles take longer. Be careful with hype.

BTW here is a great example of constructivist mathematics

https://en.wikipedia.org/wiki/Banach_fi ... nt_theorem

Maybe this is an eye-opener
https://en.wikipedia.org/wiki/Construct ... thematics)

Why beat one's head against a mathematical wall trying to find a proof of a solution if said proof does not exist?
It's the future of humanity that's at stake here, Jim: Newton did it, Stokes did it, so AI should do it.
We are 100% agreed on the value of math for proving stuff. It's the same power found in the true engineering fields -- one can design a product and know exactly how it will perform without spending a penny making anything.

We are agreed that AI does not have that. And we all agree that it would be really, really good if it did. I think the show-stopper disagreement is over whether it's possible and tractable.

Note: there's another failure mode for math that is potentially latent in this problem. Even if one finds a proof that predicts which input systems are learnable by which AI methods, there's no guarantee that the proof encodes a simple calculation of existence or robustness. In seeking a proof of the existence of a solution, we assume both that the proof exists and that the phase space of well-behaved and ill-behaved systems is simple and easily calculated. Calculating whether a given AI method will work in a given context may well be harder than just running the AI method (Wolfram's point about computational irreducibility).

As much as I truly love the deductive power of math and engineering, I also know that sometimes trial-and-error is the best and cheapest method.
 
outrun
Posts: 4573
Joined: January 1st, 1970, 12:00 am

Re: If you are bored with Deep Networks

November 17th, 2017, 5:44 pm

Let's try another example: exponentially fitted methods for the Black-Scholes PDE and other linear convection-diffusion-reaction PDEs are stable for any values of drift and diffusion. Standard FDM fails when convection dominance kicks in. We can prove this without having to write a single line of code.

Now, has AI classified input types/categories so that, having done that, you know what to expect and which methods work? The bespoke arXiv papers seem to suggest not. DL is only a few years old, so miracles take longer. Be careful with hype.

BTW here is a great example of constructivist mathematics

https://en.wikipedia.org/wiki/Banach_fi ... nt_theorem

Maybe this is an eye-opener
https://en.wikipedia.org/wiki/Construct ... thematics)

Why beat one's head against a mathematical wall trying to find a proof of a solution if said proof does not exist?
It's the future of humanity that's at stake here, Jim: Newton did it, Stokes did it, so AI should do it.
We are 100% agreed on the value of math for proving stuff. It's the same power found in the true engineering fields -- one can design a product and know exactly how it will perform without spending a penny making anything.

We are agreed that AI does not have that. And we all agree that it would be really, really good if it did. I think the show-stopper disagreement is over whether it's possible and tractable.

Note: there's another failure mode for math that is potentially latent in this problem. Even if one finds a proof that predicts which input systems are learnable by which AI methods, there's no guarantee that the proof encodes a simple calculation of existence or robustness. In seeking a proof of the existence of a solution, we assume both that the proof exists and that the phase space of well-behaved and ill-behaved systems is simple and easily calculated. Calculating whether a given AI method will work in a given context may well be harder than just running the AI method (Wolfram's point about computational irreducibility).

As much as I truly love the deductive power of math and engineering, I also know that sometimes trial-and-error is the best and cheapest method.
To add to that: finding the global minimum of even a two-layer NN is NP-complete.

... but the quantum computers are coming!

Edit: it's a good exercise to try to prove that it's NP-complete.
 
Cuchulainn
Topic Author
Posts: 22926
Joined: July 16th, 2004, 7:38 am

Re: If you are bored with Deep Networks

November 17th, 2017, 7:08 pm

... but the quantum computers are coming!
2018, you said.
 
Cuchulainn
Topic Author
Posts: 22926
Joined: July 16th, 2004, 7:38 am

Re: If you are bored with Deep Networks

November 17th, 2017, 7:10 pm

As much as I truly love the deductive power of math 
That's a cliché. Here's what Halmos says:
  • Mathematics is not a deductive science — that's a cliché. When you try to prove a theorem, you don't just list the hypotheses, and then start to reason. What you do is trial and error, experimentation, guesswork. You want to find out what the facts are, and what you do is in that respect similar to what a laboratory technician does. Possibly philosophers would look on us mathematicians the same way as we look on the technicians, if they dared.
And Hadamard says this:

The roots of creativity for Hadamard lie not in consciousness, but in the long unconscious work of incubation, and in the unconscious aesthetic selection of ideas that thereby pass into consciousness. His discussion of this process comprises a wide range of topics, including the use of mental images or symbols, visualized or auditory words, "meaningless" words, logic, and intuition. Among the important documents collected is a letter from Albert Einstein analyzing his own mechanism of thought.

A clever graduate student could teach Fourier something new, but surely no one claims that he could teach Archimedes to reason better.