
 
User avatar
Cuchulainn
Topic Author
Posts: 23029
Joined: July 16th, 2004, 7:38 am

Re: If you are bored with Deep Networks

March 27th, 2018, 9:51 am

History of the tools



Logistic regression — 1958
Hidden Markov Model — 1960
Stochastic gradient descent — 1960
Support Vector Machine — 1963
k-nearest neighbors — 1967
Artificial Neural Networks — 1975
Expectation Maximization — 1977
Decision tree — 1986
Q-learning — 1989
Random forest — 1995
 
User avatar
Cuchulainn
Topic Author
Posts: 23029
Joined: July 16th, 2004, 7:38 am

Re: If you are bored with Deep Networks

April 21st, 2018, 12:57 pm

I find trying to grasp all these articles a bit painful. Lots of theory/theorems etc., but where is the example, e.g. one that you can check against?

 
User avatar
Traden4Alpha
Posts: 3300
Joined: September 20th, 2002, 8:30 pm

Re: If you are bored with Deep Networks

April 21st, 2018, 4:37 pm

I find trying to grasp all these articles a bit painful. Lots of theory/theorems etc., but where is the example, e.g. one that you can check against?

This does not sound right to me. My (limited) experience of math has been the opposite. Examples are mostly for applied math.

I wonder what percentage of academic math papers have examples?
 
User avatar
Cuchulainn
Topic Author
Posts: 23029
Joined: July 16th, 2004, 7:38 am

Re: If you are bored with Deep Networks

April 21st, 2018, 5:07 pm

Maybe this example helps (sections 3 and 5):
https://en.wikipedia.org/wiki/Metric_space

'Concrete' can be at many 'levels', e.g. objects and types.
Just think about metrics for NNs, input representation, and the mappings (hopefully bijective) between metric spaces, etc.
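
To make 'concrete' concrete at the lowest level, a minimal numpy sketch (vectors invented for illustration) showing that two standard metrics on input representations already disagree about what 'close' means:

Code:

import numpy as np

# Two input representations of nearby data points; values are invented.
x = np.array([1.0, 0.0, 0.0])
y = np.array([0.9, 0.1, 0.0])

# Euclidean (l2) metric on the raw representation.
d_l2 = np.linalg.norm(x - y)

# Cosine distance: only a semi-metric (it vanishes on scaled copies,
# and the triangle inequality can fail), yet common for embeddings.
d_cos = 1.0 - x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

print(d_l2, d_cos)  # ~0.141 vs ~0.006: different notions of 'close'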
 
User avatar
Cuchulainn
Topic Author
Posts: 23029
Joined: July 16th, 2004, 7:38 am

Re: If you are bored with Deep Networks

June 6th, 2018, 12:20 pm

Some open research issues (??):
https://pdfs.semanticscholar.org/a2cf/2 ... 9e43dc.pdf

and some guidelines ..

https://www.ucl.ac.uk/~ucfamus/papers/i ... ions17.pdf

It would be useful for the august panel to give their feedback.
 
User avatar
Cuchulainn
Topic Author
Posts: 23029
Joined: July 16th, 2004, 7:38 am

Re: If you are bored with Deep Networks

June 7th, 2018, 4:35 pm

The 1st article is very old (from 2007). A lot of the stuff in it is out of date or applies to things people are no longer really interested in.

The 2nd article (guidelines) is kind of "Captain Obvious speaking" paper. The author's right about everything (or maybe almost everything, but I didn't have the time to go nitpicking), but it's rather well-known stuff.
Fair enough. So things have progressed in leaps and bounds in the period 2007-2017?
2007 is not very old. BP is at least 50 years old. 
 
User avatar
Cuchulainn
Topic Author
Posts: 23029
Joined: July 16th, 2004, 7:38 am

Re: If you are bored with Deep Networks

June 7th, 2018, 7:08 pm

The recent spate of articles is great on description and narrative (the what) but falls short on explanation (the how), as Wittgenstein might say.

Reverse engineering is well-nigh impossible. It's all black box anno 2007, or has that been resolved, ISayMoo?

Does university education not teach how to write unambiguous algorithms?
 
User avatar
Cuchulainn
Topic Author
Posts: 23029
Joined: July 16th, 2004, 7:38 am

Re: If you are bored with Deep Networks

June 7th, 2018, 8:13 pm

The 1st article is very old (from 2007). A lot of the stuff in it is out of date or applies to things people are no longer really interested in.

The 2nd article (guidelines) is kind of "Captain Obvious speaking" paper. The author's right about everything (or maybe almost everything, but I didn't have the time to go nitpicking), but it's rather well-known stuff.
Fair enough. So things have progressed in leaps and bounds in the period 2007-2017?
2007 is not very old. BP is at least 50 years old. 
Yes, things have progressed quite a bit since 2007.
You're beginning to sound like a politician. And I am not joking.
 
User avatar
Cuchulainn
Topic Author
Posts: 23029
Joined: July 16th, 2004, 7:38 am

Re: If you are bored with Deep Networks

June 8th, 2018, 12:20 pm

Thank you. Clear.
 
User avatar
katastrofa
Posts: 7949
Joined: August 16th, 2007, 5:36 am
Location: Event Horizon

Re: If you are bored with Deep Networks

June 9th, 2018, 11:08 am

You should like this paper: Do CIFAR-10 Classifiers Generalize to CIFAR-10?

(tl;dr: Deep Learning image classifiers break under even a minor shift in the data distribution.)
Could it be overcome by introducing a translation vector as an extra training parameter?
 
User avatar
outrun
Posts: 4573
Joined: January 1st, 1970, 12:00 am

Re: If you are bored with Deep Networks

June 9th, 2018, 3:19 pm

You should like this paper: Do CIFAR-10 Classifiers Generalize to CIFAR-10?

(tl;dr: Deep Learning image classifiers break under even a minor shift in the data distribution.)
Could it be overcome by introducing a translation vector as an extra training parameter?
That's a popular technique called "data augmentation" (a minimal sketch at the end of this post).

In general (no matter what technique you use) there is of course always an issue with image classification: you have a relatively small set of samples in a very high dimension. How do you generalize?

My intuition is that techniques borrowed from random forests and support vector machines (and many GANs) will improve the robustness of generalization.

I've also read that the new test samples generated in that paper are easy for a human to classify as different from the original CIFAR-10 training samples. If so, then what does that say? IMO it's actually good that the models perform worse on the new samples - it shows the original samples weren't representative of the task you want the model to perform.

It's still a good point that the paper is making, I like it.
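
A minimal sketch of the augmentation idea in numpy (the function name augment_shift is mine; real pipelines usually pad and crop rather than wrap around):

Code:

import numpy as np

def augment_shift(image, max_shift=2, rng=None):
    # Translate the image by a random integer vector; the label is
    # assumed invariant under small shifts, so each shifted copy is
    # an extra labelled training sample for free.
    rng = np.random.default_rng() if rng is None else rng
    dx, dy = rng.integers(-max_shift, max_shift + 1, size=2)
    return np.roll(np.roll(image, dx, axis=0), dy, axis=1)

# Hypothetical usage inside a training loop: every epoch sees freshly
# shifted copies of the same minibatch.
rng = np.random.default_rng(0)
batch = np.zeros((4, 32, 32))  # stand-in for a CIFAR-10-sized minibatch
augmented = np.stack([augment_shift(img, rng=rng) for img in batch])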
 
User avatar
katastrofa
Posts: 7949
Joined: August 16th, 2007, 5:36 am
Location: Event Horizon

Re: If you are bored with Deep Networks

June 9th, 2018, 4:15 pm

More precisely: at the training stage, I would draw a minibatch of pictures and transform each of them by translating it by a vector in the plane and rotating it. I would calculate the loss as a function of the three transformation parameters (two translations, one rotation angle) and look for its minimum. It cannot be just any minimum, though - it needs to be a smooth bump rather than a sharp peak. The procedure sounds cumbersome, but there are many statistical methods which would accelerate it. There's a risk that the network would start to confuse objects which, e.g., look similar to other objects when rotated, but there's a chance that the network would reach for finer differences to successfully train itself.

Apart from translation and rotation, one could consider other transformations, such as asymmetric scaling to manipulate the perspective. This would be much more challenging, but could potentially teach the network to recognise the same object seen at different angles.

Anyway, I would try this out myself but I don't have sufficient computational power :-( (If I had, I would use it for something more interesting anyway :-P)
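
A rough sketch of the grid-search version of this idea, assuming some per-example loss_fn from the network (all names here are invented, and the smoothness test on the minimum is left out):

Code:

import numpy as np
from scipy.ndimage import rotate, shift

def best_transform_loss(image, loss_fn, shifts=(-2, 0, 2), angles=(-10, 0, 10)):
    # Coarse grid search over the three transformation parameters
    # (two translations, one rotation angle); return the smallest
    # loss found and the transform that achieved it.
    best_loss, best_params = np.inf, None
    for dx in shifts:
        for dy in shifts:
            for angle in angles:
                t = rotate(shift(image, (dx, dy)), angle, reshape=False)
                l = loss_fn(t)
                if l < best_loss:
                    best_loss, best_params = l, (dx, dy, angle)
    return best_loss, best_params

# Hypothetical usage with a dummy image and a dummy per-example loss.
img = np.random.default_rng(0).random((32, 32))
loss, params = best_transform_loss(img, loss_fn=lambda im: im.var())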
Last edited by katastrofa on June 9th, 2018, 4:49 pm, edited 1 time in total.
 
User avatar
Cuchulainn
Topic Author
Posts: 23029
Joined: July 16th, 2004, 7:38 am

Re: If you are bored with Deep Networks

June 10th, 2018, 9:41 am

You should like this paper: Do CIFAR-10 Classifiers Generalize to CIFAR-10?

(tl;dr: Deep Learning image classifiers break under even a minor shift in the data distribution.)
Could it be overcome by introducing a translation vector as an extra training parameter?
This probably means that the problem is "ill-posed" in some way. The case of adversarial training is probably a good example. So, adding a very small increment to the image results in it being classified as a gibbon instead of a panda. In numerical analysis this means that the maths that processes the image is not stable. Maybe it's different with ML, but there is something not kosher. Correct me if I am wrong (gut feeling says the metric used to compute distance is not robust enough).
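
For what it's worth, a toy numpy version of the panda/gibbon instability on a plain linear classifier (everything here is invented; the point is the ill-conditioning, not the model):

Code:

import numpy as np

rng = np.random.default_rng(1)
n = 10_000                       # number of 'pixels'
w = rng.normal(size=n)           # classifier: class = sign(w . x)
x = rng.normal(size=n)

# Smallest uniform per-pixel step (max norm) that flips the sign,
# plus a 10% margin -- an FGSM-style move along sign(w).
eps = 1.1 * abs(w @ x) / np.abs(w).sum()
x_adv = x - eps * np.sign(w @ x) * np.sign(w)

print(eps)                                  # ~0.01 per pixel: invisible
print(np.sign(w @ x), np.sign(w @ x_adv))   # the class flips

The flip costs only O(1/sqrt(n)) per pixel, i.e. it gets cheaper as the dimension grows, which is one way to read "ill-posed".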

On {over,under}fitting, it seems they are introduced in the same breath as global polynomials, scary beasts at the best of times. The fix seems to be to use regularisation (aka penalty, Lagrange multipliers), which is a Pandora's box of optimisation. It can open a new can of worms, e.g. which one to use?

In the book by Géron (page 123) he talks about a 300-degree polynomial regression model. Is that serious? No one in numerical analysis does this.
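
A quick numpy sanity check (degree 19 rather than 300, which is numerically hopeless in double precision): exact-fit polynomial regression on noisy data produces wild coefficients, and a ridge penalty, one pick from that Pandora's box, tames them:

Code:

import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-1, 1, 20)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=x.size)

V = np.vander(x, 20)   # degree-19 design matrix: one coefficient per point

# Exact interpolation of the noise: huge, oscillating coefficients.
c_fit = np.linalg.solve(V, y)

# Ridge (L2/Tikhonov) regularisation: penalise lam * ||c||^2.
lam = 1e-3
c_ridge = np.linalg.solve(V.T @ V + lam * np.eye(20), V.T @ y)

print(np.abs(c_fit).max(), np.abs(c_ridge).max())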

// finding the 'closeness' of 2 data distributions? Which (semi-)metric is used??
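
On that last question, a sketch of two common answers (samples invented): the Wasserstein-1 distance, a genuine metric, and KL divergence, which is only a 'semi'-metric (asymmetric, no triangle inequality):

Code:

import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(3)
a = rng.normal(0.0, 1.0, size=5_000)
b = rng.normal(0.3, 1.0, size=5_000)    # same shape, shifted by 0.3

# Wasserstein-1 (earth mover's) distance, straight from the samples.
print(wasserstein_distance(a, b))        # ~0.3, the size of the shift

# KL divergence on binned histograms; it blows up where the supports
# stop overlapping, one reason it is less robust as a 'distance'.
bins = np.linspace(-5, 5, 51)
p, _ = np.histogram(a, bins=bins, density=True)
q, _ = np.histogram(b, bins=bins, density=True)
ok = (p > 0) & (q > 0)
print(np.sum(p[ok] * np.log(p[ok] / q[ok])) * np.diff(bins)[0])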
 
User avatar
outrun
Posts: 4573
Joined: January 1st, 1970, 12:00 am

Re: If you are bored with Deep Networks

June 10th, 2018, 3:15 pm

Data augmentation is adding a priori knowledge by extending your dataset. If you have a labelled picture of a cat then you know that you can shift it one pixel to the right and have another cat image (but if it was a picture of a QR code pointing to a Spanish domain then not!). Giving more samples created by leveraging known invariants during training helps generalization: it teaches the model about those invariants through samples.

If I trained some model to distinguish cats from dogs based on a small set of pictures, and then later revealed that it was actually about black vs brown objects instead of cats vs dogs, there is no way the model could have learned that from the labelled training data. It could have learned either interpretation of the objective; both would fit the training data. Even worse, the cat pictures might have been taken with a flash and it might have learned to detect flashes. The model is only as good as the data you provide, and it needs a lot of data (or a priori knowledge) to be able to correctly attribute labels to features and generalize well. These are general statistical/information problems, not specific to deep learning.

There are of course other ways to add a priori knowledge than creating extra samples. E.g. CNNs assume that the pixels have a 2D layout and that you should use hierarchical features; that's also a form of adding a priori knowledge to your model. In the 90s there were some experiments with rotation, scale and shift invariances (a quick check of the built-in one below).
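
The CNN form of a priori knowledge in one identity: convolution commutes with translation, so the shifted cat comes for free. A minimal numpy/scipy check (random image and filter, circular padding so the identity is exact):

Code:

import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(4)
image = rng.random((16, 16))
kernel = rng.random((3, 3))            # stand-in for a learned filter

shifted = np.roll(image, 1, axis=1)    # the 'one pixel to the right' case

f1 = convolve2d(image, kernel, mode='same', boundary='wrap')
f2 = convolve2d(shifted, kernel, mode='same', boundary='wrap')

# Shift-then-convolve equals convolve-then-shift: the feature map
# moves with the cat, which is exactly the built-in invariant.
print(np.allclose(np.roll(f1, 1, axis=1), f2))   # True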
 
User avatar
Cuchulainn
Topic Author
Posts: 23029
Joined: July 16th, 2004, 7:38 am

Re: If you are bored with Deep Networks

June 10th, 2018, 6:48 pm

Adversarial examples exist also for "old school" classifiers such as SVMs. It's not a problem with deep learning only.
Those methods seem to have some commonality.