[$]dX(\tau) = \mu(X(\tau))\, d\tau + \sigma(X(\tau))\, dz(\tau)[$] Equation(1)

The formal solution of the above SDE is given as

[$]X(t)=X(t_0)+\int_{t_0}^t \mu(X(\tau))\, d\tau+\int_{t_0}^t \sigma(X(\tau))\, dz(\tau)[$] Equation(2)

We expand both integrands in the above equation as

[$] \mu (X(\tau))=\mu (X(t_0))+\int_{t_0}^\tau d\mu (X(s))[$] Equation(3)

and

[$] \sigma (X(\tau))=\sigma (X(t_0))+\int_{t_0}^\tau d\sigma (X(s))[$] Equation(4)

Substituting equation(3) and equation(4) into equation(2), we get

[$]X(t)=X(t_0)+\int_{t_0}^t \mu(X(t_0))\, d\tau+\int_{t_0}^t \int_{t_0}^\tau d\mu(X(s))\, d\tau[$]

[$]+\int_{t_0}^t \sigma(X(t_0))\, dz(\tau)+\int_{t_0}^t \int_{t_0}^\tau d\sigma(X(s))\, dz(\tau)[$] Equation(5)

From the Ito expansion of [$]d\mu (X(s))[$] and [$]d\sigma (X(s))[$], we know that

[$]d\mu(X(s))=\frac{\partial }{\partial X}[\mu(X(s))]\mu(X(s))\, ds+\frac{\partial }{\partial X}[\mu(X(s))]\sigma(X(s))\, dz(s)[$]

[$]+0.5 \frac{\partial ^2[\mu(X(s))]}{\partial X^2} \sigma(X(s))^2\, ds[$] Equation(6)

and

[$]d\sigma(X(s))=\frac{\partial }{\partial X}[\sigma(X(s))]\mu(X(s))\, ds+\frac{\partial }{\partial X}[\sigma(X(s))]\sigma(X(s))\, dz(s)[$]

[$]+0.5 \frac{\partial ^2[\sigma(X(s))]}{\partial X^2} \sigma(X(s))^2\, ds[$] Equation(7)

We substitute equation(6) and equation(7) into equation(5):

[$]X(t)=X(t_0)+\int_{t_0}^t \mu(X(t_0))\, d\tau[$]

[$]+\int_{t_0}^t \int_{t_0}^\tau \frac{\partial }{\partial X}[\mu(X(s))]\mu(X(s))\, ds\, d\tau+\int_{t_0}^t \int_{t_0}^\tau \frac{\partial }{\partial X}[\mu(X(s))]\sigma(X(s))\, dz(s)\, d\tau[$]

[$]+\int_{t_0}^t \int_{t_0}^\tau 0.5 \frac{\partial ^2[\mu(X(s))]}{\partial X^2} \sigma(X(s))^2\, ds\, d\tau[$]

[$]+\int_{t_0}^t \sigma(X(t_0))\, dz(\tau)[$]

[$]+\int_{t_0}^t \int_{t_0}^\tau \frac{\partial }{\partial X}[\sigma(X(s))]\mu(X(s))\, ds\, dz(\tau)[$]

[$]+\int_{t_0}^t \int_{t_0}^\tau \frac{\partial }{\partial X}[\sigma(X(s))]\sigma(X(s))\, dz(s)\, dz(\tau)[$]

[$]+\int_{t_0}^t \int_{t_0}^\tau 0.5 \frac{\partial ^2[\sigma(X(s))]}{\partial X^2} \sigma(X(s))^2\, ds\, dz(\tau)[$] Equation(8)

In the next order of the expansion, we evaluate the terms at time [$]s[$] at [$]t_0[$] and add third-order integrals. I am not showing the third-order integrals, but up to second order we will have

[$]X(t)=X(t_0)+\int_{t_0}^t \mu(X(t_0))\, d\tau[$]

[$]+\int_{t_0}^t \int_{t_0}^\tau \frac{\partial }{\partial X}[\mu(X(t_0))]\mu(X(t_0))\, ds\, d\tau+\int_{t_0}^t \int_{t_0}^\tau \frac{\partial }{\partial X}[\mu(X(t_0))]\sigma(X(t_0))\, dz(s)\, d\tau[$]

[$]+\int_{t_0}^t \int_{t_0}^\tau 0.5 \frac{\partial ^2[\mu(X(t_0))]}{\partial X^2} \sigma(X(t_0))^2\, ds\, d\tau[$]

[$]+\int_{t_0}^t \sigma(X(t_0))\, dz(\tau)[$]

[$]+\int_{t_0}^t \int_{t_0}^\tau \frac{\partial }{\partial X}[\sigma(X(t_0))]\mu(X(t_0))\, ds\, dz(\tau)[$]

[$]+\int_{t_0}^t \int_{t_0}^\tau \frac{\partial }{\partial X}[\sigma(X(t_0))]\sigma(X(t_0))\, dz(s)\, dz(\tau)[$]

[$]+\int_{t_0}^t \int_{t_0}^\tau 0.5 \frac{\partial ^2[\sigma(X(t_0))]}{\partial X^2} \sigma(X(t_0))^2\, ds\, dz(\tau)[$] Equation(9)

Since all the terms are evaluated at [$]t_0[$], they are constants and can be pulled out of the integrals:

[$]X(t)=X(t_0)+\mu(X(t_0)) \int_{t_0}^t d\tau[$]

[$]+\frac{\partial }{\partial X}[\mu(X(t_0))]\mu(X(t_0)) \int_{t_0}^t \int_{t_0}^\tau ds\, d\tau[$]

[$]+\frac{\partial }{\partial X}[\mu(X(t_0))]\sigma(X(t_0)) \int_{t_0}^t \int_{t_0}^\tau dz(s)\, d\tau[$]

[$]+0.5 \frac{\partial ^2[\mu(X(t_0))]}{\partial X^2} \sigma(X(t_0))^2 \int_{t_0}^t \int_{t_0}^\tau ds\, d\tau[$]

[$]+\sigma(X(t_0)) \int_{t_0}^t dz(\tau)[$]

[$]+\frac{\partial }{\partial X}[\sigma(X(t_0))]\mu(X(t_0)) \int_{t_0}^t \int_{t_0}^\tau ds\, dz(\tau)[$]

[$]+\frac{\partial }{\partial X}[\sigma(X(t_0))]\sigma(X(t_0)) \int_{t_0}^t \int_{t_0}^\tau dz(s)\, dz(\tau)[$]

[$]+0.5 \frac{\partial ^2[\sigma(X(t_0))]}{\partial X^2} \sigma(X(t_0))^2 \int_{t_0}^t \int_{t_0}^\tau ds\, dz(\tau)[$] Equation(10)

Integrals of the kind [$]\int_{t_0}^t \int_{t_0}^\tau ds\, d\tau[$], [$]\int_{t_0}^t \int_{t_0}^\tau dz(s)\, d\tau[$] and [$]\int_{t_0}^t \int_{t_0}^\tau dz(s)\, dz(\tau)[$], and other higher-order integrals, can very easily be evaluated analytically for Monte Carlo. These integrals also commute. I will post Monte Carlo simulation code for the CEV and lognormal processes based on the above logic in a few days.
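As a sketch of how these integrals can be sampled (the function names and the lognormal test case below are my own choices, not the promised code): over a step of length [$]dt[$] we have [$]\int d\tau = dt[$], [$]\Delta z = \int dz(\tau) \sim N(0, dt)[$], [$]\int\int ds\, d\tau = dt^2/2[$], [$]\int\int dz(s)\, d\tau[$] is Gaussian with variance [$]dt^3/3[$] and covariance [$]dt^2/2[$] with [$]\Delta z[$], [$]\int\int ds\, dz(\tau) = \Delta z\, dt - \int\int dz(s)\, d\tau[$] by integration by parts, and [$]\int\int dz(s)\, dz(\tau) = (\Delta z^2 - dt)/2[$]. One step of equation (10) then looks like:

```python
import numpy as np

def ito_taylor_step(x0, dt, mu, dmu, d2mu, sig, dsig, d2sig, rng, n_paths):
    """One step of the second-order Ito-Taylor scheme of equation (10).

    mu, sig are the drift and diffusion; dmu, d2mu, dsig, d2sig are their
    first and second derivatives in X (my naming, for illustration).
    """
    dz = rng.normal(0.0, np.sqrt(dt), n_paths)       # int dz(tau)
    aux = rng.normal(0.0, np.sqrt(dt), n_paths)      # independent normal
    I_00 = 0.5 * dt**2                               # int int ds dtau
    I_10 = 0.5 * dt * (dz + aux / np.sqrt(3.0))      # int int dz(s) dtau
    I_01 = dz * dt - I_10                            # int int ds dz(tau)
    I_11 = 0.5 * (dz**2 - dt)                        # int int dz(s) dz(tau)

    m, s = mu(x0), sig(x0)
    return (x0
            + m * dt
            + (dmu(x0) * m + 0.5 * d2mu(x0) * s**2) * I_00
            + dmu(x0) * s * I_10
            + s * dz
            + (dsig(x0) * m + 0.5 * d2sig(x0) * s**2) * I_01
            + dsig(x0) * s * I_11)

# Example: lognormal process dX = a X dt + b X dz over one step
a, b, x0, dt = 0.05, 0.2, 100.0, 0.1
rng = np.random.default_rng(0)
x1 = ito_taylor_step(x0, dt,
                     mu=lambda x: a * x, dmu=lambda x: a, d2mu=lambda x: 0.0,
                     sig=lambda x: b * x, dsig=lambda x: b, d2sig=lambda x: 0.0,
                     rng=rng, n_paths=200_000)
print(x1.mean())   # close to x0 * exp(a * dt)
```

The `I_10` construction reproduces the variance [$]dt^3/3[$] and its covariance [$]dt^2/2[$] with [$]\Delta z[$].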

If you like my posts above: I am looking for consulting and contract work, and you can contact me at anan2999(at)yahoo(dot)com

Statistics: Posted by Amin — Yesterday, 10:48 am


https://www.unilad.co.uk/technology/jer ... -mistakes/

Statistics: Posted by Cuchulainn — November 22nd, 2017, 12:19 pm


outrun wrote:Traden4Alpha wrote:Do you believe ODE/PDE are required for understanding DL or might it be just one lens among many?

I think it's very important what Cuch is working on; nobody has ever thought about the basic math

I am trying first of all to learn DL (it is very interesting), so I try to connect to the subject from my own background. It might come across as awkward, but I wish to flag each assumption.

Fair enough. There is of course *lots* of material you can read about any narrow aspect of the field

IMO you're focusing too much on loss minimization without picking up on the fact that the goal is never to find the absolute minimum of the loss; you ignore the statistical side.

The field is called *statistical* machine learning because you want to learn models from finite data sets. You have a finite set of high dimensional {x,y} pairs with an unknown relation between them. The goal is to find a function y_ = f(x,theta) that predicts y best given x. You have to pick a functional form for f and then find the theta. f could be a line and theta the slope + intercept without changing the argument. If you use DL then the functional form is a NN with multiple layers, and you have to pick the topology, activation functions etc. For finding theta you can use all sorts of methods, including trying out random theta values and seeing which gives the best result. That works really well if your surface is very rough. In DL people try to find good activation functions that shape the loss function to make it work nicely with GD. It's a computational metric: smooth is good, fast gradient computation is good. Things are pulled together intelligently: if you aren't allowed to use GD (or can't), then you pick a different type of f or loss function and make it perform well under those constraints. E.g. for some functions you can't specify a loss function.
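To make the line example concrete, here is a minimal sketch (the data, learning rate, and all names are mine): f is a line, theta = (slope, intercept), and theta is found by full-batch gradient descent on a squared loss.

```python
import numpy as np

# Finite {x, y} samples with a hidden relation; we only see the pairs.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 200)
y = 2.0 * x + 0.5 + rng.normal(0, 0.1, 200)   # hidden "true" relation

theta = np.zeros(2)          # (slope, intercept)
lr = 0.5
for _ in range(500):
    err = theta[0] * x + theta[1] - y          # prediction error
    grad = np.array([(err * x).mean(), err.mean()])  # d(loss)/d(theta)
    theta -= lr * grad                         # gradient descent step

print(theta)   # roughly (2.0, 0.5)
```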

The point where "statistical" comes in is that you have a finite set of samples. A large network would be able to memorize all samples, which would give zero loss, the ultimate lowest point in the loss surface! But you don't want to memorize, you want to learn the general structure in the data that links x to y and use that to find good y' values when presented with *new* {x'} samples. When training a NN you always monitor generalisation vs memorization. The stopping condition depends mostly on that and not much on the loss itself. This is also one of the reasons people use stochastic gradient descent instead of full gradient descent (this has analogies to bootstrap aggregation): by changing the subset of {x,y} pairs in every gradient step you change the loss surface in every step. This helps the NN not fixate on "the one and true loss surface that it wants to memorize"; it helps convey the fact that you have uncertainty in the loss surface due to sample noise (the finite set of training samples). The other reason is computational constraints: you have GPUs with limited memory and a large dataset you can't load into GPU memory, so you process the data in small batches.
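A minimal sketch of the mini-batch idea (all names and sizes are mine): each step draws a different random subset of the {x,y} pairs, so each step descends a slightly different loss surface.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 1000)
y = 2.0 * x + 0.5 + rng.normal(0, 0.1, 1000)   # hidden linear relation

theta = np.zeros(2)          # (slope, intercept)
lr, batch = 0.1, 32
for step in range(2000):
    idx = rng.integers(0, len(x), batch)   # fresh mini-batch each step
    xb, yb = x[idx], y[idx]                # this step's loss surface
    err = theta[0] * xb + theta[1] - yb
    theta -= lr * np.array([(err * xb).mean(), err.mean()])

print(theta)   # hovers around (2.0, 0.5), jittered by mini-batch noise
```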

For me the most interesting research area at the moment is the information theoretical angle on learning. When you train a deep NN it needs to do two things: learn to extract features from x and then use those features to predict y. A recent paper triggered lots of discussion: https://arxiv.org/abs/1703.00810 and I think that innovation wrt training will come from it.

ps: for GD the biggest issue is saddle points, not local minima. In high dimensions you'll find far fewer locations where all the gradients have the same sign than places where the gradients have different signs. All sorts of classical tricks are applied to try and tackle that: adding noise, adding momentum, ...
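A toy sketch of the noise trick on a saddle (the surface and all constants are my own choice): f(w) = w0^2 - w1^2 has a saddle at the origin. Started exactly on the ridge (w1 = 0), plain GD walks straight into the saddle; a little gradient noise kicks it off the ridge so it escapes.

```python
import numpy as np

def grad(w):
    # gradient of f(w) = w[0]**2 - w[1]**2
    return np.array([2.0 * w[0], -2.0 * w[1]])

def descend(noise, steps=300, lr=0.05, seed=4):
    rng = np.random.default_rng(seed)
    w = np.array([1.0, 0.0])                 # starting exactly on the ridge
    for _ in range(steps):
        w -= lr * (grad(w) + noise * rng.normal(size=2))
    return w

w_clean = descend(noise=0.0)    # converges to the saddle (0, 0)
w_noisy = descend(noise=0.01)   # |w1| grows: escaped along the descent direction
print(w_clean, w_noisy)
```

On this toy surface f is unbounded below, so the escape coordinate keeps growing; the point is only that noise breaks the symmetry that traps plain GD.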

Statistics: Posted by outrun — November 22nd, 2017, 8:53 am


Traden4Alpha wrote:Cuchulainn wrote:Wow. More is less? Don't see any ODE, DL/PDE in there.

Do you believe ODE/PDE are required for understanding DL or might it be just one lens among many?

I think it's very important what Cuch is working on; nobody has ever thought about the basic math

I am trying first of all to learn DL (it is very interesting), so I try to connect to the subject from my own background. It might come across as awkward, but I wish to flag each assumption.

Statistics: Posted by Cuchulainn — November 22nd, 2017, 7:28 am


Cuchulainn wrote:Are ODEs any good for solving simultaneous equations? (It may be a stupid question. I don't know the answer to that because I've never looked.)

This question has been posed and solved on my famous and extremely popular [$]e^5[$] thread. It is one of the 19 solutions. Have you forgotten? I posted a link already.

(It's a wee bit silly, that question, not because it is stupid (which it is not), but it is more of a lightning deflector).

Now that you mention it ...

More generally, you can embed a nonlinear system f(x) = 0 in a homotopy/continuation

[$]H(x,t) = (1-t)\{f(x) - f(x_0)\} + t f(x), \quad t \in [0,1].[$]

Differentiate WRT t to get an ODE! Woopsie-daisy. Very robust way to solve NL equations.

So, the answer is yes. It is even possible to perform optimisation using homotopy.

Very interesting.

But does it translate to simultaneous equations involving minimization rather than equality?

For linear systems this homotopy could be used, but it is overkill I suppose. For overdetermined or ill-posed systems, least squares is good. Use DE to avoid gradients?
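A concrete numerical sketch of this continuation (the example system, step counts, and all names are my own choice): since H(x,t) = f(x) - (1-t) f(x_0), setting H(x(t), t) = 0 and differentiating w.r.t. t gives the ODE J(x) dx/dt = -f(x_0) with x(0) = x_0, integrated from t = 0 to t = 1, where H = f.

```python
import numpy as np

def f(x):
    # my toy system: circle of radius 2 intersected with the line x0 = x1
    return np.array([x[0]**2 + x[1]**2 - 4.0, x[0] - x[1]])

def J(x):
    # Jacobian of f
    return np.array([[2.0 * x[0], 2.0 * x[1]], [1.0, -1.0]])

x0 = np.array([1.0, 0.5])      # arbitrary starting point
x = x0.copy()
rhs = -f(x0)                   # constant right-hand side of the ODE
n = 1000
for _ in range(n):             # explicit Euler from t = 0 to t = 1
    x += (1.0 / n) * np.linalg.solve(J(x), rhs)

for _ in range(3):             # a few Newton steps to polish the endpoint
    x -= np.linalg.solve(J(x), f(x))

print(x)   # approx (sqrt(2), sqrt(2))
```

Along the path f(x(t)) = (1-t) f(x_0) by construction, so the Euler trajectory is well-conditioned here; the Newton polish removes the O(1/n) integration error.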

Statistics: Posted by Cuchulainn — November 22nd, 2017, 7:20 am


Traden4Alpha wrote:Cuchulainn wrote:Wow. More is less? Don't see any ODE, DL/PDE in there.

Do you believe ODE/PDE are required for understanding DL or might it be just one lens among many?

I think it's very important what Cuch is working on; nobody has ever thought about the basic math

I agree that his goals are very laudable and I do hope his methods bear fruit. Alas, I fear that his methods will either lead nowhere (the proof he seeks does not exist), or they will require simplifications of DL algorithms that have been superseded by currently successful ones, or require restrictions on the underlying learned system that are not provable from the data and that merely displace the robustness question to that layer.

But I sincerely hope he proves me wrong!

Statistics: Posted by Traden4Alpha — November 21st, 2017, 11:53 pm


Cuchulainn wrote:outrun wrote:For those who are also getting a bit bored with all this craziness, the NIPS 2017 submissions are in: https://papers.nips.cc/book/advances-in ... ms-30-2017

SELU is already in Tensorflow 1.4

Wow. More is less? Don't see any ODE, DL/PDE in there.

Do you believe ODE/PDE are required for understanding DL or might it be just one lens among many?

I think it's very important what Cuch is working on; nobody has ever thought about the basic math

Statistics: Posted by outrun — November 21st, 2017, 10:12 pm


outrun wrote:For those who are also getting a bit bored with all this craziness, the NIPS 2017 submissions are in: https://papers.nips.cc/book/advances-in ... ms-30-2017

SELU is already in Tensorflow 1.4

Wow. More is less? Don't see any ODE, DL/PDE in there.

Do you believe ODE/PDE are required for understanding DL or might it be just one lens among many?

Statistics: Posted by Traden4Alpha — November 21st, 2017, 10:09 pm


outrun wrote:For those who are also getting a bit bored with all this craziness, the NIPS 2017 submissions are in: https://papers.nips.cc/book/advances-in ... ms-30-2017

SELU is already in Tensorflow 1.4

Wow. More is less? Don't see any ODE, DL/PDE in there.

Don't let that stop you, go for it

Statistics: Posted by outrun — November 21st, 2017, 10:08 pm


For those who are also getting a bit bored with all this craziness, the NIPS 2017 submissions are in: https://papers.nips.cc/book/advances-in ... ms-30-2017

SELU is already in Tensorflow 1.4

Wow. More is less? Don't see any ODE, DL/PDE in there.

Statistics: Posted by Cuchulainn — November 21st, 2017, 9:52 pm


SELU is already in Tensorflow 1.4

Statistics: Posted by outrun — November 21st, 2017, 8:10 pm


outrun wrote:In theory, one can map the loss function through any monotonic function, which would modulate all the slopes in the function as well as the differential heights/depths of local minima, saddles, etc., without changing the rank-order of those surface features. Another cool idea is to try and flip the sign of the loss function and use a maximizer instead of a minimizer!

Yes, that's why *log* likelihood optimization is equivalent.
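A toy numerical check of this equivalence (the data, model, and grid are my own choices): the argmax over a grid of candidate means is the same for a Gaussian likelihood and its log, because log is monotone.

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(1.7, 1.0, 50)          # i.i.d. N(mu, 1) samples
grid = np.linspace(0, 3, 301)            # candidate mu values

# likelihood (up to a constant factor) and log-likelihood (up to a constant)
lik = np.array([np.prod(np.exp(-0.5 * (data - m)**2)) for m in grid])
loglik = np.array([-0.5 * np.sum((data - m)**2) for m in grid])

mu_lik = grid[np.argmax(lik)]
mu_loglik = grid[np.argmax(loglik)]
print(mu_lik, mu_loglik)   # same argmax: the monotone map preserves it
```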

Statistics: Posted by outrun — November 21st, 2017, 8:04 pm


Another cool idea is to try and flip the sign of the loss function and use a maximizer instead of a minimizer!

Statistics: Posted by Traden4Alpha — November 21st, 2017, 6:46 pm


This question has been posed and solved on my famous and extremely popular [$]e^5[$] thread. It is one of the 19 solutions. Have you forgotten? I posted a link already.

(It's a wee bit silly, that question, not because it is stupid (which it is not), but it is more of a lightning deflector).

Now that you mention it ...

More generally, you can embed a nonlinear system f(x) = 0 in a homotopy/continuation

[$]H(x,t) = (1-t)\{f(x) - f(x_0)\} + t f(x), \quad t \in [0,1].[$]

Differentiate WRT t to get an ODE! Woopsie-daisy. Very robust way to solve NL equations.

So, the answer is yes. It is even possible to perform optimisation using homotopy.

But does it translate to simultaneous equations involving minimization rather than equality?

Statistics: Posted by Traden4Alpha — November 21st, 2017, 5:21 pm
