Or perhaps this angle:

Suppose you fit one of your methods to the ozone data. Before going into model choice: what possible purpose would be served by fitting a function through the data?


- Traden4Alpha
**Posts:** 23951

Cuchulainn wrote: Picking a curve fit method seems to involve a combination of: 1. Selecting the category of the approximating curve (e.g., a global polynomial, chain of splines, sinusoids, wavelets, multi-population model with independent curves, etc., etc

So, what's the answer? What are the criteria that lead to our choosing method 1 over method 2?

1) Metadata: Knowledge about the source (or application) of the data might include knowledge of likely mathematical properties such as form, derivatives, continuity, etc. of the underlying data generation phenomenon. That might lead to selecting a global versus a local curve fit (e.g., fitting a piece-wise description of a road network to GPS data points).

2) Heuristic Evaluation: eyeballing the shape of the scattergram for evidence of locality, periodicity, nonlinearity, etc.

3) Computational cost: Development expediency (e.g., using Excel's polynomial curve fit) or data throughput issues (e.g., using discrete cosines to fit image data) might drive curve fit method selection.

4) Minimizing an Error cost function: One might try multiple methods and pick the one with the least error (as "error" is defined for the problem).
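As a rough illustration of criterion 4, here is a minimal Python sketch that fits a few candidate families by least squares and keeps the one with the smallest error. Everything in it is an assumption made for illustration: the synthetic data, the three candidate families, the helper names (`design_polynomial`, `design_sinusoid`, `pick_best_fit`), and the choice of RMSE as the "error".

```python
import numpy as np

def design_polynomial(x, degree):
    # Columns 1, x, x^2, ..., x^degree (a global polynomial basis).
    return np.vander(x, degree + 1, increasing=True)

def design_sinusoid(x, n_harmonics, period):
    # Constant column plus sine/cosine pairs (a truncated Fourier basis).
    cols = [np.ones_like(x)]
    for k in range(1, n_harmonics + 1):
        w = 2.0 * np.pi * k / period
        cols += [np.sin(w * x), np.cos(w * x)]
    return np.column_stack(cols)

def rmse(y, y_hat):
    # Root-mean-square error; swap in whatever "error" means for the problem.
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def pick_best_fit(x, y, candidates, error=rmse):
    # Fit every candidate by linear least squares and score it on the same data.
    scores = {}
    for name, build in candidates.items():
        A = build(x)
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        scores[name] = error(y, A @ coef)
    best = min(scores, key=scores.get)
    return best, scores

# Synthetic, made-up data: a sine wave with period 2 plus a little noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 4.0, 80)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=x.size)

candidates = {
    "line": lambda t: design_polynomial(t, 1),
    "cubic": lambda t: design_polynomial(t, 3),
    "sinusoid": lambda t: design_sinusoid(t, 1, period=2.0),
}
best, scores = pick_best_fit(x, y, candidates)
print(best, scores)
```

Note that scoring on the same points used for fitting always favours the more flexible candidate within a nested family (the cubic can never score worse than the line here), so in practice the error would usually be measured on data the fit has not seen.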

Cuchulainn wrote: Let me put it as a question: what is the rationale/reason for using global polynomials in AI? I did look in a few books but no reason was given. I suppose someone can come out and give an answer.

Global polynomials are not used in most applications of numerical analysis these days AFAIK. It is a pre-1960s technique. In fairness, maybe AI is dealing with other issues.

That remark about using a 300-degree polynomial was hilarious. I hope you were joking. Was the example taken from Geron's book?

The probable reason for using global polynomials is an appeal to Taylor series and their ability to fit a great many types of functions (but not all!); computational simplicity; internal encoding of the derivatives of the curve; and familiarity to those who know basic maths (algebra & calculus).

Is a 300-degree polynomial any more ridiculous than using a million or more sine-wave segments?

Just to stop confusion and wild stories: my remark about fitting a 300 degree polynomial to a trillion point dataset has nothing to do with AI. Nothing I said has.

- katastrofa
**Posts:** 6139 **Location:** Alpha Centauri

I sometimes get the impression that the key concept of machine learning is to replace thinking with computational power. Most basic methods are prone to severe overfitting - they even invite it. Not to mention stacked models and other crazy things, which blind the user to the danger of even more overfitting.

Allowing overfitting is like assuming that the modelled system is a closed system (or functionally equivalent), namely that everything that can happen to it is described by the information it contains. Many model selection methods are based on this rationale too, e.g. AIC. By overfitting the data, one assumes that the information they carry is generated solely by the system's degrees of freedom, while - in reality - part of it can come from an external (usually stochastic) environment. Overfitting suppresses this unpredictable factor, although in many cases it is crucial. That's why I doubt ML will be adapted to fragile problems involving forecasting and risk anytime soon. I would envision that while the majority chases computational speed and efficiency of ML algos, serious machine learning will soon look to move onto D-Wave or some other quantum system to account for the stochastic environment.

Going back to trend detection, I used a very simple method in the end (my purpose was an algorithm that detects trends as I can see them with the naked eye). I run a linear regression on all possible contiguous groups of points and choose the best model with AIC or BIC.
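The contiguous-segment search described above can be sketched roughly as follows. This is not the JavaScript from the linked page; it is a Python illustration under stated assumptions: Gaussian residuals, BIC written as n ln(RSS/n) + p ln(n), a parameter count of p = 3k - 1 for k segments (slope and intercept per segment plus k - 1 breakpoints), and dynamic programming to combine the per-segment fits. The names `segment_rss`, `best_segmentation`, `max_segments` and `min_len` are hypothetical.

```python
import numpy as np

def segment_rss(x, y):
    # Residual sum of squares of an ordinary least-squares line through (x, y).
    A = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ coef
    return float(r @ r)

def best_segmentation(x, y, max_segments=4, min_len=3):
    n = len(x)
    # cost[i, j] = RSS of one line fitted to points i..j-1 (inf if too short).
    cost = np.full((n + 1, n + 1), np.inf)
    for i in range(n):
        for j in range(i + min_len, n + 1):
            cost[i, j] = segment_rss(x[i:j], y[i:j])

    # total[k, j] = lowest RSS when the first j points are split into k segments;
    # prev[k, j] = start index of the last of those segments.
    total = np.full((max_segments + 1, n + 1), np.inf)
    prev = np.zeros((max_segments + 1, n + 1), dtype=int)
    total[0, 0] = 0.0
    for k in range(1, max_segments + 1):
        for j in range(1, n + 1):
            options = total[k - 1, :j] + cost[:j, j]
            i = int(np.argmin(options))
            total[k, j], prev[k, j] = options[i], i

    # Pick the number of segments with the lowest BIC (assumed penalty, see above).
    best_k, best_bic = None, np.inf
    for k in range(1, max_segments + 1):
        rss = total[k, n]
        if not np.isfinite(rss):
            continue
        n_params = 3 * k - 1
        # The floor on RSS only guards against log(0) for exact fits.
        bic = n * np.log(max(rss, 1e-12) / n) + n_params * np.log(n)
        if bic < best_bic:
            best_k, best_bic = k, bic

    # Walk back through prev to recover the start index of each segment.
    starts, j = [], n
    for k in range(best_k, 0, -1):
        j = int(prev[k, j])
        starts.append(j)
    return best_k, sorted(starts)[1:]  # drop the leading 0, keep the breakpoints

# Made-up data: an upward trend followed by a downward trend, plus noise.
rng = np.random.default_rng(0)
x = np.arange(40, dtype=float)
y = np.where(x < 20, x, 40.0 - x) + 0.2 * rng.normal(size=x.size)
k, breaks = best_segmentation(x, y)
print(k, breaks)
```

The cost table is quadratic in the number of points, so a sketch like this is only practical for short series; swapping `np.log(n)` for `2` in the penalty would give the AIC variant under the same assumptions.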

You can play with it here: https://www.averisera.uk/machine-learning-demo.html#linear-trends

(There are also the k-means clustering demo and its application to trend detection above.)

The JavaScript code is on the webpage, but if you want to use it quickly, I can either give you the files or add a file upload to the webpage, so that you can test it on your data (I'm afraid it won't handle trillions of data points though).

(The website is a bit of an embarrassment - I never have time to finish/update it.)

All right ramblers, enough rambling

- Traden4Alpha
**Posts:** 23951

Excellent points!

Yet the human eye (and brain) is among the worst overfitters in history. The penchant for overfitting seems to be one of the best and worst qualities of human cognition. People assume there is no noise -- there's only structure as created by physical phenomena or gods (and even gods don't play dice). It's both the driver of science and superstition. There's even math (Ramsey theory) to prove that patterns are guaranteed to occur where none exist.

You are right that naive applications of ML (or any sufficiently combinatoric statistical method) will overfit badly. The challenge is in adding methods that characterize the chance of an overfit, restrict the original process to a modest M tries to fit N data points, or construct a prudent amount of out-of-sample testing.

Personally, I'd think that quantum computing will overfit even more than traditional computing, in that it potentially calculates something on every possible superposed state. How would a D-Wave machine handle accidental coincidence?

- katastrofa
**Posts:** 6139 **Location:** Alpha Centauri

How could out-of-sample testing reduce overfitting? TBH, I cannot see any practical value in it for testing data analysis models: if my samples are representative, out-of-sample testing will be positive; if they are atypical, it will be negative. It says more about the data than about my model.

In a quantum computer, the computations are performed via unitary transformations (no dissipation and no interaction with the superposed state is allowed, otherwise the quantum superposition would be destroyed). This means that the states cannot be exploited as degrees of freedom. A QC working on a group of bits can find their mutual state immediately, since they act collectively thanks to quantum effects, instead of going through them one after another, as a classical computer would do. The number of bits is the same though. (I'm going down to basics because I didn't understand that bit in your comment; what's accidental coincidence?)

In D-Wave, the coupled SQUIDs only "imitate" such quantum states, meaning that they are not isolated from the environment by design. This "flaw" allows it to, e.g., find multiple ground states (a different solution in each run), because random kicks from the environment can steer it to a different minimum. I would also imagine that the presence of an environment, a.k.a. a bath, can shed new light on old models; e.g., look at the ingenious visualisation of finding the shortest path between two points I prepared. Without any bath, it's an interpolation problem - the solution is a straight line between my fingers. The bath changes the topology of the problem by allowing something that resembles a virtual transition between stationary states and, from the system's perspective, it becomes an extrapolation problem.

- Traden4Alpha
**Posts:** 23951

Thanks for the explanation (with clever props!)

If the number of states of the true (but unknown) system exceeds the sample size, how can the data be "representative"? The data must be missing some portion of the true structure of the system. (That's certainly a serious problem in physics, in which all experiments sample a very restricted subset of energy, velocity, distance, mass, etc.)

Or if data is affected by experimental error in the independent or dependent variables, how can said data be representative? Noisy data may encode impossible states (e.g., imagine a noisy measurement of the number of dots on a simple 6-sided die that produces the occasional 0 or 7 in the data.) Or the noisy data might, by chance, have some seemingly coherent pattern (region of spurious density or geometry).

I'd think the general condition in science is that any set of data must be assumed to be both under-sampling the full structure of the system and also contain some spurious structure induced by experimental error and measurement noise. In the most general case, one does not even know the magnitude or distribution of the noise. All one has is data.

Imagine a data set of 3 X-Y pairs. A quick plot might show the three points are not collinear. Next, we put these bits of data into a QC curve fitter. How does the QC curve fitter decide whether the mutual state of this set of bits is reflective of a linear fit (with some residual error) or a parabolic fit (with zero residual error)? The linear fit might be correct or it might be an underfit. The parabolic fit might be correct or an overfit. Or does the QC system spit out an answer that the system remains in a multi-state condition, with both the linear-fit and parabolic-fit solutions left?

If we add a fourth data point and rerun the analysis, the fit of the linear solution might improve or degrade, the fit of the parabolic solution can only degrade (unless the data has zero noise), and the fourth data point might suggest the true system is a cubic (with a perfect fit to the 4 data points). Yet any changes in the results from the 3-point to the 4-point case might reflect what I called accidental coincidence: whatever assessment one makes of the chance that the true system is linear, parabolic, or cubic, there's the chance that experimental error has affected the results -- making the various fits seem better or worse than they are.
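The 3-point versus 4-point argument is easy to check numerically. The following is a minimal sketch with made-up values (a roughly linear trend plus hand-picked "noise"); the helper name `poly_rss` is hypothetical. It shows only the mechanical part of the argument: a polynomial of degree n - 1 passes exactly through n points with distinct x values, so a zero residual by itself says nothing about whether the fit is right.

```python
import numpy as np

def poly_rss(x, y, degree):
    # Residual sum of squares of a least-squares polynomial of the given degree.
    coef = np.polyfit(x, y, degree)
    r = y - np.polyval(coef, x)
    return float(r @ r)

# Made-up data: three points that are close to, but not exactly on, a line.
x3 = np.array([0.0, 1.0, 2.0])
y3 = np.array([0.1, 0.9, 2.3])

# A fourth made-up observation.
x4 = np.append(x3, 3.0)
y4 = np.append(y3, 2.9)

for label, x, y in (("3 points", x3, y3), ("4 points", x4, y4)):
    for degree in (1, 2, 3):
        if degree >= len(x):
            continue  # more coefficients than points: the fit is not unique
        print(f"{label}, degree {degree}: RSS = {poly_rss(x, y, degree):.4f}")
```

With three points the parabola has a residual of zero up to rounding, and with four points the cubic does; the parabola that was "perfect" on three points now leaves a nonzero residual.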

The point of out-of-sample testing is to cope with the all-too-common (flawed) approach to curve fitting that tries everything and thus effectively exceeds the DoF of the data. Testing any outcomes of exploratory, combinatoric statistical methods against "new data" is one way to detect whether overfitting has happened.
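A minimal sketch of that idea, under assumptions chosen purely for illustration: synthetic data generated from a straight line plus Gaussian noise, a random 50/50 split, and global polynomials of increasing degree as the "try everything" family. The function name `train_test_errors` and its parameters are hypothetical.

```python
import numpy as np

def train_test_errors(x, y, degrees, train_fraction=0.5, seed=0):
    # Randomly split the points, fit on one part, score on both parts.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    n_train = int(train_fraction * len(x))
    train, test = idx[:n_train], idx[n_train:]
    results = []
    for d in degrees:
        coef = np.polyfit(x[train], y[train], d)
        err_in = float(np.mean((y[train] - np.polyval(coef, x[train])) ** 2))
        err_out = float(np.mean((y[test] - np.polyval(coef, x[test])) ** 2))
        results.append((d, err_in, err_out))
    return results

# Made-up data: the true system is linear, the rest is noise.
rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 40)
y = 2.0 * x + 0.3 * rng.normal(size=x.size)

results = train_test_errors(x, y, degrees=range(1, 10))
for d, err_in, err_out in results:
    print(f"degree {d}: in-sample MSE = {err_in:.4f}, out-of-sample MSE = {err_out:.4f}")
```

The in-sample error can only go down as the degree grows, because each polynomial family contains the previous one. The out-of-sample error has no such guarantee, and a widening gap between the two columns is the usual symptom of a fit that has started to chase noise.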

-----

Perhaps what I don't understand (my ignorance of QC) is whether an N-bit QC can compute something that would be impossible to compute on a 2^N core parallel computer in constant time or perhaps even a single-core computer taking O(2^N) time? Are QCs a qualitatively different class of computer from a Turing universal computing machine?

- katastrofa
**Posts:** 6139 **Location:** Alpha Centauri

I'm sorry I'm not responding to your posts, but I need to stay focused on something else for now (there's a danger that I would spend hours here chatting about physics and miss my deadline).

Still, QCM is almost here:

http://www-03.ibm.com/press/us/en/press ... /53374.wss

https://phys.org/news/2017-11-ibm-miles ... antum.html

- Traden4Alpha
**Posts:** 23951

No worries!

(And those are exciting advancements in QCM. I wonder if after a few years, QCM bit depth growth will be exponential or slower? When will there be megabit QCM?)
