So how does the ML researcher ensure they've got good training data (and how many "unreliable" ML projects are the fault of the data, not the ML)?
This is a mathematical issue: for which class of data is a given algorithm applicable? But does the 'average' ML researcher have the necessary background?
Put another way: assuming the input is what it is, which algorithms will work with it? Ideally, you should know a priori. You can't bend the laws of mathematics; e.g., taking the derivative of a discrete function is darn difficult.
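To make the "derivative of discrete data" point concrete, here is a minimal sketch (my own illustration, not from the discussion above): finite differencing of sampled, noisy measurements amplifies the noise by a factor of roughly 1/h, where h is the grid spacing. The function and noise level are arbitrary choices for the demo.

```python
import numpy as np

# Sample f(x) = sin(x) on a coarse grid, then add small measurement noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 2 * np.pi, 200)
h = x[1] - x[0]
clean = np.sin(x)
noisy = clean + rng.normal(0.0, 0.01, size=x.size)

# Forward finite difference: (f[i+1] - f[i]) / h.
# Differencing subtracts two noisy values and divides by h,
# so the smaller the step, the worse the noise amplification.
deriv_clean = np.diff(clean) / h
deriv_noisy = np.diff(noisy) / h

err_signal = np.abs(noisy - clean).max()            # noise in the samples
err_deriv = np.abs(deriv_noisy - deriv_clean).max() # noise in the derivative
print(err_signal, err_deriv)  # the derivative error is far larger
```

Refining the grid makes the amplification worse, not better, which is exactly why "just differentiate the data" fails for real measurements without smoothing or regularization.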
Is it a mathematical issue?
What math (actually what algorithm) can determine that a given body of physical data belongs to a given mathematical class?
And, more importantly, isn't it an issue of science to determine whether a given physical system can be modeled by a particular class of mathematics? That decision is then prerequisite to determine whether a particular subset of data from that physical system is suitable for a particular ML algorithm.
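One partial answer to "what algorithm can determine that data belongs to a given mathematical class" is a goodness-of-fit test. A hedged sketch (my own example, using synthetic data): before applying a method that assumes Gaussian inputs, run a normality test and reject the assumption when the p-value is small.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
gaussian_data = rng.normal(0.0, 1.0, 500)   # actually Gaussian
skewed_data = rng.exponential(1.0, 500)     # clearly not Gaussian

# D'Agostino-Pearson normality test: a small p-value means
# "reject the hypothesis that this data is normally distributed".
_, p_gauss = stats.normaltest(gaussian_data)
_, p_skew = stats.normaltest(skewed_data)

def looks_gaussian(p, alpha=0.05):
    # Fail to reject normality at significance level alpha.
    return p > alpha

print(p_gauss, p_skew)
```

Of course, such a test only checks one narrow statistical property; it cannot settle the larger scientific question of whether the physical system is well modeled by that class of mathematics in the first place.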
aka the Scientific Method, or is it too early?
No, it's perfect timing. Yet to your point about math, the Scientific Method doesn't provide much specific guidance on the relationships between the mathematical form of a theory, the mathematical nature of controls and interventions in an experiment, and the proper mathematical interpretation of the data. So one needs both science & math.
BTW, I had a coworker who preferred religion to science because, he actually said, he hated how science was always changing its mind! He preferred to believe a reliable falsehood rather than an unreliable truth! Perhaps ML is unreliable because the knowledge of the world we use to collect data and apply ML is unreliable. It's the bathwater, not the baby, that's dirty.