Science is the search for an understanding of how things work. Its domain is what we know we do not yet know. Its language speaks of probabilities and of interactions between the parts of a whole — be they particles, compounds, tissues, organisms, and so on. What is currently understood, what can be known from our theories and models — that is the output of Science.
Science takes a certain kind of regular unknown as its input, the kind that lets itself be repeatedly dissected, experimented with, and learnt from. From these unknowns, the gears of Science distill what can be known by observation, often producing surprising insights from seemingly uninteresting sources. As the gears move, they squash the uncertain terrain they process, denting it with new knowledge, redrawn boundaries, and illusory patterns. At each turn, if too many of these illusions come out of the noise, they will clog future gears and stall scientific progress.
The scientific pursuit, then, sits at the frontier of a constant unknown, an unknown that keeps moving, morphing, changing. To tame it, we sketch around it, delineating its contours and drawing maps little by little so we do not get lost. Some maps are better than others. Some maps are deadly and some are useful. Some maps leave us better off if we just toss them away and instead go for a blind but careful walk, stumbling upon obstacles until we can draw a better map anew.
Computer Science is a deep and wide field, and so is its map. Some of its areas of study are mostly theoretical, working on idealized constructs to solve abstract problems. Automata Theory, Algorithmic Complexity, or Computational Logic are examples of Computer Science domains in which theory reigns supreme. Other fields are more eminently applied, namely those that deal with physical devices, like Computer Architecture, or those with an industrial bent, such as Software Engineering.
This divide is often blurry: people working on Algorithmic Complexity often have to cope with physical realities, such as the existence of memory caches and the implications they have in terms of memory complexity. Likewise, most Machine Learning algorithms are grounded in theoretical work proving that they can represent certain kinds of problems in one way or another, at least asymptotically.
However, it is still a useful distinction: few researchers in Complexity Theory will have to deal with the uncertainty faced by an author proposing a new kind of learning algorithm. Machine Learning algorithms should work with data in the wild, and that calls for a scientific approach closer to that of the physical sciences than to pure mathematics. An algorithm that is proven to work in theory but fails in practice, even if its proofs are beautiful and perfectly correct, will likely see no use because its underlying assumptions do not fit the real world. The opposite case, an algorithm that works even though the reason why it works is not fully understood, will nevertheless see interest from academia and industry alike.
At this point, the question is: what do we mean by ‘working’? There is an obvious answer: the system does what it is supposed to, using reasonable amounts of resources, in a measurable way. This answer is only partly quantitative: the last two clauses in it are objective, but the first one is not. The understanding of what a system should do is something that we will have to provide for as long as we do not have Artificial General Intelligence. Once we clarify, in concrete terms, what we think something is supposed to do, we can go ahead and tackle the problem in both engineering and scientific terms.
For instance, a question answering system could do many different kinds of things. Deciding that it should be given a question and a text passage, and that the answer should be extracted from that passage, is just one way of making a question answering system work — one out of many! It is also a way that allows us to quantify how many resources we use and to measure the quality of the system. If we annotate passages from texts as answers to questions, we can build a ground-truth data set to both train and evaluate such a system. For instance, we could compare, according to some metric, the overlap between the answers the system provides and the answers we annotated. This is how SQuAD, a commonly used question answering data set in neural natural language processing, is designed.
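To make the idea of answer overlap concrete, here is a minimal sketch of a token-level F1 score in the spirit of the metric commonly reported for SQuAD. It is not the official evaluation script: it assumes plain lower-cased whitespace tokenization and skips the usual punctuation and article stripping.

```python
from collections import Counter

def token_f1(predicted: str, annotated: str) -> float:
    """Token-level overlap between a predicted answer and an annotated one.

    A simplified stand-in for SQuAD-style evaluation: lower-cased,
    whitespace-tokenized, no punctuation or article normalization.
    """
    pred_tokens = predicted.lower().split()
    gold_tokens = annotated.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a perfect match; only one empty is a miss.
        return float(pred_tokens == gold_tokens)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Partial credit for partially overlapping answers.
print(token_f1("in the Eiffel Tower", "the Eiffel Tower"))  # ~0.86
```

Averaging such a score, together with an exact-match count, over every annotated question is then one way of putting a single number on how well the system does.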
By approaching the learning problem like this, it becomes empirical and experimental. Notice that we will only ever have a limited subset of all possible data, whatever that may be, and, due to measurement or annotation errors, we will always have a fuzzy picture of the whole. With question answering, we will always be able to ask new questions about new text passages — unless at some point we stop being imaginative! Our job is to draw a map that accounts for this uncertainty, to make sure that our research creates as few illusory patterns as possible. The value of Science depends on the quality of its models and their predictions. If we fool ourselves, other researchers, or society at large, we will lower the trust in Science as a method for figuring out how things work. How can we be sure we are not fooling anyone, including ourselves, whether we intend it or not?
When we figure things out about how a system works, we expect the things we discover to remain true. The persistence of our results matters because, after the hassle of experiments and analysis, we want the things we learn to be both insightful and lasting. Our experiments may show some regularity in the results because there is an underlying property that we are observing, but we might also just be getting lucky because of the moment in time or our geographical position. Making sure that it is not Lady Luck driving our experiments is what it means to not fool ourselves.
In the wake of complex Machine Learning models with many parameters and ever-more-complex architecture search, there are many ways for luck to creep in for the worse: could the training algorithm have reached a particular set of parameters that does very well on our validation or test sets, but not elsewhere? Does the data resemble the real world? How well do the proposed representations carry over to slightly different problems? And how well, if at all, to problems that are very different altogether?
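One way to start answering the first of these questions is to repeat the whole training process under several random seeds and look at the spread of the scores rather than at a single number. The sketch below is only illustrative: `train_and_evaluate` is a hypothetical stand-in for a full pipeline that shuffles the data, initializes the model, trains it, and scores it on a held-out set; here it merely simulates run-to-run noise.

```python
import random
import statistics

def train_and_evaluate(seed: int) -> float:
    """Hypothetical stand-in for a full training run.

    A real implementation would shuffle the data, initialize the model,
    train it, and return a score on a held-out set. Here we only simulate
    run-to-run noise around a fixed underlying score.
    """
    rng = random.Random(seed)
    return 0.80 + rng.gauss(0.0, 0.01)

scores = [train_and_evaluate(seed) for seed in range(10)]
print(f"score: {statistics.mean(scores):.3f} "
      f"+/- {statistics.stdev(scores):.3f} over {len(scores)} seeds")
```

Reporting the mean and the spread, instead of the single luckiest run, is a small first step towards keeping Lady Luck out of our conclusions.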
Our questions highlight different senses of what reproducibility means. An initial sense of reproducibility concerns our own results and our belief that they capture some measurable truth about our problem, all within some bound of uncertainty. This sense comes from the fact that our data is noisy, our training process may involve randomness, for instance by shuffling the data, and different executions produce different results. In this case our job is to first understand and then limit the amount of uncertainty:
Another sense holds the data, the metrics used to evaluate the model, and the overall task being tackled as suspect. If the underlying assumptions behind our work do not align with something tangible, if our work claims to solve one task while actually tackling a different one, if we have to dress up what we are doing to pass through the publication filter, it is likely we are fooling ourselves and others — willingly or not. Precisely because it is hard work, work we have poured our effort into, we have to be very wary and remain doubtful so that we avoid mistakes. Our job becomes to ensure that:
Finally, there is a sense of generality in reproducibility. This ties in with the first point we raised in the previous list: we often present models as performing a general task, while in fact they only perform the task of ranking highly under a given metric on a given data set. While this is valuable in itself, provided the task is a meaningful baseline, it goes against the idea of persistence in time. If we do things well, we will be able to replicate the results, but they might not be useful in a world with more compute power, larger amounts of data, or changes in how the task at hand is solved.
General reproducibility is challenging. By general reproducibility we mean the idea that a proposed method is applicable to a class of problems, and that it performs well across that domain. This is hard: even for the same task, data sets are hardly comparable, and methods may require changes before they can be applied to all of them. Furthermore, in the case of mixed results across those data sets, there is no general way of weighting them to decide which model is preferable.
At this point, we could argue that a neural model with fewer parameters and smaller weights is better. However, ongoing research on architecture search yields highly convoluted architectures with very few parameters! The problem then repeats itself: how do we weigh architectural complexity against the number of parameters? What if our model combines several algorithms in an ensemble? Is the complexity the sum of the parts, or do we have to consider the software infrastructure that pulls it all together?
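Even the crudest of these measures, counting trainable parameters, already forces such choices. The sketch below uses PyTorch and two made-up ensemble members; summing their parameter counts says nothing about the infrastructure that combines their predictions.

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Number of trainable parameters: one crude measure of model size."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Two hypothetical ensemble members with very different shapes.
wide = nn.Sequential(nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, 2))
deep = nn.Sequential(nn.Linear(128, 64), nn.ReLU(),
                     nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 2))

print("wide:", count_parameters(wide))
print("deep:", count_parameters(deep))
# Is this the ensemble's complexity, or only part of it? The code that
# routes inputs and merges predictions never shows up in this count.
print("ensemble:", count_parameters(wide) + count_parameters(deep))
```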
General reproducibility is a point at which it is easy to fool ourselves. It is not possible to ensure that our results will remain relevant in the face of the ever-growing power of computers, their larger memories, and the enormous data sets they will be able to hold. But we can at least make sure our results are as relevant as possible to the present, even if that shows they are not that impressive. In the era of Science-as-a-publication-engine, this can mean that we have a harder time getting our results published. However, it also means we are doing our due diligence, ensuring we fool neither ourselves nor our fellow scientists, engineers, and any other readers of our work. In short, we must ensure that:
We know that the map we want to draw is a complicated one. Machine Learning is, at its core, a tool that can be used in any discipline and field of study. We do not know how much further we can take techniques such as Deep Learning, or what tasks we can get those models to perform.
That we do not know how far we will get with our current methods is business as usual. After all, the way Science advances is by never overlooking its uncertainties. We will never be sure that our models are correct, and we can only hope that we will always be wise enough to revise them — all the more so when they become absolutes in our minds, in the minds of scientists and laymen alike. Our suggestions for scientific cartography do nothing but remind you of this fact: we must pay attention to flaws, simplifications, errors, and uncertainty. When we are working on techniques that may empower future analyses in many other fields, we must be especially wary of those hobgoblins of research.
Looking at the limitations of our models, talking loudly, proudly, and clearly about what they cannot capture, paying special attention to our errors and mistakes: this is the way to figure out the boundaries we will be sharing with our colleagues so they can uncover more secrets in fields of their own. Giving our fellow scientists a flawed tool whose flaws are hidden behind grandiose expositions and marketing hype is a recipe for clogging the machinery of Science.
I opened this text by saying that the aim of Science is to figure out how things work, at least for any and all things that can be measured. As Machine Learning becomes the field that allows us to disentangle obscure relationships that were unthinkable before, we feel the weight of every other field on our shoulders. If we are not aware of how much we do not know, if we do not communicate it, if we do not make our doubt a priority and allow others to check it, we will cost our colleagues and ourselves all the work and resources that go to waste when our models turn out to be wrong. It is only through reproducible research that we will bear this weight on our shoulders, that we will be able to have giants standing on our backs. It will be hard work, but in the end our map, however incomplete, will be drawn.