In our work applying machine learning to our customers' problems, we see the direct benefits we gain from the so-called "democratization of AI". In Imaginea's business context, a specific interpretation of that phrase is that a small number of engineers will be involved in working on problems under the hood while a much larger number will be building intelligent systems on top of the tools produced by these engineers. This is how I think this industry will pan out over the coming years, and very quickly.

The way the software industry has expanded in the past several decades is quite similar - with a (relatively) small number of teams building core tools and libraries and a much larger populace gaining proficiency in building applications, products and services using them.

We know that high quality tools like TensorFlow and PyTorch are making machine learning more widely accessible, and that the bigwigs Amazon, Google and Microsoft are also building easy to use ML-as-a-service offerings. Beyond these, I see a couple of recent developments that feel fundamental. They promise to change the ML game and begin a deep kind of democratization of AI beyond the known tooling.

Differentiable programming

Wikipedia has this to say about differentiable programming, or ∂P -

Differentiable programming, or ∂P, is a programming paradigm in
which the programs can be differentiated throughout, usually via
automatic differentiation. This allows for gradient
based optimization of parameters in the program, often via
gradient descent.

Widely available ∂P is, I think, going to be a game changer for developing smart systems. While this looks like a deeply mathematical development, it is simple to explain to a programmer as below.

A normal function that computes some output from given input.

The normal process of software development involves coding up a procedure to turn a known input into a desired output. The previous generation of "intelligent" systems often involved such procedures that captured large amounts of domain wisdom in parameterized rules. A procedure could be walking through a set of carefully chosen nested if-then-else conditions to decide a particular action or presentation, for instance. In other words, these procedures encode heuristics to solve a problem.

A parameterized function where parameters are to be calculated from known data.

While the current crop of ML systems is about replacing these heuristic procedures with functions that are learnt from available data, the availability of ∂P brings a new kind of programming for such intelligent systems. If you code up a parameterized procedure that shows how some input could be used to compute some output, then a ∂P system can automatically produce a procedure that can learn optimal values for these parameters from known data.

Auto-calculated gradients help estimate parameters from known INPUT-OUTPUT data.

I'm extrapolating a bit here, but it's not much. A ∂P system just automatically calculates the partial derivatives of any given function with respect to its inputs. A developer can use this to mechanically write a differentiable cost function using the real function of interest, pass its automatically calculated derivatives to a standard optimizer such as SGD along with known INPUT-OUTPUT data, and get learnt parameters that fit the data. This is possible because ∂P composes well - i.e. a coder can call other functions from within their function and the ∂P mechanism can figure out how to calculate derivatives right through the function calls. The posts Automatic Differentiation, Dual/Taylor numbers and Higher Ranked Beings may help you understand how derivatives can be automatically calculated, though those posts are not about actual production implementations.
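To make the recipe above concrete, here is a minimal sketch in plain Python. It is not how Zygote or any production ∂P system works: it uses a toy forward-mode "dual number" class, and plain full-batch gradient descent stands in for SGD. The names `Dual`, `model`, `cost`, `gradient` and `fit` are my own, chosen for illustration, and the model (a straight line) is just an assumed example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """A value paired with its derivative with respect to one chosen variable."""
    val: float
    dot: float = 0.0

    @staticmethod
    def lift(x):
        # Treat plain numbers as constants (derivative zero).
        return x if isinstance(x, Dual) else Dual(float(x))

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__

    def __sub__(self, other):
        other = Dual.lift(other)
        return Dual(self.val - other.val, self.dot - other.dot)

    def __rsub__(self, other):
        return Dual.lift(other) - self

    def __mul__(self, other):
        other = Dual.lift(other)
        # Product rule.
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__


def model(params, x):
    """The function of interest: a line with unknown slope and offset."""
    slope, offset = params
    return slope * x + offset


def cost(params, data):
    """Mean squared error between model outputs and known outputs."""
    total = 0.0
    for x, y in data:
        err = model(params, x) - y
        total = total + err * err
    return total * (1.0 / len(data))


def gradient(f, params):
    """Partial derivatives of f at params, one forward pass per parameter."""
    grads = []
    for i in range(len(params)):
        seeded = [Dual(p, 1.0 if j == i else 0.0) for j, p in enumerate(params)]
        grads.append(f(seeded).dot)
    return grads


def fit(data, steps=500, rate=0.05):
    """Plain gradient descent on the cost, starting from zeros."""
    params = [0.0, 0.0]
    for _ in range(steps):
        grads = gradient(lambda p: cost(p, data), params)
        params = [p - rate * g for p, g in zip(params, grads)]
    return params


# Known INPUT-OUTPUT pairs generated from slope 3 and offset 1.
data = [(x, 3.0 * x + 1.0) for x in [-2.0, -1.0, 0.0, 1.0, 2.0]]
slope, offset = fit(data)
print(slope, offset)
```

Note that `model` and `cost` contain no derivative code at all. The derivatives fall out of running the same code on `Dual` values, which is also why calling other functions from within `model` would keep working.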

"Training" model parameters through a comparator or "cost" function that produces a single score.

A programmer now no longer needs to learn a plethora of specific model types such as linear or logistic regression. She can use domain knowledge and basic math to code up a model procedure that can combine any number of these techniques, with the ∂P system in conjunction with the optimizer figuring out how to estimate the model's parameters from known data.

At a high level, this can feel like magic because writing a function to calculate some output from an input automatically also gives a way to estimate (some) input that will produce a given output. Instead of propagating the gradients to the parameters (there are none in this case), we propagate them back to the input in the above picture, with the output held at the desired value.
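As a rough illustration of propagating gradients back to the input, the sketch below searches for an input that produces a desired output. Everything here is assumed for the example: the function `f`, the target value, the starting point and the step size are all my own choices, and the tiny `Dual` class is a toy forward-mode differentiator rather than a real ∂P system.

```python
class Dual:
    """A value paired with its derivative with respect to the input."""

    def __init__(self, val, dot=0.0):
        self.val = float(val)
        self.dot = float(dot)

    @staticmethod
    def lift(x):
        # Treat plain numbers as constants (derivative zero).
        return x if isinstance(x, Dual) else Dual(x)

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__

    def __sub__(self, other):
        other = Dual.lift(other)
        return Dual(self.val - other.val, self.dot - other.dot)

    def __mul__(self, other):
        other = Dual.lift(other)
        # Product rule.
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__


def f(x):
    """An ordinary forward computation with no parameters: x^3 + 2x."""
    return x * x * x + 2 * x


def solve_input(func, target, start=0.0, steps=200, rate=0.002):
    """Gradient descent on the input, with the output held at the target."""
    x = start
    for _ in range(steps):
        err = func(Dual(x, 1.0)) - target   # seed: d(x)/d(x) = 1
        loss = err * err                    # squared distance from the target
        x -= rate * loss.dot                # step along -d(loss)/d(x)
    return x


x_found = solve_input(f, target=12.0)
print(x_found, f(x_found))
```

The step size was picked by hand for this particular `f`. A real optimizer would adapt it, and a function with several inputs mapping to the same output would only yield one of them.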

A great example of how well integrated ∂P can be within an otherwise conventional programming language is Zygote - a library for Julia (paper). ∂P is also coming to Swift. As more languages make unconstrained ∂P available, we should see more learned intelligence in the applications we build. We won't have to unroll derivatives by hand any more. Building online learning systems should also get easier.

Data programming (Snorkel)

While ML looks like a cool field to be in, the programs produced are rather pointless without adequate data. It is an open secret that a significant amount of the work on an ML project involves ensuring a good and clean data set. Furthermore, many problems that can be solved using known ML techniques are often stalled due to inadequate data, or inadequate budget to produce and maintain the required datasets. For example, I estimate that the effort that goes into producing the datasets posted on Kaggle for competitions is many times bigger than what the prize money suggests.

A recent entrant that promises to change this game is the Snorkel tool by HazyResearch. Snorkel, Fonduer and the family of tools use an approach they refer to as "data programming" to alleviate the tedium involved in producing and maintaining high quality data sets while substantially reducing the costs of producing them.

The core approach of Snorkel is to use short programs referred to as "labeling functions" to roughly capture domain knowledge and patterns known to the team, and use statistical techniques to combine many such noisy labeling functions to produce datasets with higher quality labels. This technique of combining many poor predictors to make a stronger predictor is similar to the approach used in many ML techniques such as random forests and conditional random fields.
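To give a feel for what labeling functions look like, here is a small sketch in plain Python. It deliberately does not use Snorkel's API, and it combines votes with a simple majority vote rather than the statistical model Snorkel fits. The spam-versus-ham task, the label constants and every function name are hypothetical, made up for this illustration.

```python
from collections import Counter

# Hypothetical label values for a toy spam/ham task.
ABSTAIN, HAM, SPAM = -1, 0, 1


def lf_contains_link(text):
    """Heuristic: messages with links tend to be spam."""
    return SPAM if "http://" in text or "https://" in text else ABSTAIN


def lf_mentions_prize(text):
    """Heuristic: talk of prizes and winners tends to be spam."""
    lowered = text.lower()
    return SPAM if "prize" in lowered or "winner" in lowered else ABSTAIN


def lf_short_reply(text):
    """Heuristic: very short messages tend to be ordinary replies."""
    return HAM if len(text.split()) <= 4 else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_link, lf_mentions_prize, lf_short_reply]


def majority_label(text, lfs=LABELING_FUNCTIONS):
    """Combine noisy votes; abstain when nobody votes or the vote is tied."""
    votes = [lf(text) for lf in lfs]
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


messages = [
    "You are a winner! Claim your prize at http://example.com now",
    "ok see you",
    "Thanks, see you tomorrow at the usual place then",
]
labels = [majority_label(m) for m in messages]
print(labels)
```

Each function is allowed to be wrong and allowed to abstain. Improving the dataset then means editing or adding functions like these and regenerating the labels, instead of relabeling by hand.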

The end result of such a system is that when we know better heuristics or when we need to make new labels for, say, a named-entity-recognition dataset, we work on coding the labeling functions and rerun Snorkel to produce the dataset for further ML work. Snorkel, therefore, brings ML within reach of teams that may have considerable domain knowledge but not necessarily the budget or expertise to build and maintain the high quality datasets required to train ML models.

See the material published as part of the recent Snorkel workshop to get a sense of what's coming up in the v0.9 rewrite.

The upcoming Snorkel 0.9 version adds "transformation functions" and "slicing functions" to the repertoire, and further generalizes the data programming pipeline.

One question that arises in this case is that if we know how to code up these labeling functions, what would we need the ML models for? The answer lies in the fact that these labeling functions are not expected to be perfect, and that relieves a large constraint on the development of such intelligent systems through requirement capture. We need enough labeling functions to cover the ground, but we don't need them to be perfect even in the aggregate. We can also mix labeling functions with available "gold standard" labels. Snorkel will train a generative model that captures which labeling functions to rely on, and the output of that model is then used to train a discriminative model that produces the actual labels.
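The idea of learning which labeling functions to rely on can be sketched with a much simpler stand-in than Snorkel's generative model. In the sketch below I estimate each labeling function's accuracy from a handful of gold standard labels and then weight its votes by the log-odds of that accuracy. This is an assumption-laden simplification: as I understand it, Snorkel's model estimates reliabilities from the agreements and disagreements among the functions and does not need gold labels. The vote matrix, the gold labels and all names here are invented for the example.

```python
import math

ABSTAIN = -1  # hypothetical marker for "no vote"; real labels are 0 and 1


def estimate_accuracies(votes, gold):
    """votes[i][j] is function j's vote on item i; gold maps item index to truth."""
    accuracies = []
    for j in range(len(votes[0])):
        correct, seen = 1.0, 2.0  # smoothing: start from an accuracy of 0.5
        for i, truth in gold.items():
            if votes[i][j] != ABSTAIN:
                seen += 1.0
                correct += 1.0 if votes[i][j] == truth else 0.0
        accuracies.append(correct / seen)
    return accuracies


def weighted_label(row, accuracies):
    """Weight each vote by the log-odds of its function being right."""
    score = 0.0
    for vote, acc in zip(row, accuracies):
        if vote == ABSTAIN:
            continue
        weight = math.log(acc / (1.0 - acc))  # negative if usually wrong
        score += weight if vote == 1 else -weight
    if score == 0.0:
        return ABSTAIN
    return 1 if score > 0 else 0


# Rows are items, columns are three labeling functions.
votes = [
    [1, 1, 0],
    [0, 0, 1],
    [1, ABSTAIN, 0],
    [0, 1, 1],
    [1, 0, 0],               # unlabeled item
    [ABSTAIN, ABSTAIN, 1],   # unlabeled item
]
gold = {0: 1, 1: 0, 2: 1, 3: 0}  # the few items with gold standard labels

accuracies = estimate_accuracies(votes, gold)
new_labels = [weighted_label(votes[i], accuracies) for i in (4, 5)]
print(accuracies, new_labels)
```

In this made-up data the third function is wrong on every gold item, so its votes end up counting against the label it proposes. A plain majority vote on the fifth row would pick 0, while the weighted vote picks 1. The labels produced this way would then serve as training data for the discriminative model.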

Related article (PDF) - Software 2.0 and Snorkel: Beyond Hand-Labeled Data