Quite often these days, one gets the impression that the big data/data science/artificial intelligence hype train is gathering steam exponentially. If truth be told, though, I don’t think it is fair to lump these related but distinct fields into a single chunk and dismiss them as just another ride along Gartner’s fabled “hype cycle” (I am paraphrasing very generously here, so it’s better to Google this one for specifics). Doing so ignores the immense potential for doing good that all of these fields possess.
While hacking away on data science/machine learning projects, it is easy to get overwhelmed by the sheer amount of learning material out there and get pigeonholed into a specific algorithm or learning technique (or, in a lame pun that users of random forest algorithms would roll their eyes at: to miss the forest while searching through the trees). I had a strong feeling that my own experience in data science and machine learning was not helping me find the implicit connections between the different subtopics, or get a thorough but still meta-abstract (no idea whether this word exists or not, but oh well) grasp of them. I was therefore looking for a book which, while not exactly pop-sci, would give an overview of the different techniques in machine learning, along with a theoretical/philosophical understanding of the different schools of thought and how these paradigms interact with one another. Coincidentally, Bill Gates, arguably the nerdiest ex-founder/current thought leader out there, recommended The Master Algorithm by Pedro Domingos as one of the books for getting a better understanding of machine learning.
Domingos runs a fairly popular MOOC about machine learning on Coursera, and is a frequent speaker on the topic on the conference/speaking circuit. In The Master Algorithm, he takes the reader through the five main schools of thought in machine learning, the philosophical arguments that underpin the thought process in each of these schools, the ideological clashes that the different camps have with each other, cases in which each technique has shown promising results, instances where the results have not been that promising, and finally, ways in which the different schools of thought can be combined to come up with a “master algorithm”: a learner that can predict the past, present and future of the universe. That is the major theme that runs throughout the book, namely that the ultimate algorithm lies at the intersection of the different ideological camps in machine learning, rather than with one camp hoping to take all the spoils home.
According to Domingos, the five main ideological camps in machine learning are as follows:
The symbolists believe that learning is exactly as sophisticated as deriving a series of logical rules that follow from one another. In their worldview, learning is the inverse of deduction, i.e. you look at the results, and then infer the logical rules that caused those results. While symbolism holds up strongly to the traditional rigour of logical, philosophical and inductive reasoning, there is an inherent problem in using only symbolism to build large prediction systems. In simple use cases, logical reasoning can hold its ground, but under an influx of variables and dependencies, the ground immediately turns to quicksand. Even something that sounds as simple in today’s world as a voter prediction system, one that uses only inductive reasoning to predict which of the two evils (i.e. Donald or Hillary, and yes, I am still #feelingthebern) a rational voter would go for, becomes too complex and cumbersome to be used in real time by a large number of users. Therefore, a series of general, logic-based induction rules is probably not the sole route to the Grand Algorithm.
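To make “look at the results, then infer the rules” a bit more concrete, here is a minimal sketch of my own (not something taken from the book). It uses Find-S, a classic textbook rule learner that keeps only the attribute values shared by every positive example. The voter attributes and the three example records are entirely made up for illustration.

```python
def find_s(examples):
    """Return the most specific conjunctive rule covering all positive examples.

    `examples` is a list of (attributes, label) pairs, where `attributes`
    is a dict and `label` is True for positive examples.
    Returns None if there are no positive examples.
    """
    rule = None
    for attributes, label in examples:
        if not label:
            continue  # Find-S ignores negative examples
        if rule is None:
            rule = dict(attributes)  # start from the first positive example
        else:
            # Drop every condition that this positive example contradicts
            rule = {k: v for k, v in rule.items() if attributes.get(k) == v}
    return rule


def matches(rule, attributes):
    """True if the record satisfies every condition in the rule."""
    return all(attributes.get(k) == v for k, v in rule.items())


# Hypothetical records: did this person turn out to vote?
examples = [
    ({"age": "young", "registered": "yes", "urban": "yes"}, True),
    ({"age": "old", "registered": "yes", "urban": "yes"}, True),
    ({"age": "young", "registered": "no", "urban": "yes"}, False),
]

rule = find_s(examples)
print(rule)
```

With three attributes this is trivial, but the number of candidate rules grows very quickly as attributes and their interactions are added, which is the quicksand mentioned above.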
The beauty of the human brain and its seemingly effortless functionality has captivated generations of humans. Despite our best efforts, we are still at the very least decades away from unlocking all the mysteries of the brain’s functions. It is therefore not surprising that a large cohort of machine learning scientists are firm proponents of modelling learning algorithms along the lines of how the human brain learns. Neural networks, which are currently among the most popular machine learning algorithms, are based on a simplified idea of how the human brain works: namely through dense layers of hidden “neurons” that take input from different sensors or other neurons, and generate an output that is then read by other neurons that perform the same function. The output is propagated through a series of these “hidden layers” of neurons until the system generates a final output. Although the mapping of the brain’s functions has come a long way from the early days, when it was believed that the brain generates output only through a series of simple perceptrons, the connectionists cannot lay claim to having decoded the Master Algorithm yet. Connectionists have had immense success in a lot of instances, yet modelling the human brain completely is a tremendously difficult task. Just as there are functions of the brain that neuroscientists still have no idea about, complex neural networks often end up becoming black boxes of their own, with their human trainers unable to decipher what exactly is going on in the “hidden layers” of their machine creations. Thus, quite literally, we often do not know whether our android servants are really dreaming of electric sheep or not.
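As a rough illustration of output being passed through a hidden layer, here is a minimal forward pass in plain Python. The weights are hand-picked by me (hypothetical, not trained and not from the book) so that the tiny network computes XOR, a function that a single perceptron cannot represent but one hidden layer can.

```python
import math


def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def layer_forward(inputs, weights, biases):
    """One layer: each neuron takes a weighted sum of its inputs, then squashes it."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Propagate the inputs through every layer in turn."""
    activation = inputs
    for weights, biases in layers:
        activation = layer_forward(activation, weights, biases)
    return activation


# Hand-picked (hypothetical) weights: 2 inputs -> 2 hidden neurons -> 1 output.
layers = [
    # Hidden layer: first neuron behaves like OR, second like NAND
    ([[20.0, 20.0], [-20.0, -20.0]], [-10.0, 30.0]),
    # Output layer: behaves like AND of the two hidden neurons
    ([[20.0, 20.0]], [-30.0]),
]

for a in (0, 1):
    for b in (0, 1):
        output = forward([a, b], layers)[0]
        print(a, b, round(output))
```

Even here, nothing in the raw numbers says “OR” or “NAND”; I only know because I chose them. In a trained network with thousands of neurons, that interpretation is exactly what goes missing.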
It is a well-accepted fact that evolution is one of nature’s crowning achievements: a seemingly simple, yet highly effective process which ensures that species adapt over time, pass on their most valuable traits to their offspring and do not get obliterated by environmental conditions. Proponents of evolutionary methods in machine learning follow the same idea in principle. Rather than mimicking how the human brain functions or pinning learning down through a set of intricate, logical rules, the evolutionaries believe that we need to take a few steps back in our approach and address the core issue: how exactly does evolution result in the development of a complex structure like the human brain, which not only has the propensity to nurture itself and learn new things, but also inherits several traits by nature (yet another incarnation of the nature vs nurture debate)? The algorithm is divided into a set of sub-programs, which are allowed to mutate and recombine with other sub-programs to see whether, over multiple iterations, the results of the prediction improve. Genetic programming is the bread and butter of the evolutionaries, and similar to how evolution ensures “survival of the fittest”, a fitness function scores each candidate so that only the better predictors are carried into subsequent iterations. When further mutations no longer improve the fitness score, the program stops mutating.
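The mutate, recombine, score and stop loop can be sketched in a few lines. Note the simplification: this is a plain genetic algorithm over bit strings, a simpler cousin of genetic programming (which evolves whole program trees). The toy fitness function (count the 1-bits), the population size, the mutation rate and the `patience` stopping rule are all my own assumptions.

```python
import random


def fitness(candidate):
    """Toy fitness function: the number of 1-bits in the candidate."""
    return sum(candidate)


def mutate(candidate, rng, rate=0.05):
    """Flip each bit independently with a small probability."""
    return [1 - bit if rng.random() < rate else bit for bit in candidate]


def crossover(a, b, rng):
    """Recombine two parents at a random cut point."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def evolve(length=20, pop_size=30, patience=10, seed=0):
    """Evolve bit strings until `patience` generations pass without improvement."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    best = max(population, key=fitness)
    history = [fitness(best)]
    stale = 0
    while stale < patience:
        # Selection: only the fitter half of the population gets to reproduce
        parents = sorted(population, key=fitness, reverse=True)[: pop_size // 2]
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - 1)
        ]
        population = [best] + children  # elitism: the best so far always survives
        challenger = max(population, key=fitness)
        if fitness(challenger) > fitness(best):
            best, stale = challenger, 0
        else:
            stale += 1
        history.append(fitness(best))
    return best, history


best, history = evolve()
print(fitness(best), len(history))
```

Because the best candidate is always carried over, the recorded fitness never goes down from one generation to the next; the loop simply gives up once it stops going up.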
Genetic programming works surprisingly well, both in principle as well as in practice, in combination with neural networks. The workflow is as follows:
The fitness function runs through several iterations of the program until a general structure of the program is created.
Neural networks are then run on this overall program structure to train this “brain” to learn.
However, the Achilles’ heel of genetic programming is one head of the proverbial complexity monster. For dense problems, continuously iterating over the sub-programs can be a time- and memory-intensive process. Moreover, in its quest to improve performance at every step of the iteration, the program often gets stuck in a local optimum, i.e. it misses the path to the global optimum because it insists on the immediate step that yields a better result, rather than going through iterations that give lower performance in the short term but substantially better performance in the long term.
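The local optimum trap is easy to reproduce with the simplest possible “always take the better immediate step” search. The landscape below is a made-up function of my own with a small peak at x = 4 and a much taller one at x = 15; greedy hill climbing (a stand-in here for any search that refuses short-term losses) ends up on whichever peak is nearest to its starting point.

```python
def landscape(x):
    """Hypothetical fitness landscape: a low peak at x=4 and a tall peak at x=15."""
    return max(5 - abs(x - 4), 12 - 2 * abs(x - 15))


def hill_climb(start, lo=0, hi=20):
    """Greedy search: move to the better neighbour until no neighbour is better."""
    x = start
    while True:
        neighbours = [n for n in (x - 1, x + 1) if lo <= n <= hi]
        best = max(neighbours, key=landscape)
        if landscape(best) <= landscape(x):
            return x  # no immediate improvement available, so stop here
        x = best


stuck = hill_climb(0)    # starts on the slope of the low peak
lucky = hill_climb(12)   # starts on the slope of the tall peak
print(stuck, landscape(stuck))
print(lucky, landscape(lucky))
```

Starting from 0, the search would have to walk downhill through the valley between the peaks to reach the taller one, and a purely greedy search never accepts that short-term loss.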
Bayes’ Theorem is perhaps the most popular, and at the same time, most ridiculously easy to understand concept in statistics, and can be boiled down to the following:
P(A|B) = P(A) * P(B|A) / P(B)
Behind this chain of probabilities is an even simpler heuristic, and one which we often use in our lives without explicitly thinking about it: we assign prior probabilities to events that we think will occur, and as we observe more evidence, we revise those probabilities up or down. For instance, let us consider Leicester City’s title win in the English Premier League this season. At the start of the season, the prior probability of Leicester winning the league would have been horrendously low, given their past performance in the league. However, as Leicester continued their dream run throughout the season, the evidence, which in this case was the wins and draws that they achieved and the losses and draws that their opponents collected, gradually raised the updated (posterior) probability of Leicester achieving the ultimate underdog dream, until the odds of them winning could no longer be ignored.
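The Leicester story can be turned into a small worked update. All the numbers here are invented for illustration: a prior of 0.0002 for “Leicester are genuine title winners”, and the assumption that a win is twice as likely if they are (0.7) than if they are not (0.35). Each win is fed through Bayes’ Theorem, and the posterior from one match becomes the prior for the next.

```python
def bayes_update(prior, p_evidence_if_true, p_evidence_if_false):
    """Return P(hypothesis | evidence) using Bayes' Theorem.

    P(A|B) = P(A) * P(B|A) / P(B), where P(B) is expanded over the two
    cases "hypothesis true" and "hypothesis false".
    """
    numerator = prior * p_evidence_if_true
    evidence = numerator + (1 - prior) * p_evidence_if_false
    return numerator / evidence


belief = 0.0002          # made-up prior at the start of the season
beliefs = [belief]
for match in range(10):  # ten wins in a row, purely hypothetical
    belief = bayes_update(belief, p_evidence_if_true=0.7, p_evidence_if_false=0.35)
    beliefs.append(belief)

print(beliefs[0], beliefs[-1])
```

Under these assumptions each win doubles the odds, so ten wins multiply them by 1024: a belief that started at two in ten thousand ends up at roughly one in six.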
Bayes’ Theorem might sound too simple to be of much practical use in the seemingly complicated world of training machines to learn on their own, but several algorithms built on probabilistic reasoning of this kind, such as Naive Bayes, Markov chains and hidden Markov models, have been used in some truly complex problems, such as word prediction and search engine ranking. Nevertheless, the Bayesians also cannot lay claim to the Master Algorithm. In spite of the simplicity and beauty of Bayes’ Theorem, it often fails spectacularly in instances where a single new example forces it to radically change its prior probabilities. Spam filters are a good example of where Bayes’ Theorem often fails.
The analogisers use what is perhaps the most intuitive way of learning for humans: using analogies and lessons from other fields to make decisions. While this may sound easy for humans, training machines on one set of instructions and then translating that learning into meta-knowledge that can be applied across different domains is a ridiculously tough task. Analogisers use similarity-based algorithms, ranging from unsupervised methods such as clustering (k-means, for example) and principal component analysis to support vector machines (which are trained on labelled data), to infer patterns from information, and then use those patterns to make decisions on new data.
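The simplest way I know to show “decide on new data by analogy with what you have seen” is a nearest-neighbour classifier, which is my choice of illustration rather than one of the algorithms listed above. The points and labels below are made up; a new point simply takes the majority label of the `k` stored examples closest to it.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Label `query` with the majority label of its k nearest training points.

    `train` is a list of (point, label) pairs; points are tuples of numbers.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training data: two well-separated groups of points
train = [
    ((1.0, 1.0), "a"),
    ((1.2, 0.8), "a"),
    ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"),
    ((5.2, 4.9), "b"),
    ((4.8, 5.1), "b"),
]

print(knn_predict(train, (1.1, 1.0)))
print(knn_predict(train, (5.0, 4.8)))
```

There is no training step at all here: the “model” is just the stored examples plus a notion of distance, which is also why the choice of distance measure matters so much for this camp.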
Although The Master Algorithm is a very interesting read, I did feel at times that Domingos perhaps overestimated the power of predictive models in chaotic systems, such as the economy. At more than one point in the book, he directly referenced Nassim Nicholas Taleb’s idea of “the black swan” (a never-seen-before event that all prediction gurus ignore or miss, often at great peril to society), and gave vague examples of how machine learning can indeed be used to predict these black swans as well. There are still loads of empty promises being made about just how transparent a crystal ball machine learning and data science are in foreseeing human behaviour in different settings. My own understanding is that although machine learning can be immensely effective at prediction in a wide array of fields, ranging from medical diagnostics and predictive policing to measuring the effectiveness of policy initiatives in the development sector (my personal favourite), there is still a long way to go before it can be used to accurately predict systems that depend on a large number of irrational beings (aka humans) interacting with each other. Until we reach that point, it is perhaps better to listen to Taleb and the other critics who question the claims currently made for machine learning. Nevertheless, hype notwithstanding, machine learning and data science shall easily (hopefully?) remain hot topics for some of the most exciting research and discoveries for the next few decades.