PhD Sample Background: Historical development of data science and machine learning methods
Chapter 1: Introduction
1.1. Background
The field of data science and machine learning has seen remarkable progress in recent years, from solving simple mathematical problems to offering solutions for complex computational systems. These techniques are widely used to learn from data, make predictions and generate insights across multiple sectors. In the computing context, the terms data science and machine learning are often used interchangeably. Although the two are connected, there are significant differences in their uses and applications. Data science is the process of developing systems that collect and analyse large sets of information to generate insights (Donoho, 2017). These insights offer useful solutions to complex business challenges and real-world problems.
On the other hand, machine learning is the process of designing self-running software. It uses artificial intelligence to train computers to learn and make decisions. Moreover, machine learning is widely used to develop technology that can automatically improve through experience (Jordan and Mitchell, 2015). The adoption of data science and machine learning can be found in fields such as science, technology, finance, commerce, healthcare, hospitality, education, manufacturing and marketing for making evidence-based and data-driven decisions (Nosratabadi et al., 2020).
While the common applications and advantages of data science and machine learning are covered in recent literature (Çelik, 2018; Sarker et al., 2020), very few studies have highlighted the historical evolution of these technologies. This chapter focuses on the historical development of data science and machine learning methods, highlighting the progression from early statistical techniques to contemporary advanced machine learning models. It explores the methodological, computational and theoretical developments that enable machines to learn from and interpret complex datasets. Examining this trajectory can be a useful step in understanding the technical evolution of the field along with its impact on data identification, processing and analysis in different sectors.
Pre-20th Century: Foundations of Probability and Statistics
Today, data science and machine learning place great emphasis on computing and coding. However, the foundations of these methods can be traced back to early developments in probability and statistics. These early mathematical innovations play a key role in the foundation of modern statistical learning methods and predictive modelling approaches. For example, probability underpins models that quantify the uncertainty involved in solving complex problems. Similarly, statistics is used to describe the behaviour of different predictors derived from large sets of data (Braga-Neto and Dougherty, 2020).
- Early Probability Theory (1654)
The word probability derives from the Latin ‘probo’, the root it shares with the English words ‘probe’ and ‘probable’. In mathematics, the word carries a meaning close to plausibility. Early on, the concept of probability was chiefly associated with problems of gambling, dealing with the winning or losing of a game (Debnath and Basu, 2015). However, the birth of classical probability theory is generally dated to 1654, when its systematic study began. At that time, Blaise Pascal and Pierre de Fermat started a lively correspondence to discuss questions and problems dealing with games of chance, arrangements of objects and the chance of winning a fair game. Their exchange led to the development of fundamental probability concepts, including the mean or expected value and conditional probability (Debnath and Basu, 2015).
Furthermore, Çelik (2018) highlighted the use of probability concepts in the first mathematical treatments of uncertainty, which now act as a significant foundation for methods associated with machine learning and statistical inference. Similarly, Tyagi et al. (2022) observed the application of probability theory in modern machine learning algorithms developed to support new learning tasks and to optimise complex problems.
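The Pascal–Fermat correspondence included the classic ‘problem of points’: dividing the stakes of an interrupted game in proportion to each player’s chance of winning. Their expected-value argument can be sketched in modern code (a reconstruction for illustration only; the function name and example scores are not taken from the historical correspondence):

```python
from fractions import Fraction

def fair_share(wins_needed_a, wins_needed_b, p=Fraction(1, 2)):
    """Player A's fair share of the stakes in an interrupted game,
    following the Pascal-Fermat expected-value argument: A needs
    `wins_needed_a` more rounds, B needs `wins_needed_b`, and A wins
    each round with probability p."""
    if wins_needed_a == 0:
        return Fraction(1)   # A has already won the whole stake
    if wins_needed_b == 0:
        return Fraction(0)   # B has already won
    # Expected value over the next round: A takes it with probability p
    return p * fair_share(wins_needed_a - 1, wins_needed_b, p) \
        + (1 - p) * fair_share(wins_needed_a, wins_needed_b - 1, p)

# Classic case: A needs 2 more wins, B needs 3
print(fair_share(2, 3))  # 11/16
```

With player A needing two more wins and player B three, the recursion reproduces the classical answer that A is owed 11/16 of the stakes, a direct application of the expected-value concept the two mathematicians developed.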
- Bayes’ Theorem (1763)
Today, Bayesian approaches in data science and machine learning provide a framework for integrating information into computing processes: a prior distribution over the quantities of interest is updated with observed data using Bayes’ theorem. The theorem came to light in 1763 with the posthumous publication of Thomas Bayes’ essay, “An Essay towards solving a Problem in the Doctrine of Chances”. Its introduction completely changed the way statisticians approached probability and inference (Sisson et al., 2018). Furthermore, Bharadiya (2023) noted that Bayes’ theorem can be applied in computing processes to combine existing knowledge with new data and make more accurate predictions. Bayes’ theorem describes the probability of an event occurring under different conditions (Reddy et al., 2022). It uses the language of probability to express uncertainty about the phenomena that generate observed data and is usually stated in terms of conditional probability (Martin et al., 2020).
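As a small illustration of how the theorem combines prior knowledge with new evidence, consider a hypothetical screening test (the prevalence and error rates below are invented for the example, not taken from the cited sources):

```python
def posterior(prior, likelihood, false_positive_rate):
    """Bayes' theorem: P(H|E) = P(E|H) P(H) / P(E), with the evidence
    P(E) expanded over both hypotheses (H and not-H)."""
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# Illustrative numbers: a condition with 1% prevalence, a test that
# detects it 95% of the time and false-alarms 5% of the time.
p = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(round(p, 3))  # 0.161
```

Even with a 95% detection rate, the posterior probability is only about 16%, because the prior probability of the condition is low; this interplay of prior knowledge and observed data is precisely what Bayesian machine learning methods exploit.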
- Method of Least Squares (1805)
Although the method of least squares was first published by Adrien-Marie Legendre in 1805, Carl Friedrich Gauss claimed to have developed it independently and applied it in astronomy and geodesy (Sarstedt et al., 2021). The introduction of this method marked the beginning of regression analysis and computational statistics (Ruggins, 2015). In recent years, the method of least squares has been used not only to analyse statistical data in economics and the social sciences but also in complex statistical methods in econometrics. The least squares method is a form of mathematical regression analysis used to identify the line of best fit for a given set of data. It provides a visual representation of the relationship between known independent variables and an unknown dependent variable across data points (Ruggins, 2015). In simpler terms, this method helps in predicting the behaviour of the dependent variable and is widely used by analysts to make predictions, identify new opportunities and define data trends.
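For a single predictor, the line-of-best-fit calculation described above has a simple closed form: the slope that minimises the sum of squared residuals is the covariance of x and y divided by the variance of x, and the intercept follows from the means. A minimal sketch (illustrative variable names, pure Python for clarity) is:

```python
def least_squares_fit(xs, ys):
    """Closed-form simple linear regression: choose intercept a and
    slope b minimising sum((y - (a + b*x))**2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # b = cov(x, y) / var(x); a = mean_y - b * mean_x
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

# Points lying exactly on y = 1 + 2x recover those coefficients
a, b = least_squares_fit([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)  # 1.0 2.0
```

The same least-squares principle, in matrix form, underlies the multivariate regression models used throughout econometrics and modern statistical learning.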
Early Statistical Methods
The early 20th century marked a transformative period in statistical methodology. It led to the development of many fundamental techniques that act as a foundation for machine learning and data science methods. During this time, there was an emergence of sophisticated mathematical frameworks for data analysis and computational statistical methods.
- Principal Component Analysis (1901)
In 1901, Pearson introduced the concept of principal component analysis. Initially formulated as a geometric optimisation problem, this method provided a mathematical framework for reducing high-dimensional data while preserving its essential characteristics. The technique emerged from Pearson’s work on finding the lines and planes of closest fit to systems of points in space (Bro and Smilde, 2014). In data science, principal component analysis is widely used as a statistical procedure to summarise the information in large data tables into a smaller set of summary indices. In other words, when combined with computational processes, principal component analysis enables better visualisation of data (Gewers et al., 2021).
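Pearson’s idea of fitting the closest subspace to a cloud of points corresponds, in modern terms, to an eigendecomposition of the data’s covariance matrix. A minimal sketch (illustrative toy data; NumPy assumed available) is:

```python
import numpy as np

def pca(X, n_components):
    """Principal component analysis via eigendecomposition of the
    covariance matrix: centre the data, find orthogonal directions of
    maximal variance, and project onto the leading ones."""
    X_centred = X - X.mean(axis=0)
    cov = np.cov(X_centred, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]         # largest variance first
    components = eigvecs[:, order[:n_components]]
    return X_centred @ components

# Strongly correlated 2-D points: one component captures most of the variance
X = np.array([[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.9]])
print(pca(X, 1).shape)  # (4, 1)
```

Each column of the result holds the scores on one principal component; keeping only the leading components reduces dimensionality while preserving as much variance as possible, which is exactly the summarisation role described above.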
- Linear Discriminant Analysis (1936)
The applications of Linear Discriminant Analysis can be commonly seen in the fields of statistics and machine learning. It is an important technique used by data scientists to optimise large machine learning models. Introduced by Ronald Fisher in 1936, Linear Discriminant Analysis offers a systematic approach to classification problems (Filzmoser et al., 2018). This method established the concept of discriminant functions and demonstrated how to optimise the separation between different classes of observations. In supervised machine learning, it is commonly used to solve multi-class classification problems by separating multiple classes with multiple features through data dimensionality reduction (Obi, 2017).
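For two classes, Fisher’s criterion has a closed-form solution: project onto the direction w that solves Sw w = (mu1 - mu0), where Sw is the within-class scatter matrix, maximising between-class separation relative to within-class spread. A minimal two-class sketch (toy data; NumPy assumed available) is:

```python
import numpy as np

def fisher_direction(X0, X1):
    """Two-class Fisher linear discriminant: solve Sw w = (mu1 - mu0),
    where Sw is the within-class scatter matrix (a sketch of the 1936
    formulation)."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter: summed (unnormalised) covariance of each class
    Sw = np.cov(X0, rowvar=False) * (len(X0) - 1) \
        + np.cov(X1, rowvar=False) * (len(X1) - 1)
    return np.linalg.solve(Sw, mu1 - mu0)

X0 = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0]])
X1 = np.array([[6.0, 5.0], [7.0, 8.0], [8.0, 7.0]])
w = fisher_direction(X0, X1)
# Projections of the two classes onto w are fully separated
print((X0 @ w).max() < (X1 @ w).min())  # True
```

Projecting onto w reduces the two-feature data to a single discriminant score, illustrating how the method combines classification with dimensionality reduction.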
Beginning of Artificial Intelligence and Neural Networks
During the early 1950s, statistical methods such as multiple regression analysis and nonlinear regression methods came into wide use. These methods are considered critically important statistical analyses and are widely used to define the relationships between variables of interest in computational systems (Jarantow et al., 2023). Alongside these statistical methods, the first artificial intelligence and neural network ideas emerged. For example, Alan Turing’s seminal paper “Computing Machinery and Intelligence” introduced the Turing Test, which fundamentally changed how people think about machine intelligence (Gonçalves, 2023). The paper examined the principal objections to the idea of machine intelligence and connected digital computers to questions of thinking and consciousness. The development of these methods gained further momentum when Frank Rosenblatt presented the perceptron in 1957, the first implementation of an artificial neural network capable of learning from data (Rosenblatt, 2021). This approach demonstrated that machines could learn to perform pattern recognition tasks through an iterative training process.
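Rosenblatt’s learning procedure adjusts the weights only when an example is misclassified, nudging the decision boundary towards that example. The iterative rule can be sketched as follows (toy data; the learning rate and epoch count are illustrative choices, not Rosenblatt’s original parameters):

```python
def train_perceptron(samples, labels, epochs=20, lr=1.0):
    """Rosenblatt-style perceptron: shift the weights towards each
    misclassified example until the data are separated (labels +1/-1)."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:            # misclassified: update
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

# Linearly separable toy task: learn the logical AND of two inputs
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [-1, -1, -1, 1]
w, b = train_perceptron(X, y)
preds = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1 for x in X]
print(preds)  # [-1, -1, -1, 1]
```

On linearly separable data such as the AND task above, the perceptron convergence theorem guarantees that this loop eventually stops making updates, which is the iterative pattern-recognition training the paragraph describes.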
To support the training process of machines, various decision tree algorithms were presented. Decision trees, which support prediction and classification, were among the first statistical algorithms and played a key role in the implementation of algorithms on electronic computers in the later decades of the 20th century (De Ville, 2013). This was followed by a transformative period for machine learning as several breakthrough methodologies emerged. For example, the support vector machine concept was introduced by Vapnik in 1979 (Rodríguez-Pérez and Bajorath, 2022). While the approach was originally designed for binary classification, it was later adapted for the prediction of numerical values. Similarly, in 1995, Tin Kam Ho proposed the random decision forest, which constructs a classifier from an ensemble of decision trees (Sun et al., 2024).
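Ho’s random decision forest combines many tree classifiers trained on resampled data and takes a majority vote. The idea can be sketched with one-level trees, or ‘stumps’ (an illustrative simplification: real random forests grow deeper trees and also randomise the feature considered at each split):

```python
import random

def best_stump(samples, labels):
    """Fit a one-level decision tree (stump): the feature/threshold split
    with the fewest training errors, labels being +1/-1."""
    best = None
    for f in range(len(samples[0])):
        for threshold in sorted({x[f] for x in samples}):
            for left, right in ((-1, 1), (1, -1)):
                preds = [left if x[f] <= threshold else right for x in samples]
                errors = sum(p != t for p, t in zip(preds, labels))
                if best is None or errors < best[0]:
                    best = (errors, f, threshold, left, right)
    return best[1:]

def forest_predict(stumps, x):
    """Majority vote over stumps - the core of a random decision forest."""
    votes = sum(l if x[f] <= t else r for f, t, l, r in stumps)
    return 1 if votes > 0 else -1

random.seed(0)
X = [(1.0,), (2.0,), (8.0,), (9.0,)]
y = [-1, -1, 1, 1]
# Bagging: each stump sees a bootstrap resample of the training data
stumps = []
for _ in range(5):
    idx = [random.randrange(len(X)) for _ in X]
    stumps.append(best_stump([X[i] for i in idx], [y[i] for i in idx]))
print([forest_predict(stumps, x) for x in X])
```

Averaging over many resampled trees reduces the variance of any single tree, which is why the ensemble generally outperforms its individual members.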
The Era of Data Science and Big Data
With the rise of the internet and digital data, the first decade of the new millennium saw multiple technological innovations. During this time, several big data concepts emerged and their potential advantages became visible in the computing sector (Larson and Chang, 2016). This marked a shift in the handling of large data sets. Earlier methods of data processing relied on extract, transform and load (ETL) processes to move information into data sets. However, owing to their limited scalability, these methods were not efficient in handling large volumes of data. The emergence of big data technologies made it possible to handle and process both structured and unstructured data using dedicated data handling tools (Zhang et al., 2021).
By 2005, members of the statistics community had presented strategies to incorporate data science into their field. This led to the development of data science as an independent field in computation. While classical theories were concerned with understanding data practices, modern definitions gave data science a newer meaning, representing it as a combination of computing competency, data mining, statistical knowledge, communication and visualisation skills and business acumen (Shah, 2023). Furthermore, Laney’s (2001) characterisation of data as having high volume, velocity and variety closely associated data science with big data (Kitchin and McArdle, 2016).
Emergence of Deep Learning and Artificial Intelligence
The rising use of computers and smartphones led to the fading out of classical approaches and motivated researchers to dive deeper into the field of data science. A major breakthrough for Deep Learning came in the early 2010s, when large amounts of information were incorporated into the training of Artificial Intelligence models, also referred to as machine learning, intelligent systems or knowledge-based systems (Khan et al., 2021). This further drove the evolution of Artificial Intelligence from a simple problem solver to deep learning by addressing increasingly complex application domains. However, the first wave of Artificial Intelligence encountered multiple technological barriers that tempered the expectations of researchers.
With the evolution of the market for personal computers, Artificial Intelligence shifted to knowledge-based systems programmed using symbolic programming languages such as LISP or Prolog (He et al., 2015). Further increases in computing power and the development of sophisticated mathematical modelling tools brought a new wave of Artificial Intelligence in the form of machine learning (Khan et al., 2021). This new approach not only addressed complex problems but also enabled computers to make autonomous decisions based on previous learning and the scenarios at hand. As researchers delved deeper into networks with more parameters, layers and operations, technologies such as natural language processing emerged (Hirschberg and Manning, 2015). The most recent phase of machine learning has been characterised by large language models and generative Artificial Intelligence. Models like GPT-3, BERT and their successors have demonstrated unprecedented capabilities in natural language processing and generation (Goossens et al., 2023).
Data science and machine learning are among the most rapidly growing technical fields. What started as an extension of computer science and statistics can today, with the integration of artificial intelligence, be seen in a wide range of applications across different fields. With the rapid deployment of artificial intelligence, concerns regarding information safety have also grown. Additionally, issues such as artificial intelligence ethics, interpretability and bias need to be addressed to ensure data safety and security. Thus, understanding the foundations of key computational theories is important for developing stronger strategies to overcome any issues caused by the rapid evolution of the field of computer science.
References
Bharadiya, J. P. (2023). A review of Bayesian machine learning principles, methods, and applications. International Journal of Innovative Science and Research Technology, 8(5), 2033-2038.
Braga-Neto, U. M., & Dougherty, E. R. (2020). Machine Learning Requires Probability and Statistics [Perspectives]. IEEE Signal Processing Magazine, 37(4), 118-122.
Bro, R., & Smilde, A. K. (2014). Principal component analysis. Analytical methods, 6(9), 2812-2831.
Çelik, Ö. (2018). A research on machine learning methods and its applications. Journal of Educational Technology and Online Learning, 1(3), 25-40.
De Ville, B. (2013). Decision trees. Wiley Interdisciplinary Reviews: Computational Statistics, 5(6), 448-455.
Debnath, L., & Basu, K. (2015). A short history of probability theory and its applications. International Journal of Mathematical Education in Science and Technology, 46(1), 13-39.
Donoho, D. (2017). 50 years of data science. Journal of Computational and Graphical Statistics, 26(4), 745-766.
Filzmoser, P., Hron, K., Templ, M. (2018). Discriminant Analysis. In: Applied Compositional Data Analysis. Springer Series in Statistics. Springer, Cham.
Gewers, L., Ferreira, R., Arruda, D., Silva, N., Comin, H., Amancio, R., & Costa, F. (2021). Principal component analysis: A natural approach to data exploration. ACM Computing Surveys (CSUR), 54(4), 1-34.
Gonçalves, B. (2023). The Turing test is a thought experiment. Minds and Machines, 33(1), 1-31.
Goossens, A., De Smedt, J., & Vanthienen, J. (2023, October). Comparing the performance of GPT-3 with BERT for decision requirements modeling. In International Conference on Cooperative Information Systems (pp. 448-458). Cham: Springer Nature Switzerland.
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision (pp. 1026-1034).
Hirschberg, J., & Manning, C. D. (2015). Advances in natural language processing. Science, 349(6245), 261-266.
Jarantow, W., Pisors, D., & Chiu, L. (2023). Introduction to the use of linear and nonlinear regression analysis in quantitative biological assays. Current Protocols, 3(6), e801.
Jordan, M. I., & Mitchell, T. M. (2015). Machine learning: Trends, perspectives, and prospects. Science, 349(6245), 255-260.
Khan, H., Pasha, M., & Masud, S. (2021). Advancements in microprocessor architecture for ubiquitous AI—An overview on history, evolution, and upcoming challenges in AI implementation. Micromachines, 12(6), 665.
Kitchin, R., & McArdle, G. (2016). What makes Big Data, Big Data? Exploring the ontological characteristics of 26 datasets. Big Data & Society, 3(1), 2053951716631130.
Larson, D., & Chang, V. (2016). A review and future direction of agile, business intelligence, analytics and data science. International Journal of Information Management, 36(5), 700-710.
Martin, M., Frazier, T., & Robert, P. (2020). Computing Bayes: Bayesian computation from 1763 to the 21st century. arXiv preprint arXiv:2004.06425.
Nosratabadi, S., Mosavi, A., Duan, P., Ghamisi, P., Filip, F., Band, S., & Gandomi, H. (2020). Data science in economics: comprehensive review of advanced machine learning and deep learning methods. Mathematics, 8(10), 1799.
Obi, J. C. (2017). A comparative study of the Fisher’s discriminant analysis and support vector machines. European Journal of Engineering and Technology Research, 2(8), 35-40.
Reddy, E. M. K., Gurrala, A., Hasitha, B., & Kumar, R. (2022). Introduction to Naive Bayes and a review on its subtypes with applications. Bayesian reasoning and gaussian processes for machine learning applications, 1-14.
Rodríguez-Pérez, R., & Bajorath, J. (2022). Evolution of support vector machine and regression modeling in chemoinformatics and drug discovery. Journal of Computer-Aided Molecular Design, 36(5), 355-362.
Rosenblatt, F. (2021). The Perceptron: A Probabilistic Model for Information Storage and Organization (1958).
Ruggins, S. M. (2015). A History of Econometrics: The Reformation from the 1970s. Journal of Cultural Economy, 9(2), 226–228.
Sarker, H., Kayes, M., Badsha, S., Alqahtani, H., Watters, P., & Ng, A. (2020). Cybersecurity data science: an overview from machine learning perspective. Journal of Big data, 7, 1-29.
Sarstedt, M., Ringle, M., & Hair, F. (2021). Partial least squares structural equation modeling. In Handbook of market research (pp. 587-632). Cham: Springer International Publishing.
Shah, C. (2023). The past, the present, and the future of information and data sciences: A pragmatic view. Data and Information Management, 7(1), 100028.
Sisson, A., Fan, Y., & Beaumont, M. (Eds.). (2018). Handbook of approximate Bayesian computation. CRC press.
Sun, Z., Wang, G., Li, P., Wang, H., Zhang, M., & Liang, X. (2024). An improved random forest based on the classification accuracy and correlation measurement of decision trees. Expert Systems with Applications, 237, 121549.
Tyagi, A., Kukreja, S., Meghna, N., & Tyagi, K. (2022). Machine learning: Past, present and future. Neuroquantology, 20(8), 4333.
Zhang, Z., Srivastava, R., Sharma, D., & Eachempati, P. (2021). Big data analytics and machine learning: A retrospective overview and bibliometric analysis. Expert Systems with Applications, 184, 115561.