Machine learning as a service using Apache Spark

Waled M Tayib, Tennessee State University


In the era of Big Data, machine learning has taken on a new role. With the amount of data now available around the world, predictions in machine learning applications have become much more accurate. Performing these computations, though, requires a different system architecture. Machine learning has gone from a single machine processing megabytes of data to a distributed cluster of machines processing terabytes and even petabytes of data. Many software systems implement this distributed model, but the two main open source frameworks in use today are Apache Hadoop and Apache Spark. For this project, we use Apache Spark. Spark, which was originally developed in the AMPLab at the University of California, Berkeley, is an in-memory big data processing framework for batch, real-time, and advanced modeling and analytics [1]. Even though these frameworks are used by developers and data scientists around the world, they are not easy to use, and they become even more complex once machine learning algorithms and other advanced analytic techniques are introduced. In this paper, we present a new service that follows a “Machine Learning as a Service” model. From the end users’ perspective, all they need to supply is the data they would like to classify or cluster, along with basic confidence values. In the back end, our system selects the most suitable machine learning algorithm and other performance tuning settings based on the user input. The system also decides the size of the cluster needed to execute the machine learning job.
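As a rough illustration of the service model described above, the following Python sketch maps a user request (task type, data size, and a confidence value) to an algorithm name and a worker count. It is not the thesis's actual selection logic: the names `JobRequest`, `JobPlan`, and `plan_job`, the 0.9 confidence threshold, the algorithm choices, and the 64 GB-per-worker sizing rule are all hypothetical assumptions made only for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class JobRequest:
    """What the end user supplies (hypothetical shape)."""
    task: str              # "classify" or "cluster"
    data_size_gb: float    # size of the uploaded data set
    min_confidence: float  # desired confidence, in [0, 1]


@dataclass
class JobPlan:
    """What the back end decides on the user's behalf (hypothetical shape)."""
    algorithm: str
    num_workers: int


# Assumption for illustration only: one worker per 64 GB of input data.
GB_PER_WORKER = 64.0


def plan_job(request: JobRequest) -> JobPlan:
    """Pick an algorithm and a cluster size from the user input.

    The rules below are placeholders, not the selection method of the thesis.
    """
    if request.task == "classify":
        # Placeholder rule: prefer an ensemble when high confidence is requested.
        if request.min_confidence >= 0.9:
            algorithm = "random_forest"
        else:
            algorithm = "logistic_regression"
    elif request.task == "cluster":
        algorithm = "k_means"
    else:
        raise ValueError(f"unsupported task: {request.task!r}")

    # Always allocate at least one worker, rounding up for partial loads.
    num_workers = max(1, math.ceil(request.data_size_gb / GB_PER_WORKER))
    return JobPlan(algorithm=algorithm, num_workers=num_workers)


if __name__ == "__main__":
    print(plan_job(JobRequest(task="cluster", data_size_gb=1000.0, min_confidence=0.8)))
```

In a real deployment, the returned plan would presumably be translated into a Spark job submission; that step is omitted here because the abstract does not describe it.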

Subject Area

Computer science

Recommended Citation

Waled M Tayib, "Machine learning as a service using Apache Spark" (2016). ETD Collection for Tennessee State University. Paper AAI10158649.