Apache Spark is a cluster computing platform designed to be fast and general-purpose.

On the speed side, Spark extends the popular MapReduce model to efficiently support more types of computations, including interactive queries and stream processing. Speed is important in processing large datasets, as it means the difference between exploring data interactively and waiting minutes or hours. One of the main features Spark offers for speed is the ability to run computations in memory, but the system is also more efficient than MapReduce for complex applications running on disk.
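To make the in-memory point concrete, here is a minimal plain-Python sketch. It does not use Spark's API. The names `load_partitions`, `map_reduce`, and `load_count` are hypothetical, invented for illustration, and the tiny hard-coded dataset stands in for a large file on disk. It counts how often the "expensive" load step runs when two map-and-reduce jobs each reload their input, compared with when the input is loaded once and kept in memory.

```python
from functools import reduce

load_count = 0  # how many times the "expensive" load step has run


def load_partitions():
    """Hypothetical stand-in for reading a dataset from disk in partitions."""
    global load_count
    load_count += 1
    return [[1, 2, 3], [4, 5, 6], [7, 8, 9]]


def map_reduce(partitions, map_fn, reduce_fn):
    """Apply map_fn to every record, then combine the results with reduce_fn."""
    mapped = (map_fn(x) for part in partitions for x in part)
    return reduce(reduce_fn, mapped)


def add(a, b):
    return a + b


# Disk-style: every job reloads its input before computing.
total_disk = map_reduce(load_partitions(), lambda x: x, add)
squares_disk = map_reduce(load_partitions(), lambda x: x * x, add)
loads_without_cache = load_count

# In-memory style: load once, keep the data around, reuse it for both jobs.
load_count = 0
cached = load_partitions()
total_mem = map_reduce(cached, lambda x: x, add)
squares_mem = map_reduce(cached, lambda x: x * x, add)
loads_with_cache = load_count

print(total_mem, squares_mem, loads_without_cache, loads_with_cache)
```

Both styles produce the same answers; only the number of loads differs. On a real cluster the reload would mean rereading a large dataset from disk, which is why reusing in-memory data matters for interactive and iterative work.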


Most of us are at least somewhat familiar with the term data science, a field that keeps showing potential for new discoveries. Its challenges seem evolutionary, given that the amount of data in the world is only going to increase. The development of tools and libraries to deal with these challenges, however, can fairly be called revolutionary. One product of this revolution is Spark. Over time, we will discuss its implementations and experiments in detail.

Data science, a very broad term in itself, is about extracting information from the trails of data people leave behind in the virtual or physical world. Examples include your product browsing history or the list of items you have bought from a grocery store. As Alpaydin writes, “What we lack in knowledge, we make up for in data,” and data is considered to be the cheapest raw material ever found.

The question then arises: what should we do with this data? How does analyzing it help multinational corporations make a fortune from it? The main purpose of this field, in general opinion, is to understand the nature of data in order to form better visualizations, structures, and models that yield highly accurate results or predictions.

Spark, for its part, provides MLlib, a library of machine learning functions that lets one invoke various algorithms on distributed datasets, since data in Spark is represented in the form of RDDs (resilient distributed datasets).

Data scientists use their data and analytical ability to find and interpret rich data sources; manage large amounts of data despite hardware, software, and bandwidth constraints; merge data sources; ensure consistency of datasets; create visualizations to aid in understanding data; build mathematical models using the data; and present and communicate the data insights and findings. They are often expected to produce answers in days rather than months, to work by exploratory analysis and rapid iteration, and to present results with dashboards (displays of current values) rather than the papers and reports that statisticians normally produce.



