What Is Apache Spark?

Apache Spark is a cluster computing platform designed to be fast and general-purpose.

On the speed side, Spark extends the popular MapReduce model to efficiently support more types of computations, including interactive queries and stream processing. Speed is important in processing large datasets, as it means the difference between exploring data interactively and waiting minutes or hours. One of the main features Spark offers for speed is the ability to run computations in memory, but the system is also more efficient than MapReduce for complex applications running on disk.

On the generality side, Spark is designed to cover a wide range of workloads that previously required separate distributed systems, including batch applications, iterative algorithms, interactive queries, and streaming. By supporting these workloads in the same engine, Spark makes it easy and inexpensive to combine different processing types, which is often necessary in production data analysis pipelines. In addition, it reduces the management burden of maintaining separate tools.

Spark is designed to be highly accessible, offering simple APIs in Python, Java, Scala, and SQL, and rich built-in libraries. It also integrates closely with other Big Data tools. In particular, Spark can run in Hadoop clusters and access any Hadoop data source, including Cassandra.

Spark Streaming

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards. In fact, you can apply Spark’s machine learning and graph processing algorithms on data streams.
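To make the idea of a window operation concrete, here is a minimal sketch in plain Python. It does not use the Spark API. It assumes the stream has already arrived as a list of small batches, and the helper name windowed_reduce and its parameters are hypothetical, chosen only for illustration.

```python
from functools import reduce


def windowed_reduce(batches, window_length, slide, func, initial):
    """Reduce the values in each sliding window of consecutive batches.

    window_length and slide are measured in number of batches
    (an assumption made for this sketch).
    """
    results = []
    for end in range(window_length, len(batches) + 1, slide):
        window = batches[end - window_length:end]
        # Flatten the batches in the window, then fold them with func.
        values = [v for batch in window for v in batch]
        results.append(reduce(func, values, initial))
    return results


# Four batches of numbers; sum every window of two batches, sliding by one.
batches = [[1, 2], [3], [4, 5], [6]]
sums = windowed_reduce(batches, window_length=2, slide=1,
                       func=lambda a, b: a + b, initial=0)
print(sums)
```

Each result covers the most recent two batches, so consecutive windows overlap by one batch.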

Spark Streaming provides a high-level abstraction called a discretized stream, or DStream, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.
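The idea of a continuous stream represented as a sequence of small datasets can be sketched in plain Python. This is only an illustration, not Spark code: each inner list stands in for one RDD, and the function name discretize, the (timestamp, value) event format, and the batch_interval parameter are assumptions made for the example.

```python
def discretize(events, batch_interval):
    """Split time-ordered (timestamp, value) events into fixed-interval batches.

    Intervals with no events produce an empty batch, so the result has
    one batch per interval between the first and the last event.
    """
    if not events:
        return []
    start = events[0][0]
    last = events[-1][0]
    n_batches = int((last - start) // batch_interval) + 1
    batches = [[] for _ in range(n_batches)]
    for ts, value in events:
        batches[int((ts - start) // batch_interval)].append(value)
    return batches


events = [(0.0, "a"), (0.4, "b"), (1.1, "c"), (3.2, "d")]
batches = discretize(events, batch_interval=1.0)

# A "high-level operation" on the stream is applied batch by batch.
upper = [[v.upper() for v in batch] for batch in batches]
print(batches)
print(upper)
```

Applying an operation to the stream then amounts to applying it to every batch in turn, which is what lets stream code reuse batch-style logic.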

Why is Spark Streaming needed?

A stateful stream processing system is one that must update its state as data arrives on the stream. Examples include computing the distance covered by a vehicle from a stream of its GPS locations, or counting the occurrences of the word “spark” in a stream of text. Such a system needs low latency, and its state must not be lost even if a node fails.
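The word-counting example can be sketched as state that is carried from one batch of the stream to the next. The following is a plain-Python illustration, not the Spark Streaming API, and it ignores fault tolerance entirely. The function name update_state and the sample batches are made up for the example.

```python
from collections import Counter


def update_state(state, batch):
    """Return a new state with the word counts of one batch folded in."""
    new_state = Counter(state)
    for line in batch:
        new_state.update(line.lower().split())
    return new_state


# Three batches of text lines; the second batch is empty.
batches = [["Spark is fast", "I like spark"], [], ["spark streaming"]]

state = Counter()
history = []
for batch in batches:
    state = update_state(state, batch)
    history.append(state["spark"])  # running count of "spark" so far

print(history)
```

The running count survives an empty batch because the state lives outside any single batch. Keeping that state safe across node failures is the hard part that a real system has to solve.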

Batch processing systems like Hadoop have high latency and are not suitable for near-real-time processing requirements. Storm replays a record if it has not been processed, which guarantees that every record is processed at least once, but this can lead to inconsistency because a record could be processed twice. If a node running Storm goes down, its state is lost. In most environments, Hadoop and Storm (or other stream processing systems) have been used for batch processing and stream processing, respectively. Using two different programming models increases code size, the number of bugs to fix, and development effort, and it introduces a learning curve along with other issues. Spark Streaming helps fix these issues and provides a scalable, efficient, and resilient system that integrates with batch processing.


