For those wanting to work with Big Data, it isn’t enough to simply know a programming language and a small-scale library.  Once your data reaches many gigabytes, if not terabytes, in size, working with it on a single machine becomes cumbersome.  Your computer can only run so fast and store only so much.  At this point, you would look into what kind of tooling is used for massive amounts of data.  One of the tools that you would consider is Apache Spark.  In this post, we’ll look at what Spark is, what you can do with it, and why you would use it.

What is Apache Spark?

In short, Apache Spark is an open source, unified analytics engine for big data.  Spark is written in the Scala programming language.

Why use Spark?

With so many different tools available for Big Data, why would you bother learning Spark?  Why not just learn Hadoop instead?

In the early days of Big Data, Hadoop was the go-to open source tool for handling huge amounts of data (we’re talking many gigabytes to terabytes).  Hadoop is a distributed framework that uses MapReduce to process data from disk.  The MapReduce model was designed at Google to handle their search engine indexing, and Hadoop is an open source implementation of that idea.  For the applications that it was trying to suit, Hadoop’s MapReduce was sufficient.
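To make the MapReduce idea concrete, here is a minimal single-machine sketch of a word count in plain Python.  It is not Hadoop code: the function names (map_phase, shuffle_phase, reduce_phase, word_count) are made up for illustration, and everything runs in one process.  In a real Hadoop job, the same three steps are spread across many machines, with intermediate results going through disk.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps in order on an in-memory list of lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data is big", "spark handles big data"]
    print(word_count(sample))
```

The point of the split is that the map and reduce steps each look at one line or one key at a time, which is what lets a framework run them in parallel on different machines.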

However, as time went on, Hadoop proved too slow because of its heavy reliance on disk I/O.  The slow speed made it costly to run tasks with huge amounts of data.  In addition, the slow speed made Hadoop a poor choice for streaming in real-time applications.

It wasn’t until a few years ago that Spark became a viable tool for data analytics.  Instead of operating on disk, Spark can perform operations in memory, dramatically speeding up execution.  Even when Spark needs to work with disk, its built-in mechanisms allow for faster execution than Hadoop.  The exact mechanisms are beyond the scope of this post.  If you are interested, you can visit the Apache Spark homepage for more details.

What can Spark do?

If you are looking into analytical tools for your next project, Spark can be used in several areas:

Machine Learning (MLlib)

Spark is great for applications that require machine learning to get results.  Spark provides numerous algorithms ranging from classification and regression to clustering.  For those who know the scikit-learn library, MLlib covers similar ground, but it is built for distributed machine learning.

Spark SQL

Since working with data efficiently is very important for data analytics, Spark lets you create datasets that behave like a SQL table.  You can query data either with the SQL SELECT command or with Spark’s DataFrame API.  Those who have used pandas will find Spark SQL familiar.

Spark Streaming

When deciding how to process data, you can choose either batch processing or streaming.  Batch processing requires all your data to be present before you can work with it.  However, for applications that get their data in real time, batch processing simply won’t do.  Instead, Spark has a dedicated component for streaming.  With Spark Streaming, you can create fault-tolerant streaming applications that can scale.  Spark Streaming not only integrates nicely with Spark itself (you don’t need much to convert your applications to streaming), but also allows for integration with other tools, such as Apache Kafka.

For those who are used to Apache Storm, Spark Streaming is built around a similar idea.
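To illustrate the difference between batch and streaming, here is a small plain-Python sketch.  It does not use Spark: the names micro_batches, batch_count, and streaming_count are hypothetical, and the "stream" is just a Python list.  It loosely mimics the micro-batch approach of Spark Streaming's classic API, where incoming data is cut into small batches and the results are updated after each one.

```python
from collections import Counter
from itertools import islice


def micro_batches(events, batch_size):
    """Yield successive lists of up to batch_size events from any iterable."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def batch_count(events):
    """Batch style: the complete dataset must exist before we get an answer."""
    return Counter(events)


def streaming_count(events, batch_size=2):
    """Streaming style: update running totals after every small batch."""
    totals = Counter()
    for batch in micro_batches(events, batch_size):
        totals.update(batch)
        # A snapshot of the counts is available before the stream has ended.
        yield dict(totals)


if __name__ == "__main__":
    events = ["click", "view", "click", "buy", "click"]
    for snapshot in streaming_count(events):
        print(snapshot)
    print(dict(batch_count(events)))
```

Both styles end up with the same totals.  The difference is that the streaming version has a usable answer after every batch, which is what real-time applications need.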

GraphX

When we talk about databases, we often refer to relational models.  However, relational models are not always the best way to represent data.  For example, if you have a large network of connected data, it would be better to represent your data as a graph.  Thankfully, Spark allows you to perform graph analytics with GraphX.

To be honest, I don’t really know this area of Spark very well, so please take the description with a grain of salt.
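As a rough illustration of why a graph can fit network data better than a table, here is a plain-Python sketch.  It is not GraphX code: the edge list, the names, and the helper functions build_adjacency and reachable are all made up for this example.  It stores "who follows whom" as a graph and asks which people can be reached from a given starting point, a question that usually takes repeated self-joins to answer in a relational model.

```python
from collections import deque

# Hypothetical "follows" relationships: (follower, followed).
EDGES = [("ann", "bob"), ("bob", "cat"), ("cat", "ann"), ("dan", "eve")]


def build_adjacency(edges):
    """Turn a list of directed edges into a vertex -> neighbors mapping."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, set()).add(dst)
        adjacency.setdefault(dst, set())  # make sure every vertex has an entry
    return adjacency


def reachable(adjacency, start):
    """Breadth-first search: every vertex reachable from start, including start."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


if __name__ == "__main__":
    graph = build_adjacency(EDGES)
    print(sorted(reachable(graph, "ann")))
    print(sorted(reachable(graph, "dan")))
```

On a single machine this is easy.  The appeal of a distributed graph library is running the same kind of traversal when the network is too large to fit on one computer.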

Conclusion

Apache Spark has boomed in popularity due to its speed, ease of use, and the ability to write Scala (recommended), Java, Python (slow due to overhead), R, and SQL programs for Spark.  In addition, Spark integrates with many popular tools such as Hadoop, Kafka, and RDBMSs.  Finally, Spark can also be used in the cloud, making it a viable option for businesses on a low budget.

Despite the variety of tools that Spark comes with, Spark is not a panacea for every data problem.  Nevertheless, for those needing to perform analytics on large amounts of data, Spark is a valuable tool that is worth learning.

How do you guys use Apache Spark?  Post your response down below.