Big Data Processing and Analysis Frameworks
By: Koffka Khan
Narrated by: Virtual Voice
Length: 9 hrs and 52 mins

This title uses virtual voice narration

Virtual voice is computer-generated narration for audiobooks

Publisher's summary

Part I of this book focuses on Apache Hadoop and is divided into 16 chapters. Chapter 1 gives the introduction. In chapter 2 we explore the Big Data problem. Chapter 3 illustrates a Big Data scenario, while chapter 4 introduces Apache Hadoop. The Hadoop architecture is given in chapter 5. Chapter 6 introduces HDFS. The benefits of distributed file systems are discussed in chapter 7. Chapters 8 and 9 explain writing and reading files. In chapter 10 MapReduce is introduced, with MapReduce programming in chapter 11. YARN is introduced in chapter 12 and its architecture is given in chapter 13. The Hadoop architecture is explored in chapter 14. In chapter 15 we discuss the Hadoop cluster. Finally, in chapter 16 the Hadoop ecosystem is given.
Apache Kafka is a distributed data store for ingesting and processing streaming data in real time. Streaming data is information that is continuously produced by hundreds of data sources, all sending data records at the same time. A streaming platform must be able to handle this steady stream of data and process it sequentially and incrementally. Kafka is frequently used to build real-time streaming data pipelines that enable streaming analytics and mission-critical use cases with guaranteed ordering, no message loss, and exactly-once processing.
Apache Kafka is highly scalable and fast because it allows data to be distributed across several servers. It decouples data streams and thereby reduces latency. It can also distribute and replicate partitions across other servers, which protects against server failure. This part of the book reviews the operation of this important distributed stream processing system and consists of six chapters. Chapter 1 gives the introduction. In chapter 2 we describe the components of Kafka within its environment. Chapter 3 describes ZooKeeper, and Kafka events and streams are explained in chapter 4. Use cases of Kafka are given in chapter 5. Finally, in chapter 6 the findings are stated.
The part of the book on Apache Spark has six chapters. The first chapter is the introduction. In chapter 2 we discuss the components of Spark. In-memory processing is the subject of chapter 3. In chapter 4 we compare MapReduce and Spark. In chapter 5 we cover Apache Spark Streaming. Finally, in chapter 6 we briefly discuss Apache Spark MLlib.
The part of the book on Apache Hive consists of two chapters. The first gives an introduction to Apache Hive and the second describes its architecture. Apache Hive is an open-source data warehouse system for reading, writing, and managing massive data sets stored in the Hadoop Distributed File System (HDFS) or in other data storage systems such as Apache HBase.