The best books about big data processing ecosystem

Picked by Tomasz Lelek

Book cover of Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems

Book cover of Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow 3e: Concepts, Tools, and Techniques to Build Intelligent Systems

Book cover of Kafka: The Definitive Guide: Real-Time Data and Stream Processing at Scale

Book cover of Advanced Analytics with Spark: Patterns for Learning from Data at Scale

Book cover of Database Internals: A Deep-Dive Into How Distributed Data Systems Work

What did I love about each book?

List by Tomasz Lelek

Why am I passionate about this?

I am motivated by working on products that many people use. I've been a part of companies that deliver products impacting millions of people. To achieve it, I am working in the Big Data ecosystem and striving to simplify it by contributing to Dremio's Data LakeHouse solution. I worked on projects using Spark, HDFS, Cassandra, and Kafka technologies. I have been working in the software engineering industry for ten years now, and I've tried to share my experience and lessons learned in the Software Mistakes and Tradeoffs book, hoping that it will allow current and the next generation of engineers to create better software, leading to more happy users.

The books I picked & why

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems

By Martin Kleppmann,

Why did I love this book?

Designing Data-Intensive Applications is the best book if you want to learn about the main principles behind every system that is able to store and process big amounts of data.

You'll learn about distributed storage systems, their tradeoffs (availability, consistency, fault-tolerance), streaming processing systems, and main algorithms.

Those are the critical concepts behind almost every successful company that needs to create scalable solutions.

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems

By Martin Kleppmann,

Why should I read it?

2 authors picked Designing Data-Intensive Applications as one of their favorite books, and they share why you should read it.

What is this book about?

Data is at the center of many challenges in system design today. Difficult issues need to be figured out, such as scalability, consistency, reliability, efficiency, and maintainability. In addition, we have an overwhelming variety of tools, including NoSQL datastores, stream or batch processors, and message brokers. What are the right choices for your application? How do you make sense of all these buzzwords? In this practical and comprehensive guide, author Martin Kleppmann helps you navigate this diverse landscape by examining the pros and cons of various technologies for processing and storing data. Software keeps changing, but the fundamental principles remain…

Explore

Topics

Big data

Genres

Design

Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow 3e: Concepts, Tools, and Techniques to Build Intelligent Systems

By Géron Aurélien,

Why did I love this book?

The Hands-on Machine Learning book presents an end-to-end approach to many problems that can be solved with machine learning.

Every concept and topic is backed up with a running code that you can experiment with and adapt to your real-world problems.

Thanks to this book, you will be able to understand the state of the art of today's machine learning and feel comfortable using the most up-to-date ML methods.

Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow 3e: Concepts, Tools, and Techniques to Build Intelligent Systems

By Géron Aurélien,

Why should I read it?

1 author picked Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow 3e as one of their favorite books, and they share why you should read it.

What is this book about?

Through a recent series of breakthroughs, deep learning has boosted the entire field of machine learning. Now, even programmers who know close to nothing about this technology can use simple, efficient tools to implement programs capable of learning from data. This best-selling book uses concrete examples, minimal theory, and production-ready Python frameworks--scikit-learn, Keras, and TensorFlow--to help you gain an intuitive understanding of the concepts and tools for building intelligent systems.

With this updated third edition, author Aurelien Geron explores a range of techniques, starting with simple linear regression and progressing to deep neural networks. Numerous code examples and exercises throughout…

Explore

Topics

Genres

Coming soon!

Kafka: The Definitive Guide: Real-Time Data and Stream Processing at Scale

By Neha Narkhede, Gwen Shapira, Todd Palino

Why did I love this book?

Apache Kafka is the backbone of almost every streaming-based system today.

The solutions created and implemented in Kafka are the key concepts in every streaming system that you will work with.

This book will allow you to fully understand the Kafka architecture, its internals, and APIs and allow you to become an expert in this technology.

Kafka: The Definitive Guide: Real-Time Data and Stream Processing at Scale

By Neha Narkhede, Gwen Shapira, Todd Palino

Why should I read it?

1 author picked Kafka as one of their favorite books, and they share why you should read it.

What is this book about?

Every enterprise application creates data, whether it's log messages, metrics, user activity, outgoing messages, or something else. And how to move all of this data becomes nearly as important as the data itself. If you're an application architect, developer, or production engineer new to Apache Kafka, this practical guide shows you how to use this open source streaming platform to handle real-time data feeds.

Engineers from Confluent and LinkedIn who are responsible for developing Kafka explain how to deploy production Kafka clusters, write reliable event-driven microservices, and build scalable stream-processing applications with this platform. Through detailed examples, you'll learn Kafka's…

Explore

Topics

Genres

Design

Advanced Analytics with Spark: Patterns for Learning from Data at Scale

By Sandy Ryza, Uri Laserson, Sean Owen , Josh Wills

Why did I love this book?

Apache Spark has a very high point of entry for newcomers to the Big Data ecosystem.

However, it is a key tool that almost everyone is using for running distributed processing. I recommend everyone to read this book before delving into production solutions based on Apache Spark.

This book will allow you to alleviate many spark problems, such as serialization, memory utilization, and parallelization of processing.

Advanced Analytics with Spark: Patterns for Learning from Data at Scale

By Sandy Ryza, Uri Laserson, Sean Owen , Josh Wills

Why should I read it?

1 author picked Advanced Analytics with Spark as one of their favorite books, and they share why you should read it.

What is this book about?

In this practical book, four Cloudera data scientists present a set of self-contained patterns for performing large-scale data analysis with Spark. The authors bring Spark, statistical methods, and real-world data sets together to teach you how to approach analytics problems by example. You'll start with an introduction to Spark and its ecosystem, and then dive into patterns that apply common techniques-classification, collaborative filtering, and anomaly detection among others-to fields such as genomics, security, and finance. If you have an entry-level understanding of machine learning and statistics, and you program in Java, Python, or Scala, you'll find these patterns useful for…

Explore

Topics

Genres

Coming soon!

Database Internals: A Deep-Dive Into How Distributed Data Systems Work

By Alex Petrov,

Why did I love this book?

The Database Internals will allow you to go one step further in your understanding of how distributed databases work.

The author has a lot of experience with one of the most successful distributed databases - Apache Cassandra and shares his knowledge about low-level details and internals of distributed databases.

Database Internals: A Deep-Dive Into How Distributed Data Systems Work

By Alex Petrov,

Why should I read it?

1 author picked Database Internals as one of their favorite books, and they share why you should read it.

What is this book about?

When it comes to choosing, using, and maintaining a database, understanding its internals is essential. But with so many distributed databases and tools available today, it's often difficult to understand what each one offers and how they differ. With this practical guide, Alex Petrov guides developers through the concepts behind modern database and storage engine internals.

Throughout the book, you'll explore relevant material gleaned from numerous books, papers, blog posts, and the source code of several open source databases. These resources are listed at the end of parts one and two. You'll discover that the most significant distinctions among many…

Explore

Topics