55 books like Advanced Analytics with Spark

By Sandy Ryza, Uri Laserson, Sean Owen, Josh Wills

Here are 55 books that Advanced Analytics with Spark fans have personally recommended if you like Advanced Analytics with Spark. Shepherd is a community of 11,000+ authors and super readers sharing their favorite books with the world.

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems

Yevgeniy Brikman Author Of Fundamentals of DevOps and Software Delivery: A Hands-On Guide to Deploying and Managing Software in Production

From my list on practical, hands-on books on DevOps and software delivery.

Why am I passionate about this?

I’ve spent more than a decade working on infrastructure, from my early days at LinkedIn, where we had to do a massive DevOps transformation to save the company, to co-founding Gruntwork, where I had the opportunity to work with hundreds of companies on their software delivery practices. From all of this, I can say the following with certainty: the DevOps best practices that a handful of the top tech companies have figured out are not filtering down to the rest of the industry. This is making the entire software industry slower, less effective, and less secure—and I see it as my mission to fix that.

Yevgeniy's book list on practical, hands-on books on DevOps and software delivery

Why did Yevgeniy love this book?

This is the best overview of data storage and distributed systems—two key concepts for building almost any piece of software today—that I've seen anywhere. Martin does a wonderful job of taking a massive body of research and distilling complicated concepts and difficult trade-offs down to a level anyone can understand.

I learned a lot about replication, partitioning, linearizability, locking, write skew, phantoms, transactions, event logs, and more. I'm also a big fan of the final chapter, The Future of Data Systems, which covers ideas such as "unbundling the database", end-to-end event streams, and an important discussion on ethics in programming and data systems.

By Martin Kleppmann

Why should I read it?

2 authors picked Designing Data-Intensive Applications as one of their favorite books, and they share why you should read it.

What is this book about?

Data is at the center of many challenges in system design today. Difficult issues need to be figured out, such as scalability, consistency, reliability, efficiency, and maintainability. In addition, we have an overwhelming variety of tools, including NoSQL datastores, stream or batch processors, and message brokers. What are the right choices for your application? How do you make sense of all these buzzwords? In this practical and comprehensive guide, author Martin Kleppmann helps you navigate this diverse landscape by examining the pros and cons of various technologies for processing and storing data. Software keeps changing, but the fundamental principles remain…


Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow 3e: Concepts, Tools, and Techniques to Build Intelligent Systems

Tomasz Lelek Author Of Software Mistakes and Tradeoffs: How to make good programming decisions

From my list on big data processing ecosystem.

Why am I passionate about this?

I am motivated by working on products that many people use. I've been a part of companies that deliver products impacting millions of people. To achieve this, I work in the Big Data ecosystem and strive to simplify it by contributing to Dremio's Data LakeHouse solution. I have worked on projects using Spark, HDFS, Cassandra, and Kafka. I have been working in the software engineering industry for ten years now, and I've tried to share my experience and lessons learned in the Software Mistakes and Tradeoffs book, hoping that it will help current and future generations of engineers create better software, leading to happier users.

Tomasz's book list on big data processing ecosystem

Why did Tomasz love this book?

The Hands-on Machine Learning book presents an end-to-end approach to many problems that can be solved with machine learning.

Every concept and topic is backed up with running code that you can experiment with and adapt to your real-world problems.

Thanks to this book, you will be able to understand the state of the art of today's machine learning and feel comfortable using the most up-to-date ML methods.

By Aurélien Géron

Why should I read it?

1 author picked Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow 3e as one of their favorite books, and they share why you should read it.

What is this book about?

Through a recent series of breakthroughs, deep learning has boosted the entire field of machine learning. Now, even programmers who know close to nothing about this technology can use simple, efficient tools to implement programs capable of learning from data. This best-selling book uses concrete examples, minimal theory, and production-ready Python frameworks--scikit-learn, Keras, and TensorFlow--to help you gain an intuitive understanding of the concepts and tools for building intelligent systems.

With this updated third edition, author Aurelien Geron explores a range of techniques, starting with simple linear regression and progressing to deep neural networks. Numerous code examples and exercises throughout…


Kafka: The Definitive Guide: Real-Time Data and Stream Processing at Scale

Tomasz Lelek Author Of Software Mistakes and Tradeoffs: How to make good programming decisions

From my list on big data processing ecosystem.

Why did Tomasz love this book?

Apache Kafka is the backbone of almost every streaming-based system today.

The solutions created and implemented in Kafka are the key concepts in every streaming system that you will work with.

This book will help you fully understand Kafka's architecture, internals, and APIs, and become an expert in this technology.

By Neha Narkhede, Gwen Shapira, Todd Palino

Why should I read it?

1 author picked Kafka as one of their favorite books, and they share why you should read it.

What is this book about?

Every enterprise application creates data, whether it's log messages, metrics, user activity, outgoing messages, or something else. And how to move all of this data becomes nearly as important as the data itself. If you're an application architect, developer, or production engineer new to Apache Kafka, this practical guide shows you how to use this open source streaming platform to handle real-time data feeds.

Engineers from Confluent and LinkedIn who are responsible for developing Kafka explain how to deploy production Kafka clusters, write reliable event-driven microservices, and build scalable stream-processing applications with this platform. Through detailed examples, you'll learn Kafka's…


Database Internals: A Deep-Dive Into How Distributed Data Systems Work

Tomasz Lelek Author Of Software Mistakes and Tradeoffs: How to make good programming decisions

From my list on big data processing ecosystem.

Why did Tomasz love this book?

Database Internals will allow you to go one step further in your understanding of how distributed databases work.

The author has a lot of experience with one of the most successful distributed databases, Apache Cassandra, and shares his knowledge about the low-level details and internals of distributed databases.

By Alex Petrov

Why should I read it?

1 author picked Database Internals as one of their favorite books, and they share why you should read it.

What is this book about?

When it comes to choosing, using, and maintaining a database, understanding its internals is essential. But with so many distributed databases and tools available today, it's often difficult to understand what each one offers and how they differ. With this practical guide, Alex Petrov guides developers through the concepts behind modern database and storage engine internals.

Throughout the book, you'll explore relevant material gleaned from numerous books, papers, blog posts, and the source code of several open source databases. These resources are listed at the end of parts one and two. You'll discover that the most significant distinctions among many…


Be Data Literate: The Data Literacy Skills Everyone Needs to Succeed

Jeremy Adamson Author Of Minding the Machines: Building and Leading Data Science and Analytics Teams

From my list of books for data science and analytics leaders.

Why am I passionate about this?

I am a leader in analytics and AI strategy, and have a broad range of experience in aviation, energy, financial services, and the public sector. I have worked with several major organizations to help them establish a leadership position in data science and to unlock real business value using advanced analytics.

Jeremy's book list for data science and analytics leaders

Why did Jeremy love this book?

Not everybody needs to be a data scientist, but everybody does need to be data literate. Without an intentional focus on evangelism and building a strong data culture in your organization, it will be an uphill battle to make meaningful change. This book helps individuals and leaders understand what data literacy is and how we can build it like any other skill.

By Jordan Morrow

Why should I read it?

1 author picked Be Data Literate as one of their favorite books, and they share why you should read it.

What is this book about?

In the fast moving world of the fourth industrial revolution not everyone needs to be a data scientist but everyone should be data literate, with the ability to read, analyze and communicate with data. It is not enough for a business to have the best data if those using it don't understand the right questions to ask or how to use the information generated to make decisions. Be Data Literate is the essential guide to developing the curiosity, creativity and critical thinking necessary to make anyone data literate, without retraining as a data scientist or statistician. With learnings to show…


R for Data Science: Import, Tidy, Transform, Visualize, and Model Data

Tilman M. Davies Author Of The Book of R: A First Course in Programming and Statistics

From my list on intro to programming and data science with R.

Why am I passionate about this?

I’m an applied statistician and academic researcher/lecturer at New Zealand’s oldest university – the University of Otago. R facilitates everything I do – research, academic publication, and teaching. It’s the latter part of my job that motivated my own book on R. From first-year statistics students who have never seen R to my own Ph.D. students using R to implement novel and highly complex statistical methods and models, my experience is that all ultimately love the ease with which the R language permits exploration, visualisation, analysis, and inference of one’s data. The ever-growing need in today’s society for skilled statisticians and data scientists means there's never been a better time to learn this essential language.

Tilman's book list on intro to programming and data science with R

Why did Tilman love this book?

For those intending to use R with an eye on the popular 'Tidyverse' suite of packages – which facilitate the handling, manipulation, and visualisation of data sets – it's hard to go past this book. From the founding contributors of the RStudio/Tidyverse worlds, this is a great way to learn about this dialect of R against the overarching backdrop of statistical data analysis and data science.

By Hadley Wickham, Garrett Grolemund

Why should I read it?

1 author picked R for Data Science as one of their favorite books, and they share why you should read it.

What is this book about?

Learn how to use R to turn raw data into insight, knowledge, and understanding. This book introduces you to R, RStudio, and the tidyverse, a collection of R packages designed to work together to make data science fast, fluent, and fun. Suitable for readers with no previous programming experience, R for Data Science is designed to get you doing data science as quickly as possible. Authors Hadley Wickham and Garrett Grolemund guide you through the steps of importing, wrangling, exploring, and modeling your data and communicating the results. You'll get a complete, big-picture understanding of the data science cycle, along…


Introduction to Machine Learning with Python: A Guide for Data Scientists

Yuxi (Hayden) Liu Author Of Python Machine Learning By Example: Build intelligent systems using Python, TensorFlow 2, PyTorch, and scikit-learn

From my list on machine learning for beginners.

Why am I passionate about this?

I have been a machine learning engineer applying my ML expertise in the computational advertising and search domains. I am an author of 8 machine learning books. My first book was ranked the #1 bestseller in its category on Amazon in 2017 and 2018 and was translated into many languages. I am also an ML education enthusiast and used to teach ML courses in Toronto, Canada.

Yuxi's book list on machine learning for beginners

Why did Yuxi love this book?

This book is more advanced than the first book I recommended. It presents the theoretical and practical aspects of ML step by step from the bottom up. Each chapter elaborates at length on a core building block in the ML life cycle. For example, feature engineering, supervised learning, and model evaluation have their own separate chapters, with intuitive discussions of how they work. Most of the concepts are taught through the simple yet powerful Python module scikit-learn, so it won’t overburden you with heavy programming. This book will be perfect for practitioners with some understanding of statistics and linear algebra.

By Andreas C. Müller, Sarah Guido

Why should I read it?

1 author picked Introduction to Machine Learning with Python as one of their favorite books, and they share why you should read it.

What is this book about?

Machine learning has become an integral part of many commercial applications and research projects, but this field is not exclusive to large companies with extensive research teams. If you use Python, even as a beginner, this book will teach you practical ways to build your own machine learning solutions. With all the data available today, machine learning applications are limited only by your imagination. You'll learn the steps necessary to create a successful machine-learning application with Python and the scikit-learn library. Authors Andreas Muller and Sarah Guido focus on the practical aspects of using machine learning algorithms, rather than the…


Machine Learning For Absolute Beginners: A Plain English Introduction

Yuxi (Hayden) Liu Author Of Python Machine Learning By Example: Build intelligent systems using Python, TensorFlow 2, PyTorch, and scikit-learn

From my list on machine learning for beginners.

Why did Yuxi love this book?

This could be the first stop of your brand new machine learning journey. I personally like how technical concepts are translated into plain English – each chapter starts with a high-level overview of an ML algorithm or methodology, concise and clear, followed by lots of visual examples and real-world scenarios. I can guarantee you won’t get lost halfway. The book focuses on getting you introduced to ML with minimal math. But if you want to grasp more of the math, the next book I recommend is waiting for you.

By Oliver Theobald

Why should I read it?

1 author picked Machine Learning For Absolute Beginners as one of their favorite books, and they share why you should read it.

What is this book about?

Featured by Tableau as the first of "7 Books About Machine Learning for Beginners."

Ready to spin up a virtual GPU instance and smash through petabytes of data? Want to add 'Machine Learning' to your LinkedIn profile?

Well, hold on there...

Before you embark on your epic journey, there are some high-level theory and statistical principles to weave through first.
But rather than spend…


Information Quality: The Potential of Data and Analytics to Generate Knowledge

Ron S. Kenett Author Of The Real Work of Data Science: Turning Data into Information, Better Decisions, and Stronger Organizations

From my list on how numbers turn into information.

Why am I passionate about this?

I was trained as a mathematician but have always been motivated by problem-solving challenges. Statistics and analytics combine mathematical models with statistical thinking. My career has always focused on this combination and, as a statistician, you can apply it in a wide range of domains. The advent of big data and machine learning algorithms has opened up new opportunities for applied statisticians. This perspective complements computer science views on how to address data science. The Real Work of Data Science covers 18 areas (in 18 chapters) that need to be pushed forward in order to turn data into information, better decisions, and stronger organizations.

Ron's book list on how numbers turn into information

Why did Ron love this book?

A lightly technical introduction to a comprehensive framework for defining and evaluating the quality of information generated by statistical analysis. It expands the role of analytics by including dimensions that affect information quality, such as data resolution, data integration, operationalization, and generalizability of findings. This wide-angle perspective provides a practical checklist that has been found useful in applications. Multiple case studies enable readers to connect to their favorite topic and also learn from other areas.

By Ron S. Kenett, Galit Shmueli

Why should I read it?

1 author picked Information Quality as one of their favorite books, and they share why you should read it.

What is this book about?

Provides an important framework for data analysts in assessing the quality of data and its potential to provide meaningful insights through analysis. Analytics and statistical analysis have become pervasive topics, mainly due to the growing availability of data and analytic tools. Technology, however, fails to deliver insights with added value if the quality of the information it generates is not assured. Information Quality (InfoQ) is a tool developed by the authors to assess the potential of a dataset to achieve a goal of interest, using data analysis. Whether the information quality of a dataset is sufficient is of practical importance…


Rage Inside the Machine: The Prejudice of Algorithms, and How to Stop the Internet Making Bigots of Us All

Peter J. Bentley Author Of Artificial Intelligence and Robotics: Ten Short Lessons

From my list on no hype and no nonsense artificial intelligence.

Why am I passionate about this?

I’ve been a geeky kid all my life. (I don’t think I’ve quite grown up yet.) Born in the 1970s, my childhood was a wonderful playground of building robots and software. I was awarded one of the early degrees in AI, and a PhD in genetic algorithms. I’ve since spent 25 years exploring how to make computers think, build, invent, compose… and I’ve also spent 20 years writing popular science books. I’m lucky enough to be a Professor in one of the world’s best universities for Computer Science and Machine Learning: UCL, and I guess I’ve written two or three hundred scientific papers over the years. I still think I know nothing at all about real or artificial intelligence, but then does anyone?

Peter's book list on no hype and no nonsense artificial intelligence

Why did Peter love this book?

OK, I’m biased here because Rob is an old friend of mine. We first met at academic conferences and had several heated debates (arguments). But after spending a little time together at a workshop, we realised each probably knew what they were talking about after all. I should make clear that this is Robert Elliott Smith, not the Rob Smith who writes about “Artificial Superintelligence”. Those books definitely do not make this list.

Our Rob is a coherent, grounded scientist with bags of real-world experience, and he brings his knowledge to this title with gusto, telling us about how AI is affecting our lives in ways you never thought possible – and often not in a good way. If you want to understand what can go wrong with AI and what we should be doing to stop it, don’t read about singularities or other such nonsense; read this.

By Robert Elliott Smith

Why should I read it?

1 author picked Rage Inside the Machine as one of their favorite books, and they share why you should read it.

What is this book about?

Shortlisted for the 2020 Business Book Awards

We live in a world increasingly ruled by technology; we seem as governed by technology as we do by laws and regulations. Frighteningly often, the influence of technology in and on our lives goes completely unchallenged by citizens and governments. We comfort ourselves with the soothing refrain that technology has no morals and can display no prejudice, and it's only the users of technology who distort certain aspects of it.

But is this statement actually true? Dr Robert Smith thinks it is dangerously untrue in the modern era.

Having worked in the field…


5 book lists we think you will like!

Interested in data mining, big data, and machine learning?

11,000+ authors have recommended their favorite books and what they love about them. Browse their picks for the best books about data mining, big data, and machine learning.

Data Mining Explore 13 books about data mining
Big Data Explore 29 books about big data
Machine Learning Explore 50 books about machine learning