The best books if you want to become a data scientist

By Valliappa Lakshmanan

Who am I?

I started my career as a research scientist building machine learning algorithms for weather forecasting. Twenty years later, I found myself at a precision agriculture startup creating models that provided guidance to farmers on when to plant, what to plant, etc. So, I am part of the movement from academia to industry. Now, at Google Cloud, my team builds cross-industry solutions and I see firsthand what our customers need in their data science teams. This set of books is what I suggest when a CTO asks how to upskill their workforce, or when a graduate student asks me how to break into the industry.

I wrote...

Data Science on the Google Cloud Platform: Implementing End-To-End Real-Time Data Pipelines: From Ingest to Machine Learning

By Valliappa Lakshmanan

What is my book about?

This hands-on guide shows data engineers and data scientists how to implement an end-to-end data pipeline, using statistical and machine learning methods and tools on Google Cloud Platform (GCP).

Through the course of this updated second edition, you'll work through a sample business decision by employing a variety of data science approaches. Follow along by implementing these statistical and machine learning solutions in your own project on GCP, and discover how this platform provides a transformative and more collaborative way of doing data science.

The books I picked & why

Effective Pandas

By Matt Harrison

Why did I love this book?

Even if you are ultimately going to be working with terabytes of data, you’ll start out doing exploratory data analysis. The tool that you’ll use for that is most likely going to be Pandas. One of the best investments that you can make when becoming a data scientist is to become a Pandas expert, and there is no better book than Harrison’s to help you get there. Plus, many of the interview questions you will face during the hiring process will probably involve Pandas. Blow your interviewers out of the water by showing them corners of the Pandas library they didn’t even know!

What is this book about?

Best practices for manipulating data with Pandas. This book will arm you with years of knowledge and experience, condensed into an easy-to-follow format. Rather than spending months reading blogs and websites and searching mailing lists and groups, you can use this book to learn how to write good Pandas code.

It covers: series manipulation; creating columns; summary statistics; grouping, pivoting, and cross-tabulation; time series data; visualization; chaining; debugging code; and more...

Jumpstart Snowflake: A Step-by-Step Guide to Modern Cloud Analytics

By Dmitry Anoshin, Dmitry Shirokov, Donna Strok

Why did I love this book?

In industry, your data is very likely to live within a data warehouse such as BigQuery, Redshift, or Snowflake. Therefore, to be an effective data scientist in industry, you should learn how to use data warehouses effectively.

Once you learn data warehousing and SQL with any one of these products, it is quite easy to pick up another. So which one do you start with?

You can use Snowflake on all three of the major public clouds. Because it’s a standalone product, it is the most similar to a “traditional” data warehouse and can be picked up easily even if you are not familiar with cloud computing. That makes it a good data warehouse to start with, and is the reason my second book pick is this book on Snowflake.

BigQuery is also available on all three major public clouds, but it works best (and is used most commonly) on Google Cloud. Because BigQuery is truly serverless (you pay by the query and never deal with clusters or virtual data warehouses), it is quite unlike traditional data warehouses, and you will have to learn some public cloud concepts in order to use it. On the other hand, starting with BigQuery has several advantages. First, it offers 1 TB of querying per month for free. Second, it has machine learning built in, and Google Colab even offers a free Jupyter notebook from which to access BigQuery. Third, it’s the best choice for production use cases, as BigQuery is typically more scalable and less expensive than the alternatives. If you are willing to learn public cloud, start with the Definitive Guide to BigQuery.

AWS is the most widely used cloud, and Redshift is the most widely used data warehouse on AWS. Your organization probably already has a Redshift cluster set up and ready to go. The path of least resistance might be to learn data warehousing using the AWS book on Redshift.

What is this book about?

Explore the modern market of data analytics platforms and the benefits of using Snowflake computing, the data warehouse built for the cloud.

With the rise of cloud technologies, organizations prefer to deploy their analytics using cloud providers such as Amazon Web Services (AWS), Microsoft Azure, or Google Cloud Platform. Cloud vendors are offering modern data platforms for building cloud analytics solutions to collect data and consolidate into single storage solutions that provide insights for business users. The core of any analytics framework is the data warehouse, and previously customers did not have many choices of platform to use.

Snowflake was…

Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die

Why did I love this book?

As a data scientist in the industry, it is very helpful to understand the business context behind the problems that you are solving. In many cases, you are trying to predict behavior—who is likely to buy an item, who is likely to click on a link, who is likely to repay a loan, etc.

This book by Eric Siegel is a great introduction to predictive analytics as used in real life. It will help you frame data science problems in standard ways. For example, suppose you are asked to score sales leads so that salespeople can prioritize their efforts. How would you do it? The common way to frame this problem is to predict the customer lifetime value (LTV) of every sales lead. Before you can do prediction, though, you have to be able to do analysis.

The way you estimate the LTV is to break the problem into three sub-problems: finding the average order value, the average number of transactions per year, and how long an average customer sticks with your product. Multiplying these three estimates gives the LTV. Once you know how to estimate the LTV of existing customers, you will be able to create a system to predict LTV by comparing the attributes of the sales lead to your existing customer base. This is by no means obvious, and reading a book like this will help you learn the typical approach.
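
To make the three sub-problems concrete, here is a minimal Python sketch of the analysis step for existing customers. It is only an illustration under stated assumptions: the `Order` record, the `estimate_ltv` function, and the sample orders are hypothetical names and data invented for this example, and the average customer lifespan is passed in as a number rather than estimated from churn data.

```python
from dataclasses import dataclass


@dataclass
class Order:
    """One purchase by an existing customer (hypothetical record layout)."""
    customer_id: str
    year: int
    amount: float


def estimate_ltv(orders: list[Order], avg_lifespan_years: float) -> float:
    """Estimate LTV as average order value x orders per year x lifespan."""
    if not orders:
        raise ValueError("need at least one order to estimate LTV")

    # Sub-problem 1: average order value across all orders.
    avg_order_value = sum(o.amount for o in orders) / len(orders)

    # Sub-problem 2: average number of transactions per customer per year.
    # Assumption: a customer counts as active in a year only if they
    # placed at least one order in that year.
    customer_years = {(o.customer_id, o.year) for o in orders}
    orders_per_year = len(orders) / len(customer_years)

    # Sub-problem 3: how long a customer stays. Assumed to be supplied
    # by the caller instead of being derived from the order history.
    return avg_order_value * orders_per_year * avg_lifespan_years


# Made-up sample data, for illustration only.
sample_orders = [
    Order("a", 2022, 100.0),
    Order("a", 2022, 50.0),
    Order("a", 2023, 150.0),
    Order("b", 2023, 100.0),
]
ltv = estimate_ltv(sample_orders, avg_lifespan_years=3.0)
print(f"Estimated LTV: {ltv:.2f}")
```

With this sample data the average order value is 100, there are four orders over three customer-years, and the assumed lifespan is three years, so the estimate comes to about 400. Predicting the LTV of a new sales lead would be a separate modeling step built on top of estimates like this one.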

By Eric Siegel

What is this book about?

"Mesmerizing & fascinating..." -The Seattle Post-Intelligencer

"The Freakonomics of big data." -Stein Kretsinger, founding executive of

Award-winning | Used by over 30 universities | Translated into 9 languages

An introduction for everyone. In this rich, fascinating - surprisingly accessible - introduction, leading expert Eric Siegel reveals how predictive analytics (aka machine learning) works, and how it affects everyone every day. Rather than a "how to" for hands-on techies, the book serves lay readers and experts alike by covering new case studies and the latest state-of-the-art techniques.

Prediction is booming. It reinvents industries and runs the world. Companies, governments, law…

The Art of Statistics: How to Learn from Data

Why did I love this book?

What if you are faced with a problem for which a standard approach doesn’t yet exist? In such a case, you will need to be able to figure out the approach from first principles. This book will help you learn how to derive insights starting from raw data.

By David Spiegelhalter

What is this book about?

'A statistical national treasure' Jeremy Vine, BBC Radio 2

'Required reading for all politicians, journalists, medics and anyone who tries to influence people (or is influenced) by statistics. A tour de force' Popular Science

Do busier hospitals have higher survival rates? How many trees are there on the planet? Why do old men have big ears? David Spiegelhalter reveals the answers to these and many other questions - questions that can only be addressed using statistical science.

Statistics has played a leading role in our scientific understanding of the world for centuries, yet we are all familiar with the way…

Fundamentals of Data Visualization: A Primer on Making Informative and Compelling Figures

Why did I love this book?

It is not enough for a data scientist to be able to analyze data and build ML models. You have to be able to communicate the insights to decision-makers concisely and accurately. This book shows you bad and good visualizations — you’ll be surprised by how often you would have defaulted to the bad way without the guidance provided by this book!

By Claus O. Wilke

What is this book about?

Effective visualization is the best way to communicate information from the increasingly large and complex datasets in the natural and social sciences. But with the increasing power of visualization software today, scientists, engineers, and business analysts often have to navigate a bewildering array of visualization choices and options.

This practical book takes you through many commonly encountered visualization problems, and it provides guidelines on how to turn large datasets into clear and compelling figures. What visualization type is best for the story you want to tell? How do you make informative figures that are visually pleasing? Author Claus O. Wilke…
