The best books if you want to become a data scientist

Valliappa Lakshmanan Author Of Data Science on the Google Cloud Platform: Implementing End-To-End Real-Time Data Pipelines: From Ingest to Machine Learning
By Valliappa Lakshmanan

Who am I?

I started my career as a research scientist building machine learning algorithms for weather forecasting. Twenty years later, I found myself at a precision agriculture startup creating models that provided guidance to farmers on when to plant, what to plant, etc. So, I am part of the movement from academia to industry. Now, at Google Cloud, my team builds cross-industry solutions and I see firsthand what our customers need in their data science teams. This set of books is what I suggest when a CTO asks how to upskill their workforce, or when a graduate student asks me how to break into the industry.


I wrote...

Data Science on the Google Cloud Platform: Implementing End-To-End Real-Time Data Pipelines: From Ingest to Machine Learning

By Valliappa Lakshmanan

What is my book about?

This hands-on guide shows data engineers and data scientists how to implement an end-to-end data pipeline, using statistical and machine learning methods and tools on Google Cloud Platform (GCP).

Through the course of this updated second edition, you'll work through a sample business decision by employing a variety of data science approaches. Follow along by implementing these statistical and machine learning solutions in your own project on GCP, and discover how this platform provides a transformative and more collaborative way of doing data science.

The books I picked & why

Effective Pandas: Patterns for Data Manipulation

By Matt Harrison

Why this book?

Even if you are ultimately going to be working with terabytes of data, you’ll start out doing exploratory data analysis. The tool that you’ll use for that is most likely going to be Pandas. One of the best investments that you can make when becoming a data scientist is to become a Pandas expert, and there is no better book than Harrison’s to help you get there. Plus, many of the interview questions you will face during the hiring process will probably involve Pandas. Blow your interviewers out of the water by showing them corners of the Pandas library they didn’t even know existed!


Jumpstart Snowflake: A Step-by-Step Guide to Modern Cloud Analytics

By Dmitry Anoshin, Dmitry Shirokov, Donna Strok

Why this book?

In industry, your data is very likely to live within a data warehouse such as BigQuery, Redshift, or Snowflake. Therefore, to be an effective data scientist in industry, you should learn how to use data warehouses effectively.

Once you learn data warehousing and SQL with any one of these products, it is quite easy to pick up another. So which one do you start with?

You can use Snowflake on all three of the major public clouds. Because it’s a standalone product, it is the most similar to a “traditional” data warehouse and can be picked up easily even if you are not familiar with cloud computing. That makes it a good data warehouse to start with, and is the reason my second book pick is this book on Snowflake.

BigQuery is also available on all three major public clouds, but it works best (and is used most commonly) on Google Cloud. Because BigQuery is truly serverless (you pay by the query and never deal with clusters or virtual data warehouses), it is quite unlike traditional data warehouses, and you will have to learn some public cloud concepts in order to use it. On the other hand, starting with BigQuery has several advantages. First, it offers 1 TB of querying per month for free. Second, it has machine learning built in, and Google Colab even offers a free Jupyter notebook from which to access BigQuery. Third, it’s the best choice for production use cases, as BigQuery is typically more scalable and less expensive than the alternatives. If you are willing to learn public cloud, start with the Definitive Guide to BigQuery.

AWS is the most widely used cloud, and Redshift is the most widely used data warehouse on AWS. Your organization probably already has a Redshift cluster set up and ready to go. The path of least resistance might be to learn data warehousing using the AWS book on Redshift.


Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die

By Eric Siegel

Why this book?

As a data scientist in industry, it is very helpful to understand the business context behind the problems that you are solving. In many cases, you are trying to predict behavior: who is likely to buy an item, who is likely to click on a link, who is likely to repay a loan, etc.

This book by Eric Siegel is a great introduction to predictive analytics as used in real life. It will help you frame data science problems in standard ways. For example, suppose you are asked to score sales leads so that salespeople can prioritize their efforts. How would you do it? The common way to frame this problem is to predict the customer lifetime value (LTV) of every sales lead. Before you can predict, though, you have to be able to analyze.

The way you estimate the LTV is to break the problem into three sub-problems: finding the average order value, the average number of transactions per year, and how long an average customer sticks with your product. Multiplying the three gives the LTV. Once you know how to estimate the LTV of existing customers, you will be able to create a system to predict LTV by comparing the attributes of the sales lead to your existing customer base. This is by no means obvious, and reading a book like this will help you learn the typical approach.
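
To make the three sub-problems concrete, here is a minimal Python sketch of one way to estimate the LTV of existing customers. It is an illustration only, not the method from Siegel's book: the CustomerHistory record, its field names, and the sample numbers are all hypothetical, and the estimate is simply the product of the three averages.

```python
from dataclasses import dataclass


@dataclass
class CustomerHistory:
    """Hypothetical summary of one existing customer's purchase history."""
    total_spend: float   # total revenue from this customer
    num_orders: int      # number of orders the customer placed
    years_active: float  # how long the customer stayed with the product


def estimate_ltv(customers):
    """Estimate average lifetime value as
    average order value x transactions per year x average lifespan in years."""
    total_spend = sum(c.total_spend for c in customers)
    total_orders = sum(c.num_orders for c in customers)
    total_years = sum(c.years_active for c in customers)

    avg_order_value = total_spend / total_orders       # sub-problem 1
    orders_per_year = total_orders / total_years       # sub-problem 2
    avg_lifespan_years = total_years / len(customers)  # sub-problem 3
    return avg_order_value * orders_per_year * avg_lifespan_years


# Made-up sample data, for illustration only.
customers = [
    CustomerHistory(total_spend=600.0, num_orders=6, years_active=2.0),
    CustomerHistory(total_spend=400.0, num_orders=4, years_active=2.0),
]
print(estimate_ltv(customers))
```

With these pooled averages the product reduces to the average total spend per customer. The decomposition is still useful in practice, because each factor can be estimated separately, for example for customers who have not yet churned and whose lifespan must be projected.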


The Art of Statistics: How to Learn from Data

By David Spiegelhalter

Why this book?

What if you are faced with a problem for which a standard approach doesn’t yet exist? In such a case, you will need to be able to figure out the approach from first principles. This book will help you learn how to derive insights starting from raw data.


Fundamentals of Data Visualization: A Primer on Making Informative and Compelling Figures

By Claus O. Wilke

Why this book?

It is not enough for a data scientist to be able to analyze data and build ML models. You have to be able to communicate the insights to decision-makers concisely and accurately. This book shows you bad and good visualizations side by side, and you’ll be surprised by how often you would have defaulted to the bad way without the guidance it provides!

