Data Science For Dummies
Discover how data science can help you gain in-depth insight into your business – the easy way!
Jobs in data science abound, but few people have the data science skills needed to fill these increasingly important roles.
Data Science For Dummies is the perfect starting point for IT professionals and students who want a quick primer covering all areas of the expansive data science space. With a focus on business cases, the book explores topics in big data, data science, and data engineering, and how these three areas are combined to produce tremendous value. If you want to pick up the skills you need to begin a new career or initiate a new project, reading this book will help you understand which technologies, programming languages, and mathematical methods to focus on. While this book serves as a wildly fantastic guide through the broad aspects of the topic, including the sometimes intimidating fields of big data and data science, it is not an instructional manual for hands-on implementation. Here’s what to expect in Data Science For Dummies:
Provides a background in big data and data engineering before moving on to data science and how it’s applied to generate value.
Includes coverage of big data frameworks and applications like Hadoop, MapReduce, Spark, MPP platforms, and NoSQL.
Explains machine learning and many of its algorithms, as well as artificial intelligence and the evolution of the Internet of Things.
Details data visualization techniques that can be used to showcase, summarize, and communicate the data insights you generate.
It’s a big, big data world out there – let Data Science For Dummies help you get started harnessing its power so you can gain a competitive edge for your organization.
Data Science For Dummies®
Published by: John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030-5774,
Copyright © 2015 by John Wiley & Sons, Inc., Hoboken, New Jersey
Published simultaneously in Canada
No part of this publication may be reproduced, stored in a retrieval system or
transmitted in any form or by any means, electronic, mechanical, photocopying,
recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the
1976 United States Copyright Act, without the prior written permission of the
Publisher. Requests to the Publisher for permission should be addressed to the
Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ
07030, (201) 748-6011, fax (201) 748-6008, or online at
Trademarks: Wiley, For Dummies, the Dummies Man logo, Dummies.com, Making
Everything Easier, and related trade dress are trademarks or registered trademarks of
John Wiley & Sons, Inc. and may not be used without written permission. All other
trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not
associated with any product or vendor mentioned in this book.
LIMIT OF LIABILITY/DISCLAIMER OF WARRANTY: THE PUBLISHER
AND THE AUTHOR MAKE NO REPRESENTATIONS OR WARRANTIES
WITH RESPECT TO THE ACCURACY OR COMPLETENESS OF THE
CONTENTS OF THIS WORK AND SPECIFICALLY DISCLAIM ALL
WARRANTIES, INCLUDING WITHOUT LIMITATION WARRANTIES OF
FITNESS FOR A PARTICULAR PURPOSE. NO WARRANTY MAY BE
CREATED OR EXTENDED BY SALES OR PROMOTIONAL MATERIALS.
THE ADVICE AND STRATEGIES CONTAINED HEREIN MAY NOT BE
SUITABLE FOR EVERY SITUATION. THIS WORK IS SOLD WITH THE
UNDERSTANDING THAT THE PUBLISHER IS NOT ENGAGED IN
RENDERING LEGAL, ACCOUNTING, OR OTHER PROFESSIONAL
SERVICES. IF PROFESSIONAL ASSISTANCE IS REQUIRED, THE
SERVICES OF A COMPETENT PROFESSIONAL PERSON SHOULD BE
SOUGHT. NEITHER THE PUBLISHER NOR THE AUTHOR SHALL BE
LIABLE FOR DAMAGES ARISING HEREFROM. THE FACT THAT AN
ORGANIZATION OR WEBSITE IS REFERRED TO IN THIS WORK AS A
CITATION AND/OR A POTENTIAL SOURCE OF FURTHER
INFORMATION DOES NOT MEAN THAT THE AUTHOR OR THE
PUBLISHER ENDORSES THE INFORMATION THE ORGANIZATION OR
WEBSITE MAY PROVIDE OR RECOMMENDATIONS IT MAY MAKE.
FURTHER, READERS SHOULD BE AWARE THAT INTERNET WEBSITES
LISTED IN THIS WORK MAY HAVE CHANGED OR DISAPPEARED
BETWEEN WHEN THIS WORK WAS WRITTEN AND WHEN IT IS READ.
For general information on our other products and services, please contact our
Customer Care Department within the U.S. at 877-762-2974, outside the U.S. at 317-572-3993, or fax 317-572-4002. For technical support, please visit
Wiley publishes in a variety of print and electronic formats and by print-on-demand.
Some material included with standard print versions of this book may not be included
in e-books or in print-on-demand. If this book refers to media such as a CD or DVD
that is not included in the version you purchased, you may download this material at
http://booksupport.wiley.com. For more information about Wiley products, visit
Library of Congress Control Number: 2014955780
ISBN 978-1-118-84155-6 (pbk); ISBN 978-1-118-84145-7 (ebk); ISBN 978-1-118-84152-5
Data Science For Dummies®
Visit www.dummies.com/cheatsheet/datascience to view this
book’s cheat sheet.
Table of Contents
About This Book
Icons Used in This Book
Beyond the Book
Where to Go from Here
Part I: Getting Started With Data Science
Chapter 1: Wrapping Your Head around Data Science
Seeing Who Can Make Use of Data Science
Looking at the Pieces of the Data Science Puzzle
Getting a Basic Lay of the Data Science Landscape
Chapter 2: Exploring Data Engineering Pipelines and Infrastructure
Defining Big Data by Its Four Vs
Identifying Big Data Sources
Grasping the Difference between Data Science and Data Engineering
Boiling Down Data with MapReduce and Hadoop
Identifying Alternative Big Data Solutions
Data Engineering in Action — A Case Study
Chapter 3: Applying Data Science to Business and Industry
Incorporating Data-Driven Insights into the Business Process
Distinguishing Business Intelligence and Data Science
Knowing Who to Call to Get the Job Done Right
Exploring Data Science in Business: A Data-Driven Business Success Story
Part II: Using Data Science to Extract Meaning from Your Data
Chapter 4: Introducing Probability and Statistics
Introducing the Fundamental Concepts of Probability
Introducing Linear Regression
Introducing Time Series Analysis
Chapter 5: Clustering and Classification
Introducing the Basics of Clustering and Classification
Identifying Clusters in Your Data
Chapter 6: Clustering and Classification with Nearest Neighbor
Making Sense of Data with Nearest Neighbor Analysis
Seeing the Importance of Clustering and Classification
Classifying Data with Average Nearest Neighbor Algorithms
Classifying with K-Nearest Neighbor Algorithms
Using Nearest Neighbor Distances to Infer Meaning from Point Patterns
Solving Real-World Problems with Nearest Neighbor Algorithms
Chapter 7: Mathematical Modeling in Data Science
Introducing Multi-Criteria Decision Making (MCDM)
Using Numerical Methods in Data Science
Mathematical Modeling with Markov Chains and Stochastic Methods
Chapter 8: Modeling Spatial Data with Statistics
Generating Predictive Surfaces from Spatial Point Data
Using Trend Surface Analysis on Spatial Data
Part III: Creating Data Visualizations that Clearly Communicate
Chapter 9: Following the Principles of Data Visualization Design
Understanding the Types of Visualizations
Focusing on Your Audience
Picking the Most Appropriate Design Style
Knowing When to Add Context
Knowing When to Get Persuasive
Choosing the Most Appropriate Data Graphic Type
Choosing Your Data Graphic
Chapter 10: Using D3.js for Data Visualization
Introducing the D3.js Library
Knowing When to Use D3.js (and When Not To)
Getting Started in D3.js
Understanding More Advanced Concepts and Practices in D3.js
Chapter 11: Web-Based Applications for Visualization Design
Using Collaborative Data Visualization Platforms
Visualizing Spatial Data with Online Geographic Tools
Visualizing with Open Source: Web-Based Data Visualization Platforms
Knowing When to Stick with Infographics
Chapter 12: Exploring Best Practices in Dashboard Design
Focusing on the Audience
Starting with the Big Picture
Getting the Details Right
Testing Your Design
Chapter 13: Making Maps from Spatial Data
Getting into the Basics of GIS
Analyzing Spatial Data
Getting Started with Open-Source QGIS
Part IV: Computing for Data Science
Chapter 14: Using Python for Data Science
Understanding Basic Concepts in Python
Getting on a First-Name Basis with Some Useful Python Libraries
Using Python to Analyze Data — An Example Exercise
Chapter 15: Using Open Source R for Data Science
Introducing the Fundamental Concepts
Previewing R Packages
Chapter 16: Using SQL in Data Science
Getting Started with SQL
Using SQL and Its Functions in Data Science
Chapter 17: Software Applications for Data Science
Making Life Easier with Excel
Using KNIME for Advanced Data Analytics
Part V: Applying Domain Expertise to Solve Real-World Problems Using Data Science
Chapter 18: Using Data Science in Journalism
Exploring the Five Ws and an H
Collecting Data for Your Story
Finding and Telling Your Data’s Story
Bringing Data Journalism to Life: Washington Post’s The Black Budget
Chapter 19: Delving into Environmental Data Science
Modeling Environmental-Human Interactions with Environmental Intelligence
Modeling Natural Resources in the Raw
Using Spatial Statistics to Predict for Environmental Variation across Space
Chapter 20: Data Science for Driving Growth in E-Commerce
Making Sense of Data for E-Commerce Growth
Optimizing E-Commerce Business Systems
Chapter 21: Using Data Science to Describe and Predict Criminal Activity
Temporal Analysis for Crime Prevention and Monitoring
Spatial Crime Prediction and Monitoring
Probing the Problems with Data Science for Crime Analysis
Part VI: The Part of Tens
Chapter 22: Ten Phenomenal Resources for Open Data
Digging through Data.gov
Checking Out Canada Open Data
Diving into data.gov.uk
Checking Out U.S. Census Bureau Data
Knowing NASA Data
Wrangling World Bank Data
Getting to Know Knoema Data
Queuing Up with Quandl Data
Exploring Exversion Data
Mapping OpenStreetMap Spatial Data
Chapter 23: Ten (or So) Free Data Science Tools and Applications
Making Custom Web-Based Data Visualizations with Free R Packages
Checking Out More Scraping, Collecting, and Handling Tools
Checking Out More Data Exploration Tools
Checking Out More Web-Based Visualization Tools
About the Author
Connect with Dummies
End User License Agreement
We live in exciting, even revolutionary times. As our daily interactions move from the
physical world to the digital world, nearly every action we take generates data.
Information pours from our mobile devices and our every online interaction. Sensors
and machines collect, store and process information about the environment around us.
New, huge data sets are now open and publicly accessible.
This flood of information gives us the power to make more informed decisions, react
more quickly to change, and better understand the world around us. However, it can be
a struggle to know where to start when it comes to making sense of this data deluge.
What data should one collect? What methods are there for reasoning from data? And,
most importantly, how do we use data to answer our most
pressing questions about our businesses, our lives, and our world?
Data science is the key to making this flood of information useful. Simply put, data
science is the art of wrangling data to predict our future behavior, uncover patterns to
help prioritize or provide actionable information, or otherwise draw meaning from
these vast, untapped data resources.
I often say that one of my favorite interpretations of the word “big” in Big Data is
“expansive.” The data revolution is spreading to so many fields that it is now
incumbent on people working in all professions to understand how to use data, just as
people had to learn how to use computers in the ’80s and ’90s. This book is designed to
help you do that.
I have seen firsthand how radically data science knowledge can transform
organizations and the world for the better. At DataKind, we harness the power of data
science in the service of humanity by engaging data science and social sector experts to
work on projects addressing critical humanitarian problems. We are also helping drive
the conversation about how data science can be applied to solve the world’s biggest
challenges. From using satellite imagery to estimate poverty levels to mining decades
of human rights violations to prevent further atrocities, DataKind teams have worked
with many different nonprofits and humanitarian organizations just beginning their data
science journeys. One lesson resounds through every project we do: The people and
organizations that are most committed to using data in novel and responsible ways are
the ones who will succeed in this new environment.
Just holding this book means you are taking your first steps on that journey, too.
Whether you are a seasoned researcher looking to brush up on some data science
techniques or are completely new to the world of data, Data Science For Dummies will
equip you with the tools you need to show off whatever you can dream up. You’ll be able
to demonstrate new findings from your physical activity data, to present new insights
from the latest marketing campaign, and to share new lessons about preventing the
spread of disease.
We truly are on the forefront of a new data age, and those who learn data science will be
able to take part in this thrilling new adventure, shaping our path forward in every
field. For you, that adventure starts now. Welcome aboard!
Founder and Executive Director of DataKind™
The power of big data and data science is revolutionizing the world. From the modern
business enterprise to the lifestyle choices of today’s digital citizen, data science
insights are driving changes and improvements in every arena. Although data science
may be a new topic to many, it’s a skill that anyone who wants to stay relevant
in his or her career field and industry needs to acquire.
Although other books dealing with data science tend to focus heavily on using
Microsoft Excel to learn basic data science techniques, Data Science For Dummies
goes deeper by introducing Python, the R statistical programming language, D3.js,
SQL, Excel, and a whole plethora of open-source applications that you can use to get
started in practicing data science. Some books on data science are needlessly wordy,
with authors going in circles trying to get to a point. Not so here. Unlike books
authored by stuffy-toned, academic types, I’ve written this book in friendly,
approachable language — because data science is a friendly and approachable subject!
To be honest, up until now, the data science realm has been dominated by a few select
data science wizards who tend to present the topic in a manner that’s unnecessarily
over-technical and intimidating. Basic data science isn’t that hard or confusing. Data
science is simply the practice of using a set of analytical techniques and methodologies
to derive and communicate valuable and actionable insights from raw data. The
purpose of data science is to optimize processes and to support improved data-informed
decision making, thereby generating an increase in value — whether value is
represented by number of lives saved, number of dollars retained, or percentage of
revenues increased. In Data Science For Dummies, I introduce a broad array of
concepts and approaches that you can use when extracting valuable insights from your
data.
Remember, a lot of times data scientists get so caught up analyzing the bark of the trees
that they simply forget to look for their way out of the forest. This is a common pitfall
that you should avoid at all costs. I’ve worked hard to make sure that this book presents
the core purpose of each data science technique and the goals you can accomplish by
using it.
About This Book
In keeping with the For Dummies brand, this book is organized in a modular, easy-to-access format. This format allows you to use the book as a practical guidebook and ad
hoc reference. In other words, you don’t need to read it through, cover to cover. Just take
what you want and leave the rest. I’ve taken great care to use real-world examples that
illustrate data science concepts that may otherwise be overly abstract.
Web addresses and programming code appear in monofont. If you’re reading a digital
version of this book on a device connected to the Internet, you can click a web address
to visit that website, like this: www.dummies.com.
In writing this book, I’ve assumed that readers are at least technical enough to have
mastered advanced Microsoft Excel — pivot tables, grouping, sorting, plotting, and the
like. Being strong in algebra, basic statistics, or even business calculus helps, as well.
Foolish or not, it’s my high hope that all readers have some subject-matter expertise to
which they can apply the skills presented in this book. Since data scientists must be
capable of intuitively understanding the implications and applications of the data
insights they derive, subject-matter expertise is a major component of data science.
Icons Used in This Book
As you make your way through this book, you’ll see the following icons in the
margin:
The Tip icon marks tips (duh!) and shortcuts that you can use to make the subject
at hand easier to master.
Remember icons mark the information that’s especially important to know. To
siphon off the most important information in each chapter, just skim through these
paragraphs.
The Technical Stuff icon marks information of a highly technical nature that
you can normally skip over.
The Warning icon tells you to watch out! It marks important information that
may save you headaches.
Beyond the Book
This book includes the following external resources:
Data Science Cheat Sheet: This book comes with a handy Cheat Sheet at
www.dummies.com/cheatsheet/datascience. The Cheat Sheet lists helpful
shortcuts, as well as abbreviated definitions for essential processes and concepts
described in the book. You can use it as a quick-and-easy reference when doing
data science tasks.
Online articles on the practical application of data science: This book has Parts
pages that link to www.dummies.com, where you can find a number of articles that
extend the topics covered. More specifically, these articles present best practices,
how-to’s, and case studies that exemplify the power of data science in practice. The
articles are available on the book’s Extras page
(www.dummies.com/extras/datascience).
Updates: I’ll be updating this book on a regular basis. You can find updates on the
Downloads tab of the book's product page. On the book’s Extras page
(www.dummies.com/extras/datascience), an article will either describe the update
or provide a link to take readers to the Downloads tab for access to updated content.
Any errata will appear in this section, as well.
Where to Go from Here
Just to reemphasize the point, this book’s modular design allows you to pick up and
start reading anywhere you want. Although you don’t need to read cover to cover, a
few good starter chapters include Chapters 1, 2, and 9.
Getting Started With Data Science
For great online content, check out http://www.dummies.com.
In this part …
Get introduced to the field of data science.
Define big data.
Explore solutions for big data problems.
See how a real-world business puts data science to good use.
Wrapping Your Head around Data Science
In This Chapter
Defining data science
Defining data science by its key components
Identifying viable data science solutions to your own data challenges
For quite some time now, we’ve all been absolutely deluged by data. It’s coming off of
every computer, every mobile device, every camera, and every sensor — and now it’s
even coming off of watches and other wearable technologies. It’s generated in every
social media interaction we make, every file we save, every picture we take, every
query we submit; it’s even generated when we do something as simple as get directions
to the closest ice cream shop from Google.
Although data immersion is nothing new, you may have noticed that the phenomenon
is accelerating. Lakes, puddles, and rivers of data have turned to floods and veritable
tsunamis of structured, semi-structured, and unstructured data that’s streaming from
almost every activity that takes place in both the digital and physical worlds. Welcome
to the world of big data!
If you’re anything like me, then you may have wondered, “What’s the point of all this
data? Why use valuable resources to generate and collect it?” Although even one
decade ago, no one was in a position to make much use of most of the data generated,
the tides today have definitely turned. Specialists known as data engineers are
constantly finding innovative and powerful new ways to capture, collate, and condense
unimaginably massive volumes of data, and other specialists known as data scientists
are leading change by deriving valuable and actionable insights from that data.
In its truest form, data science represents process and resource optimization. Data
science produces data insights — insights you can use to understand and improve your
business, your investments, your health, and even your lifestyle and social life. Using
data science is like being able to see in the dark. For any goal or pursuit you can
imagine, you can find data science methods to help you know and predict the most
direct route from where you are to where you want to be — and anticipate every
pothole in the road in between.
Seeing Who Can Make Use of Data Science
The terms data science and data engineering are often misused and confused, so let me
start off right here by clarifying that these two fields are, in fact, separate and distinct
domains of expertise. Data science is the practice of using computational methods to
derive valuable and actionable insights from raw datasets. Data engineering, on the
other hand, is an engineering domain that’s dedicated to overcoming data-processing
bottlenecks and data-handling problems for applications that utilize large volumes,
varieties, and velocities of data. In both data science and data engineering, it’s common
to work with the following three data varieties:
Structured data: Data that’s stored, processed, and manipulated in a traditional
relational database management system.
Unstructured data: Data that’s commonly generated from human activities and
that doesn’t fit into a structured database format.
Semi-structured data: Data that doesn’t fit into a structured database system, but
is nonetheless structured by tags that are useful for creating a form of order and
hierarchy in the data.
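The three data varieties above can be sketched in a few lines of Python. This is a toy illustration only; the records, keys, and values are all invented for the example:

```python
import json

# Structured data: rows with a fixed schema, like a relational database table
structured = [
    {"id": 1, "name": "Ada", "purchases": 3},
    {"id": 2, "name": "Bo", "purchases": 7},
]

# Semi-structured data: no fixed schema, but tags/keys (here, JSON) impose
# a form of order and hierarchy
semi_structured = json.loads(
    '{"user": "Ada", "events": [{"type": "click"}, {"type": "view"}]}'
)

# Unstructured data: free text from human activity, with no inherent schema
unstructured = "Ada wrote that the checkout page felt slow on mobile."

print(structured[0]["name"], semi_structured["events"][0]["type"])
```

Notice that the semi-structured record nests a list inside an object; that hierarchy is exactly what the tags provide and what a flat relational table lacks.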
A lot of people think only large organizations that have massive funding are
implementing data science methodologies to optimize and improve their business, but
that’s not the case. The proliferation of data has created a demand for insights, and this
demand is embedded in many aspects of our modern culture. Data and the need for data
insights are ubiquitous. Because organizations of all sizes are beginning to recognize
that they’re immersed in a sink-or-swim, data-driven, competitive environment, data
know-how emerges as a core and requisite function in almost every line of business.
So, what does this mean for the everyday person? First off, it means that our culture
has changed, and you have to keep up. It doesn’t, however, mean that you must go back
to school and complete a degree in statistics, computer science, or data science. In this
respect, the data revolution isn’t so different from any other change that’s hit industry
in the past. The fact is, in order to stay relevant, you only need to take the time and
effort to acquire the skills that keep you current. When it comes to learning how to do
data science, you can take some courses, educate yourself through online resources,
read books like this one, and attend events where you can learn what you need to know
to stay on top of the game.
Who can use data science? You can. Your organization can. Your employer can.
Anyone who has a bit of understanding and training can begin using data insights to
improve their lives, their careers, and the well-being of their businesses. Data science
represents a change in the way you approach the world. People used to act and hope for
an outcome, but data insights provide the vision people need to drive change and to
make good things happen. You can use data insights to bring about the following types
of improvements:
Optimize business systems and returns on investment (those crucial ROIs) for any
organization.
Improve the effectiveness of sales and marketing initiatives — whether that be part
of an organizational marketing campaign or simply a personal effort to secure
better employment opportunities for yourself.
Keep ahead of the pack on the very latest developments in every arena.
Keep communities safer.
Help make the world a better place for those less fortunate.
Looking at the Pieces of the Data Science Puzzle
To practice data science, in the true meaning of the term, you need the analytical know-how of math and statistics, the coding skills necessary to work with data, and an area of
subject-matter expertise. Without subject-matter expertise, you might as well call
yourself a mathematician or a statistician. Similarly, a software programmer without
subject-matter expertise and analytical know-how might better be considered a
software engineer or developer, but not a data scientist.
Because the demand for data insights is increasing exponentially, every area is forced
to adopt data science. As such, different flavors of data science have emerged. The
following are just a few titles under which experts of every discipline are using data
science — Ad Tech Data Scientist, Director of Banking Digital Analyst, Clinical Data
Scientist, Geo-Engineer Data Scientist, Geospatial Analytics Data Scientist, Retail
Personalization Data Scientist, and Clinical Informatics Analyst in Pharmacometrics.
Given the fact that it often seems you can’t keep track of who’s a data scientist without
a scorecard, in the following sections I take the time to spell out the key components
that would be part of any data science role.
Collecting, querying, and consuming data
Data engineers have the job of capturing and collating large volumes of structured,
unstructured, and semi-structured big data — data that exceeds the processing capacity
of conventional database systems because it’s too big, it moves too fast, or it doesn’t fit
the structural requirements of traditional database architectures. Again, data
engineering tasks are separate from the work that’s performed in data science, which
focuses more on analysis, prediction, and visualization. Despite this distinction, when a
data scientist collects, queries, and consumes data during the analysis process, he or she
performs work that’s very similar to that of a data engineer.
Although valuable insights can be generated from a single data source, oftentimes the
combination of several relevant sources delivers the contextual information required to
drive better data-informed decisions. A data scientist can work off of several datasets
that are stored in one database, or even in several different data warehouses. (For more
on working with combined datasets, see Chapter 3.) Other times, source data is stored
and processed on a cloud-based platform that’s been built by software and data
engineers.
No matter how the data is combined or where it’s stored, if you’re doing data science,
you almost always have to query the data — write commands to extract relevant
datasets from the data storage system, in other words. Most of the time, you use
Structured Query Language (SQL) to query data. (Chapter 16 is all about SQL, so if
the acronym scares you, jump ahead to that chapter right now.) Whether you’re using
an application or doing custom analyses by using a programming language such as R or
Python, you can choose from a number of universally accepted file formats. Those
formats include the following:
Comma-separated values (CSV) files: Almost every brand of desktop and web-based analysis application accepts this file type, as do commonly used scripting
languages such as Python and R.
Scripts: Most data scientists know how to use Python and R programming
languages to analyze and visualize data. These script files end with the extensions
.py and .r, respectively.
Application files: Excel is useful for quick-and-easy, spot-check analyses on
small- to medium-sized datasets. These application files have a .xls or .xlsx
extension. Geospatial analysis applications such as ArcGIS and QGIS save with
their own proprietary file formats (the .mxd extension for ArcGIS and the .qgs
extension for QGIS).
Web programming files: If you’re building custom web-based data visualizations,
you’ll likely use D3.js, a JavaScript library for data visualization. When working in
D3.js, your work is saved in .html files.
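As a minimal sketch of what querying looks like in practice, here's a hypothetical example that uses Python's built-in sqlite3 module as a stand-in for a real data storage system. The sales table and its values are invented purely for illustration:

```python
import sqlite3

# A tiny in-memory database standing in for a data warehouse
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 120.0), ("west", 80.0), ("east", 45.5)],
)

# The SQL query itself: extract the relevant dataset, aggregated by region
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # → [('east', 165.5), ('west', 80.0)]
```

The same SELECT statement would run essentially unchanged against a full-scale relational database; only the connection details differ.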
Making use of math and statistics
Data science relies heavily on a practitioner’s math and statistics skills precisely
because these are the skills needed in order to understand your data and its
significance. The skills are also valuable in data science because you can use them to
carry out predictive forecasting, decision modeling, and hypothesis testing.
Before launching into more detailed discussions on mathematical and statistical
methods, it’s important to stop right here and clearly explain the difference between the
fields of math and statistics. Mathematics uses deterministic numerical methods and
deductive reasoning to form a quantitative description of the world, while statistics is a
form of science that’s derived from mathematics, but that focuses on using a stochastic
approach — an approach based on probabilities — and inductive reasoning to form a
quantitative description of the world.
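To make the deterministic-versus-stochastic distinction concrete, here's a sketch that estimates the number π both ways: once with a deterministic numerical quadrature, and once with a probability-based Monte Carlo simulation. The sample size and random seed are arbitrary choices for the example:

```python
import math
import random

n = 100_000

# Deterministic: a numerical quadrature (left Riemann sum) of sqrt(1 - x^2)
# on [0, 1], which equals pi/4; the same inputs always give the same answer
quad = sum(math.sqrt(1 - (i / n) ** 2) for i in range(n)) / n
pi_deterministic = 4 * quad

# Stochastic: Monte Carlo simulation; the fraction of random points that
# land inside the quarter circle also approximates pi/4
random.seed(42)
hits = sum(1 for _ in range(n) if random.random() ** 2 + random.random() ** 2 <= 1)
pi_stochastic = 4 * hits / n

print(round(pi_deterministic, 4), round(pi_stochastic, 4))
```

Both numbers come out close to 3.1416, but the stochastic estimate changes with the random seed, while the deterministic one never does.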
Applying mathematical modeling to data science tasks
Data scientists use mathematical methods to build decision models, to generate
approximations, and to make predictions about the future. Chapter 7 presents some
complex applied mathematical approaches that are useful when working in data
science.
This book assumes that you’ve got a fairly solid skill set in basic math — it
would be advantageous if you’ve taken college-level calculus or even linear
algebra. I’ve taken great pains, however, to meet readers where they are. I realize that
you may be working from a limited mathematical knowledge base (advanced
algebra or maybe business calculus), so I’ve tried to convey advanced
mathematical concepts using a plain-language approach that’s easy for everyone
to understand.
Using statistical methods to derive insights
In data science, statistical methods are useful for getting a better understanding of your
data’s significance, for validating hypotheses, for simulating scenarios, and for making
predictive forecasts of future events. Advanced statistical skills are somewhat rare,
even among quantitative analysts, engineers, and scientists. If you want to go places in
data science, though, take some time to get up to speed on a few basic statistical
methods, like linear regression, ordinary least squares regression, Monte Carlo
simulations, and time series analysis. Each of these techniques is defined and described
in Chapter 4’s discussion on probability and statistics.
The good news is that you don’t have to know everything — it’s not like you need to
go out and get a master’s degree in statistics to do data science. You need to know just
a few fundamental concepts and approaches from statistics to solve problems. (I cover
several of these concepts and approaches in Chapter 4.)
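As a small taste of one such fundamental technique, here is a minimal ordinary least squares fit for simple linear regression, written straight from the closed-form formulas (slope = covariance of x and y divided by variance of x). The toy dataset is invented for illustration:

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # slope = cov(x, y) / var(x); intercept = mean(y) - slope * mean(x)
    slope = (
        sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
        / sum((xi - mx) ** 2 for xi in x)
    )
    return slope, my - slope * mx

# Hypothetical data generated by y = 2x + 1 exactly
x = [0, 1, 2, 3, 4]
y = [1, 3, 5, 7, 9]
slope, intercept = fit_line(x, y)
print(slope, intercept)  # → 2.0 1.0
```

Real data is never this clean, of course; with noisy observations the same formulas return the line that minimizes the sum of squared errors, which is the whole point of the method.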
Coding, coding, coding … it’s just part of the game
Coding is unavoidable when you’re working in data science. You need to be able to
write code so that you can instruct the computer how you want it to manipulate,
analyze, and visualize your data. Programming languages such as Python and R are
important for writing scripts for data manipulation and analysis, as well as for
making really cool custom interactive web-based data visualizations.
Although coding is a requirement for data science, it really doesn’t have to be this big
scary thing people make it out to be. Your coding can be as fancy and complex as you
want it to be, but you can also take a rather simple approach. Although these skills are
paramount to success, you can pretty easily learn enough coding to practice high-level
data science. I’ve dedicated Chapters 10, 14, 15, and 16 to helping you get up to speed
in using D3.js for web-based data visualization, coding in Python and in R, and
querying in SQL.
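To show just how unscary this can be, here's a minimal sketch of the sort of simple data manipulation I mean, in plain Python (the sales figures are invented for illustration):

```python
# Monthly sales figures (illustrative numbers only)
sales = {"Jan": 12500, "Feb": 9800, "Mar": 14300, "Apr": 11100}

# Manipulate the data: keep only months that beat the overall average
average = sum(sales.values()) / len(sales)
strong_months = {month: amount for month, amount in sales.items()
                 if amount > average}

print(f"Average monthly sales: {average:.0f}")
print("Above-average months:", sorted(strong_months))
```

A handful of lines like these already count as instructing the computer how to manipulate and analyze your data; fancier analyses build on exactly this kind of foundation.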
Applying data science to your subject area
There has been some measure of obstinacy from statisticians when it comes to
accepting the significance of data science. Many statisticians have cried out, “Data
science is nothing new! It’s just another name for what we’ve been doing all along.”
Although I can sympathize with their perspective, I’m forced to stand with the camp of
data scientists that markedly declare that data science is separate and definitely distinct
from the statistical approaches that comprise it.
My position on the unique nature of data science is to some extent based on the fact
that data scientists often use computer languages not used in traditional statistics and
utilize approaches derived from the field of mathematics. But the main point of
distinction between statistics and data science is the need for subject-matter expertise.
Due to the fact that statisticians usually have only a limited amount of expertise in
fields outside of statistics, they’re almost always forced to consult with a subject-matter
expert to verify exactly what their findings mean and to decide the best direction in
which to proceed. Data scientists, on the other hand, are required to have a strong
subject-matter expertise in the area in which they’re working. Data scientists generate
deep insights and then use their domain-specific expertise to understand exactly what
those insights mean with respect to the area in which they’re working. The list below
shows a few ways in which subject matter experts are using data science to enhance
performance in their respective industries:
Engineers use machine learning to optimize energy efficiency in modern buildings.
Clinical data scientists work on the personalization of treatment plans and use healthcare informatics to predict and preempt future health problems in at-risk patients.
Marketing data scientists use logistic regression to predict and preempt customer churn (the loss of customers to a competitor's product or service, in other words) — I cover decreasing customer churn in more detail in Chapter 3 and Chapter 20.
Data journalists scrape websites for fresh data in order to discover and report the
latest breaking news stories (I talk more about data journalism in Chapter 18).
Data scientists in crime analysis use spatial predictive modeling to predict,
preempt, and prevent criminal activities (see Chapter 21 for all the details on using
data science to describe and predict criminal activity).
Data do-gooders use machine learning to classify and report vital information about
disaster-affected communities for real-time decision support in humanitarian
response, which you can read about in Chapter 19.
Communicating data insights
Another skill set is paramount to a data scientist’s success (and may not be immediately
obvious): As a data scientist, you must have sharp oral and written communication skills. If you can't communicate, all the knowledge and insight in the world will do nothing for your organization. You need to be able to explain data insights in a way that staff members can understand, and you need to be able to produce clear and meaningful data visualizations and written narratives. Most of the
time, people need to see something for themselves in order to understand. Data
scientists must be creative and pragmatic in their means and methods of
communication. (I cover the topics of data visualization and data-driven storytelling in
much greater detail in Chapter 9 and Chapter 18.)
Getting a Basic Lay of the Data Science Landscape
Organizations and their leaders are still grappling with how to best use big data and
data science. Most of them know that advanced analytics is positioned to bring a
tremendous competitive edge to their organization, but very few have any idea about
the options that are available or the exact benefits that data science can deliver. In the
following sections, I introduce the major data science solution alternatives and the
benefits that a data science implementation can deliver.
Exploring data science solution alternatives
When looking to implement data science across an organization, or even just across a
department, three main approaches are available: You can build an in-house data
science team, outsource the work to external data scientists, or use a cloud-based
solution that can deliver the power of data analytics to professionals who have only a
modest level of data literacy.
Building your own in-house team
Here are three options for building an in-house data science team:
Train existing employees. This is a lower-cost alternative. If you want to equip
your organization with the power of data science and analytics, then data science
training can transform existing staff into data-skilled, highly specialized subject-matter experts for your in-house team.
Train existing employees and hire some experts. Another good option is to train
existing employees to do high-level data science tasks, and also bring on a few new
hires to fulfill your more advanced data science problem-solving and strategy requirements.
Hire experts. Some organizations try to fill their requirements by hiring advanced
data scientists or fresh graduates with degrees in data science. The problem with
this approach is that there aren’t enough of these people to go around, and if you do
find someone who’s willing to come onboard, he or she is going to have very high
salary requirements. Remember, in addition to the math, statistics, and coding
requirements, data scientists must also have a high level of subject matter expertise
in the specific field where they’re working. That’s why it’s extraordinarily difficult
to find these individuals. Until universities make data literacy an integral part of
every educational program, finding highly specialized and skilled data scientists to
satisfy organizational requirements will be nearly impossible.
Outsourcing requirements to private data science consultants
Many organizations prefer to outsource their data science and analytics requirements to
an outside expert. There are two general routes: Outsource for the development of a
comprehensive data science strategy that serves your entire organization, or outsource
for piecemeal, individual data science solutions to specific problems that arise, or have
arisen, within your organization.
Outsourcing for comprehensive data science strategy development
If you want to build an advanced data science implementation for your organization,
you can hire a private consultant to help you with a comprehensive strategy
development. This type of service is likely going to cost you, but you can receive
tremendously valuable insights in return. A strategist will know about the options
available to meet your requirements, as well as the benefits and drawbacks of each.
With strategy in-hand and an on-call expert available to help you, you can much more
easily navigate the task of building an internal team.
Outsourcing for data science solutions to specific problems
If you’re not prepared for the rather involved process of comprehensive strategy design
and implementation, you have the option to contract smaller portions of work out to a
private data science consultant. This spot-treatment approach could still deliver the
benefits of data science without requiring you to reorganize the structure and financials
of your entire organization.
Leveraging cloud-based platform solutions
Some have seen the explosion of big data and data science coming from a long way off.
Although it’s still new to most, professionals and organizations in the know have been
working fast and furious to prepare. A few organizations have expended great effort
and expense to develop data science solutions that are accessible to all. Cloud applications such as IBM's Watson Analytics (www.ibm.com/analytics/watson-analytics) offer users code-free, automated data services — from cleanup and
statistical modeling to analysis and data visualization. Although you still need to
understand the statistical, mathematical, and substantive relevance of the data insights,
applications such as Watson Analytics can deliver some powerful results without
requiring users to know how to write code or scripts.
If you decide to use cloud-based platform solutions to help your organization
reach its data science objectives, remember that you’ll still need in-house staff
who are trained and skilled to design, run, and interpret the quantitative results
from these platforms. The platform will not do away with the need for in-house
training and data science expertise — it will merely augment your organization so
that it can more readily achieve its objectives.
Identifying the obvious wins
Through this book, I hope to show you the power of data science and how you can use
that power to more quickly reach your personal and professional goals. No matter the
sector in which you work, acquiring data science skills can transform you into a more
marketable professional. The following is just a small list of benefits that data science
and analytics deliver across key industry sectors:
Benefits for corporations, small and medium-sized enterprises (SMEs), and e-commerce businesses: Production-costs optimization, sales maximization,
marketing ROI increases, staff-productivity optimization, customer-churn
reduction, customer lifetime-value increases, inventory requirements and sales
predictions, pricing-model optimization, fraud detection, and logistics
Benefits for governments: Business-process and staff-productivity optimization,
management decision-support enhancements, finance and budget forecasting,
expenditure tracking and optimization, and fraud detection
Benefits for academia: Resource-allocation improvements, student performance
management improvements, drop-out reductions, business-process optimization,
finance and budget forecasting, and recruitment ROI increases
Exploring Data Engineering Pipelines and Infrastructure
In This Chapter
Defining big data
Looking at some sources of big data
Distinguishing between data science and data engineering
Exploring solutions for big data problems
Checking out a real-world data engineering project
There’s a lot of hype around big data these days, but most people don’t really know or
understand what it is or how they can use it to improve their lives and livelihoods. This
chapter defines the term big data, explains where big data comes from and how it’s
used, and outlines the roles that data engineers and data scientists play in the big data
ecosystem. In this chapter, I introduce the fundamental big data concepts that you need
in order to start generating your own ideas and plans on how to leverage big data and
data science to improve your life and business workflow.
Defining Big Data by Its Four Vs
Big data is data that exceeds the processing capacity of conventional database systems
because it’s too big, it moves too fast, or it doesn’t fit the structural requirements of
traditional database architectures. Whether data volumes rank in the terabyte or
petabyte scales, data engineering solutions must be designed to meet requirements for
the data’s intended destination and use.
When you’re talking about regular data, you’re likely to hear the words
kilobyte or gigabyte used — 10³ and 10⁹ bytes, respectively. In contrast, when you're talking about big data, words like terabyte and petabyte are thrown around — 10¹² and 10¹⁵ bytes, respectively. A byte is an 8-bit unit of data.
Four characteristics (the four Vs) define big data: volume, velocity, variety, and value.
Since the four Vs of big data are continually growing, newer, more innovative data
technologies must continuously be developed to manage big data problems.
Whenever you’re in doubt, use the 4V criteria to determine whether you have a
big-data or regular-data problem on your hands.
Grappling with data volume
The lower limits of big data volumes range from a few terabytes up to tens of petabytes on an annual basis. The volume numbers by which a big data set is defined
have no upper limit. In fact, the volumes of most big data sets are increasing
exponentially on a yearly basis.
Handling data velocity
Automated machinery and sensors are generating high-velocity data on a continual
basis. In engineering terms, data velocity is data volume per unit time. Big data
velocities range anywhere from 30 kilobytes (KB) per second up to 30 gigabytes (GB) per second. High-velocity, real-time data streams present an obstacle to
timely decision making. The capabilities of data-handling and data-processing
technologies often limit data velocities.
Dealing with data variety
Big data makes everything more complicated by adding unstructured and semi-structured data in with the structured datasets. This is called high-variety data. High-variety data sources can be derived from data streams that are generated from social networks or from automated machinery.
Structured data is data that can be stored, processed, and manipulated in a traditional
relational database management system. This data can be generated by humans or
machines, and is derived from all sorts of sources, from click-streams and web-based
forms to point of sale transactions and sensors. Unstructured data comes completely
unstructured — it’s commonly generated from human activities and doesn’t fit into a
structured database format. Such data could be derived from blog posts, emails, and
Word documents. Semi-structured data is data that doesn’t fit into a structured database
system, but is nonetheless structured by tags that are useful for creating a form of order
and hierarchy in the data. Semi-structured data is commonly found in database and file
systems. It can be stored as log files, XML files, or JSON data files.
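As an illustration of the semi-structured case, here's a minimal sketch showing how a JSON record (a made-up log entry of my own invention) carries its own tags, which create order and hierarchy even though the record never lived in a relational table:

```python
import json

# A made-up semi-structured log record: tagged, but not a database row
record = json.loads("""
{
  "timestamp": "2016-03-01T09:30:00Z",
  "user": {"id": 1017, "country": "US"},
  "event": "click",
  "tags": ["promo", "mobile"]
}
""")

# The tags let code navigate the nested structure directly
print(record["user"]["country"])
print(len(record["tags"]), "tags on this event")
```

Notice that different records in the same stream could carry different tags entirely; that flexibility is exactly what keeps this data out of a rigid relational schema.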
Creating data value
In its raw form, most big data is low value — in other words, the value-to-data quantity
ratio is low in raw big data. Big data is comprised of huge numbers of very small
transactions that come in a variety of formats. These incremental components of big
data produce true value only after they're rolled up and analyzed. Data engineers have
the job of rolling it up and data scientists have the job of analyzing it.
Identifying Big Data Sources
Big data is being generated by humans, machines, and sensors everywhere, on a
continual basis. Some typical sources include data from social media, financial
transactions, health records, click-streams, log files, and the internet of things — a web
of digital connections that joins together the ever-expanding array of electronic devices
we use in our everyday lives. Figure 2-1 shows a variety of popular big data sources.
Figure 2-1: A diagram of popular big data sources.
Grasping the Difference between Data
Science and Data Engineering
Data science and data engineering are two different branches within the big data
paradigm — an approach wherein huge velocities, varieties, and volumes of structured,
unstructured, and semi-structured data are being captured, processed, stored, and
analyzed using a set of techniques and technologies that is completely novel compared
to those that were used in decades past.
Both are useful for deriving knowledge and actionable insights out of raw data. Both
are essential elements for any comprehensive decision-support system, and both are
extremely helpful when formulating robust strategies for future business management
and growth. Although the terms data science and data engineering are often used
interchangeably, they’re definitely distinct domains of expertise. In the following
sections, I introduce concepts that are fundamental to data science and data
engineering, and then I show you the differences between how these two roles function
in an organization's data processing system.
Defining data science
If science is a systematic method by which people study and explain domain-specific phenomena that occur in the natural world, then you can think of data science as the
scientific domain that’s dedicated to knowledge discovery via data analysis.
With respect to data science, the term domain-specific refers to the industry sector or subject matter domain that data science methods are being used to examine.
Data scientists use mathematical techniques and algorithmic approaches to derive
solutions to complex business and scientific problems. Data science practitioners use
its methods to derive insights that are otherwise unattainable. Both in business and in
science, data science methods can provide more robust decision making capabilities. In
business, the purpose of data science is to empower businesses and organizations with
the data information that they need to optimize organizational processes for maximum
efficiency and revenue generation. In science, data science methods are used to derive
results and develop protocols for achieving the specific scientific goal at hand.
Data science is a vast and multi-disciplinary field. To truly call yourself a data scientist,
you need to have expertise in math and statistics, computer programming, and your
own domain-specific subject matter.
Using data science skills, you can do things like
Use machine learning to optimize energy usage and lower corporate carbon footprints.
Optimize tactical strategies to achieve goals in business and in science.
Predict unknown contaminant levels from sparse environmental datasets.
Design automated theft and fraud prevention systems to detect anomalies and
trigger alarms based on algorithmic results.
Craft site-recommendation engines for use in land acquisitions and real estate development.
Implement and interpret predictive analytics and forecasting techniques for net
increases in business value.
Data scientists must have extensive and diverse quantitative expertise to be able to
solve these types of problems.
Machine learning is the practice of applying algorithms to learn from and
make automated predictions about data.
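To make that definition concrete, here's a toy sketch of the learn-then-predict idea, using a one-nearest-neighbor rule I've chosen purely for illustration (the data points are invented, and real projects would reach for a proper library):

```python
def nearest_neighbor_predict(training_data, new_point):
    """Predict a label for new_point by copying the label of the
    closest training example (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, label = min(training_data,
                   key=lambda pair: sq_dist(pair[0], new_point))
    return label

# Invented labeled examples: (features, label)
training_data = [((1.0, 1.0), "small"), ((1.2, 0.8), "small"),
                 ((6.0, 6.5), "large"), ((5.5, 7.0), "large")]

# The prediction comes from the data, not from hand-written rules
print(nearest_neighbor_predict(training_data, (1.1, 0.9)))
```

The algorithm was never told what "small" or "large" means; it learned the distinction from the labeled examples, which is the essence of machine learning.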
Defining data engineering
If engineering is the practice of using science and technology to design and build
systems that solve problems, then you can think of data engineering as the engineering
domain that’s dedicated to overcoming data-processing bottlenecks and data-handling
problems for applications that utilize big data.
Data engineers use skills in computer science and software engineering to design
systems for, and solve problems with, handling and manipulating big data sets. Data
engineers have experience working with and designing real-time processing
frameworks and Massively Parallel Processing (MPP) platforms, as well as relational
database management systems. They generally code in Java, C++, and Python. They
know how to deploy Hadoop or MapReduce to handle, process, and refine big data into
more manageably sized datasets. Simply put, with respect to data science, the purpose
of data engineering is to engineer big data solutions by building coherent, modular, and
scalable data processing platforms from which data scientists can subsequently derive insights.
Most engineered systems are built systems — systems that are constructed or
manufactured in the physical world. Data engineering is different, though. It
involves designing, building, and implementing software solutions to problems in
the data world — a world that can seem pretty abstract when compared to the
physical reality of the Golden Gate Bridge or the Aswan Dam.
Using data engineering skills, you can do things like
Build large-scale Software as a Service (SaaS) applications.
Build and customize Hadoop and MapReduce applications.
Design and build relational databases and highly scaled distributed architectures for
processing big data.
Extract, transform, and load (ETL) data from one database into another.
Data engineers need solid skills in computer science, database design, and software
engineering to be able to perform this type of work.
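To give you a feel for the extract, transform, and load (ETL) item in the list above, here's a minimal sketch using Python's built-in sqlite3 module. The table names and figures are invented; a real ETL job would run between two production databases:

```python
import sqlite3

source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")

# Source database holds raw transaction amounts in cents
source.execute("CREATE TABLE raw_sales (item TEXT, cents INTEGER)")
source.executemany("INSERT INTO raw_sales VALUES (?, ?)",
                   [("widget", 1050), ("gadget", 2599), ("widget", 499)])

# Extract the rows, transform cents to dollars, load into the target schema
target.execute("CREATE TABLE sales (item TEXT, dollars REAL)")
rows = source.execute("SELECT item, cents FROM raw_sales")
target.executemany("INSERT INTO sales VALUES (?, ?)",
                   [(item, cents / 100.0) for item, cents in rows])

total = target.execute("SELECT SUM(dollars) FROM sales").fetchone()[0]
print(f"Total loaded: ${total:.2f}")
```

Scale the same extract-transform-load pattern up to terabytes and distributed systems and you have a core piece of the data engineer's day job.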
Software as a Service (SaaS) is a term that describes cloud-hosted software
services that are made available to users via the Internet.
Comparing data scientists and data engineers
The roles of data scientist and data engineer are frequently confused and intertwined by hiring managers. If you look around at most position descriptions for
companies that are hiring, they often mismatch the titles and roles, or simply expect
applicants to do both data science and data engineering.
If you’re hiring someone to help you make sense of your data, be sure to define
your requirements very clearly before writing the position description. Since a
data scientist must also have subject matter expertise in the particular area in
which they work, this requirement generally precludes a data scientist from also
having expertise in data engineering (although some data scientists do have
experience using engineering data platforms). And if you hire a data engineer who
has data-science skills, he or she generally won’t have much subject-matter
expertise outside of the data domain. Be prepared to call in a subject-matter expert
to help him or her.
Because so many organizations combine and confuse roles in their data projects, data
scientists are sometimes stuck spending a lot of time learning to do the job of a data engineer, and vice versa. To get the highest-quality work product in the least amount of time, hire a data engineer to process your data and a data scientist to make sense of it for you.
Lastly, keep in mind that data engineers and data scientists are just two small roles
within a larger organizational structure. Managers, middle-level employees, and
organizational leaders also play a huge part in the success of any data-driven initiative.
The primary benefit of incorporating data science and data engineering into your projects is to leverage your external and internal data to strengthen your organization's decision-support capabilities.
Boiling Down Data with MapReduce and Hadoop
Because big data’s four Vs (volume, velocity, variety, and value) don’t allow for the
handling of big data using traditional relational database management systems, data
engineers had to get innovative. To get around the limitations of relational database
management systems, data engineers use the Hadoop data-processing platform (with its
MapReduce programming paradigm) to boil big data down to smaller datasets that are
more manageable for data scientists to analyze. As a distributed, parallel processing
framework, MapReduce results in data processing speeds that are lightning-fast
compared to those available through traditional methods.
In the following sections, I introduce you to the MapReduce paradigm, Hadoop data-processing platforms, and the components that comprise Hadoop. I also introduce the
programming languages that you can use to develop applications in these frameworks.
Digging into MapReduce
MapReduce is a programming paradigm that was designed to allow parallel distributed
processing of large sets of data, converting them to sets of tuples, and then combining
and reducing those tuples into smaller sets of tuples. In layman’s terms, MapReduce
was designed to take big data and use parallel distributed computing to turn big data
into little- or regular-sized data.
Parallel distributed processing refers to a powerful framework where mass
volumes of data are processed very quickly by distributing processing tasks across
clusters of commodity servers. With respect to MapReduce, tuples refer to key-value pairs by which data is grouped, sorted, and processed.
MapReduce jobs work via map and reduce process operation sequences across a
distributed set of servers. In the map task, you delegate your data to key-value pairs,
transform it, and filter it. Then you assign the data to nodes for processing. In the
reduce task, you aggregate that data down to smaller sized datasets. Data from the
reduce step is transformed into a standard key-value format — where the key acts as the
record identifier and the value is the value that’s being identified by the key. The
clusters’ computing nodes process the map and reduce tasks that are defined by the
user. This work is done in accordance with the following two steps:
1. Map the data. The incoming data must first be delegated into key-value pairs and
divided into fragments, which are then assigned to map tasks. Each computing
cluster — a group of nodes that are connected to each other and perform a shared
computing task — is assigned a number of map tasks, which are subsequently
distributed among its nodes. Upon processing of the key-value pairs, intermediate
key-value pairs are generated. The intermediate key-value pairs are sorted by their
key values, and this list is divided into a new set of fragments. The number of these new fragments is always the same as the number of reduce tasks.
2. Reduce the data. Every reduce task has a fragment assigned to it. The reduce task
simply processes the fragment and produces an output, which is also a key-value
pair. Reduce tasks are also distributed among the different nodes of the cluster.
After the task is completed, the final output is written onto a file system.
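The two steps above can be sketched as a toy word count in plain Python. Real MapReduce distributes this same logic across many nodes, but the key-value flow is identical (the sample documents are invented):

```python
from itertools import groupby
from operator import itemgetter

documents = ["big data moves fast", "big data is big"]

# Map: delegate each word to a (key, value) pair
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle/sort: order the intermediate pairs by key so equal keys are adjacent
mapped.sort(key=itemgetter(0))

# Reduce: aggregate each key's values into a single count
counts = {key: sum(value for _, value in group)
          for key, group in groupby(mapped, key=itemgetter(0))}

print(counts)
```

On a cluster, the mapped pairs would be spread across many machines and each reduce task would receive one fragment of the sorted keys; here everything simply runs in one process.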
In short, you can quickly and efficiently boil down and begin to make sense of a huge
volume, velocity, and variety of data by using map and reduce tasks to tag your data by
(key, value) pairs, and then reduce those pairs into smaller sets of data through
aggregation operations — operations that combine multiple values from a dataset into
a single value. A diagram of the MapReduce architecture can be found in Figure 2-2.
Figure 2-2: A diagram of a MapReduce architecture.
If your data doesn't lend itself to being tagged and processed via keys, values, and aggregation, then MapReduce generally isn't a good fit for your needs.
If you’re using MapReduce as part of a Hadoop solution, then the final output
is written onto the Hadoop Distributed File System (HDFS). HDFS is a file
system that includes clusters of commodity servers that are used to store big data.
HDFS makes big data handling and storage financially feasible by distributing
storage tasks across clusters of cheap commodity servers.
Hadoop is an open-source data processing tool that was developed by the Apache
Software Foundation. Hadoop is currently the go-to program for handling huge
volumes and varieties of data because it was designed to make large-scale computing
more affordable and flexible. With the arrival of Hadoop, mass data processing has
been introduced to significantly more people and more organizations.
Hadoop can offer you a great solution to handle, process, and group mass streams of
structured, semi-structured, and unstructured data. By setting up and deploying
Hadoop, you get a relatively affordable way to begin using and drawing insights from
all of your organization’s data, rather than just continuing to rely solely on that
transactional dataset you have sitting over in an old data warehouse somewhere.
Hadoop is one of the most popular programs available for large-scale computing
requirements. Hadoop provides a map-and-reduce layer that’s capable of handling the
data processing requirements of most big data projects.
Sometimes the data gets too big and fast for even Hadoop to handle. In these
cases, organizations are turning to alternative, more-customized MapReduce deployments.
Hadoop uses clusters of commodity hardware for storing data. Hardware in each
cluster is connected, and this hardware is comprised of commodity servers — low-cost,
low-performing generic servers that offer powerful computing capabilities when run in
parallel across a shared cluster. These commodity servers are also called nodes.
Commoditized computing dramatically decreases the costs involved in handling and
storing big data.
Hadoop is comprised of the following two components:
A distributed processing framework: Hadoop uses Hadoop MapReduce as its
distributed processing framework. Again, a distributed processing framework is a
powerful framework where processing tasks are distributed across clusters of nodes
so that large data volumes can be processed very quickly across the system as a whole.
A distributed file system: Hadoop uses the Hadoop Distributed File System
(HDFS) as its distributed file system.
The workloads of applications that run on Hadoop are divided among the nodes of the
Hadoop cluster, and then the output is stored on the HDFS. The Hadoop cluster can be
comprised of thousands of nodes. To keep the costs of input/output (I/O) processes low,
Hadoop MapReduce jobs are performed as close to the data as possible. This means
that the reduce-task processors are positioned as closely as possible to the outgoing
map task data that needs to be processed. This design facilitates sharing of
computational requirements in big data processing.
Hadoop also supports hierarchical organization. Some of its nodes are classified as
master nodes, and others are categorized as slaves. The master service, known as
JobTracker, is designed to control several slave services. Slave services (also called
TaskTrackers) are distributed one to each node. The JobTracker controls the
TaskTrackers and assigns Hadoop MapReduce tasks to them. In a newer version of
Hadoop, known as Hadoop 2, a resource manager called Hadoop YARN was added.
With respect to MapReduce in Hadoop, YARN acts as an integrated system that
performs resource management and scheduling functions.
Hadoop processes data in batch. Consequently, if you're working with real-time, streaming data, you won't be able to use Hadoop to handle your big data issues. This said, it's very useful for solving many other types of big data problems.
Seeing where Java, C++, and Python fit into your big data project
You can program Hadoop using Java or Python. Hadoop supports programs that are written using a small
Application Program Interface (API) in either of these languages. To run scripts and binaries in clusters’ nodes in
Hadoop, you have to use the Hadoop-specific string I/O convention. Also, Hadoop allows you to use abstracted
forms of map and reduce. You can program these functions in Python and Lisp.
Identifying Alternative Big Data Solutions
Looking past Hadoop, you can see alternative big data solutions on the horizon. These
solutions make it possible to work with big data in real-time or to use alternative
database technologies to handle and process it. In the following sections, I introduce
you first to the real-time processing frameworks, then the Massively Parallel
Processing (MPP) platforms, and finally the NoSQL databases that allow you to work
with big data outside of the Hadoop environment.
You should be aware of something referred to as ACID compliance, short for
Atomicity, Consistency, Isolation, and Durability compliance. ACID compliance is
a standard by which accurate and reliable database transactions are guaranteed. In
big data solutions, most database systems are not ACID compliant, but this does
not necessarily pose a major problem. That’s because most big data systems use
Decision Support Systems (DSS) that batch process data before that data is read
out. DSS are information systems that are used for organizational decision support. Non-transactional DSS have no real ACID compliance requirements.
Introducing real-time processing frameworks
Remember how Hadoop is a batch processor and can’t process real-time, streaming
data? Well, sometimes you might need to query big data streams in real-time … and
you just can’t do this sort of thing using Hadoop. In these cases, use a real-time
processing framework instead. A real-time processing framework is — as its name
implies — a framework that is able to process data in real-time (or near real-time) as
that data streams and flows into the system. Essentially, real-time processing
frameworks are the antithesis of the batch processing frameworks deployed in Hadoop.
Real-time processing frameworks can be classified into the following two categories:
Frameworks that lower the overhead of MapReduce tasks to increase the
overall time efficiency of the system: Solutions in this category include Apache
Storm and Apache Spark for near–real-time stream processing.
Frameworks that deploy innovative querying methods to facilitate real-time
querying of big data: Some solutions in this category include Google’s Dremel,
Apache Drill, Shark for Apache Hive, and Cloudera’s Impala.
Real-time, stream processing frameworks are quite useful in a multitude of industries
— from stock and financial market analyses to e-commerce optimizations, and from
real-time fraud detection to optimized order logistics. Regardless of the industry in
which you work, if your business is impacted by real-time data streams that are
generated by humans, machines, or sensors, then a real-time processing framework
would be helpful to you in optimizing and generating value for your organization.
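To see what "processing data as it arrives" means in practice, here's a framework-free Python sketch that updates its aggregates the moment each event lands, rather than waiting for a completed batch. The trade-tick events are hypothetical; real frameworks like Storm and Spark add the distribution, windowing, and fault tolerance that make this work at scale.

```python
def running_stats(stream):
    """Consume events one at a time and refresh the aggregates
    immediately; this is the essence of stream processing, as opposed
    to batch-processing a finished dataset."""
    count, total = 0, 0.0
    for event in stream:
        count += 1
        total += event["price"]
        yield {"count": count, "avg": total / count}

# Hypothetical trade ticks arriving one by one:
ticks = ({"price": p} for p in [10.0, 12.0, 11.0])
for snapshot in running_stats(ticks):
    print(snapshot)  # a fresh summary after every single event
```

Notice that each snapshot is available before the next event arrives, which is exactly the property that batch-oriented MapReduce jobs can't give you.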
Introducing Massively Parallel Processing (MPP)
Massively Parallel Processing (MPP) platforms can be used instead of MapReduce as
an alternative approach for distributed data processing. If your goal is to deploy parallel
processing on a traditional data warehouse, then an MPP may be the perfect solution.
To understand how MPP compares to a standard MapReduce parallel processing
framework, consider the following: MPP runs parallel computing tasks on costly
custom hardware, whereas MapReduce runs them on cheap commodity servers.
Consequently, MPP's processing capabilities come at a restrictive cost. That said, MPP is
quicker and easier to use than standard MapReduce jobs. That's because MPP can be
queried using Structured Query Language (SQL), whereas native MapReduce jobs are
controlled through the more complicated Java programming language.
Well-known MPP vendors and products include the old-school Teradata
platform (www.teradata.com/), plus newer solutions like EMC2’s Greenplum
DCA (www.emc.com/campaign/global/greenplumdca/index.htm), HP’s Vertica
(www.vertica.com/), IBM's Netezza (www01.ibm.com/software/data/netezza/), and Oracle’s Exadata
Introducing NoSQL databases
Traditional relational database management systems (RDBMS) aren’t equipped to
handle big data demands. That’s because traditional relational databases are designed to
handle only relational datasets that are constructed of data that’s stored in clean rows
and columns and thus are capable of being queried via Structured Query Language
(SQL). RDBM systems are not capable of handling unstructured and semi-structured
data. Moreover, RDBM systems simply don’t have the processing and handling
capabilities that are needed for meeting big data volume and velocity requirements.
This is where NoSQL comes in. NoSQL databases, like MongoDB, are non-relational,
distributed database systems that were designed to rise to the big data challenge.
NoSQL databases step out past the traditional relational database architecture and offer
a much more scalable, efficient solution. NoSQL systems facilitate non-SQL data
querying of non-relational or schema-free, semi-structured and unstructured data. In
this way, NoSQL databases are able to handle the structured, semi-structured, and
unstructured data sources that are common in big data systems.
NoSQL offers four categories of non-relational databases — graph databases,
document databases, key-value stores, and column family stores. Since NoSQL offers
native functionality for each of these separate types of data structures, it offers very
efficient storage and retrieval functionality for most types of non-relational data. This
adaptability and efficiency makes NoSQL an increasingly popular choice for handling
big data and for overcoming processing challenges that come along with it.
There is somewhat of a debate about the significance of the name NoSQL.
Some argue that NoSQL stands for Not Only SQL, while others argue that the
acronym represents Non-SQL databases. The argument is rather complex and
there is no real cut-and-dried answer. To keep things simple, just think of NoSQL as
a class of non-relational database management systems that do not fall within the
spectrum of RDBM systems that are queried using SQL.
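To illustrate what schema-free, non-SQL querying looks like, here's a toy document store sketched with plain Python dicts. The records and the `find` helper are hypothetical stand-ins; a real NoSQL database such as MongoDB adds indexing, persistence, sharding, and a far richer query language.

```python
# Schema-free documents: no two records need the same fields, and
# nothing complains about the mismatch.
docs = [
    {"_id": 1, "name": "Ann", "tags": ["vip"], "address": {"city": "Hoboken"}},
    {"_id": 2, "name": "Bo", "address": {"city": "Newark"}},        # no tags
    {"_id": 3, "name": "Cy", "tags": ["new"], "note": "free text"},  # no address
]

def find(collection, **criteria):
    """Return documents whose fields equal the given criteria; a
    missing field simply fails to match instead of raising a schema
    error, which is how document stores tolerate ragged data."""
    def matches(doc):
        return all(doc.get(key) == value for key, value in criteria.items())
    return [d for d in collection if matches(d)]

print(find(docs, name="Bo"))                    # one matching document
print(find(docs, address={"city": "Hoboken"}))  # match on a nested document
```

The point of the sketch is the absence of rows and columns: each document carries its own structure, and the query logic copes with whatever shape it finds.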
Data Engineering in Action — A Case Study
A Fortune 100 telecommunications company had large datasets that resided in separate
data silos — data repositories that are disconnected and isolated from other data
storage systems used across the organization. With the goal of deriving data insights
that lead to revenue increases, the company decided to connect all of its data silos, and
then integrate that shared source with other contextual, external, non-enterprise data
sources as well.
Identifying the business challenge
The company was stocked to the gills with all the traditional enterprise systems —
ERP, ECM, CRM, you name it. Slowly, over many years, these systems grew and
segregated into separate information silos — check out Figure 2-3 to see what I mean.
Because of the isolated structure of their data systems, otherwise useful data was lost
and buried deep within a mess of separate, siloed storage systems. Even if the company
knew what data they had, it would be like pulling teeth to access, integrate, and utilize
it. The company rightfully believed that this restriction was limiting their business growth.
Figure 2-3: Different data silos being joined together by a common join point.
To optimize their sales and marketing return on investments, the company wanted to
integrate external, open datasets and relevant social data sources that would provide
deeper insights about its current and potential customers. But to build such a 360
degree view of the target market and customer base, the company needed to develop a
pretty sophisticated platform across which the data could be integrated, mined, and analyzed.
The company had the following three goals in mind for the project:
Manage and extract value from disparate, isolated datasets.
Take advantage of information from external, non-enterprise, or social data sources
to provide new, exciting, and useful services that create value.
Identify specific trends and issues in competitor activity, product offerings,
industrial customer segments, and sales team member profiles.
Solving business problems with data engineering
To meet the company’s goals, data engineers moved the company’s datasets to Hadoop
clusters. One cluster hosted the sales data, another hosted the human resources data,
and yet another hosted the talent management data. Data engineers then modeled the
data using the linked data format — a format that facilitates a joining of the different
datasets in the Hadoop clusters.
After this big data platform architecture was put into place, queries that would have
traditionally taken several hours to perform could be performed in a matter of minutes.
New queries were generated after the platform was built, and these queries also
returned efficient results within a few minutes’ time.
Boasting about benefits
The following are some of the benefits that the telecommunications company now
enjoys as a result of its new big data platform:
Ease of scaling: Scaling is much easier and cheaper using Hadoop than it was with
the old system. Instead of increasing capital and operating expenditures by buying
more of the latest generation of expensive computers, servers, and memory
capacity, the company opted to grow wider instead. They were able to purchase
more hardware and add new commodity servers in a matter of hours.
Performance: With their distributed processing and storage capabilities, the
Hadoop clusters deliver insights faster and produce more data insight for less cost.
High availability and reliability: The company has found that the Hadoop
platform is providing them with data protection and high availability while the
clusters grow in size. Additionally, the Hadoop clusters have increased system
reliability because of their automatic failover configuration — a configuration that
facilitates an automatic switch to redundant, backup data handling systems in
instances where the primary system might fail.
Applying Data Science to Business and Industry
In This Chapter
Seeing the benefits of business-centric data science
Knowing business intelligence from business-centric data science
Finding the expert to call when you want the job done right
Seeing how a real-world business put data science to good use
To the nerds and geeks out there, data science is interesting in its own right, but to most
people, it’s interesting only because of the benefits it can generate. Most business
managers and organizational leaders couldn’t care less about coding and complex
statistical algorithms. They are, on the other hand, extremely interested in finding new
ways to increase business profits by increasing sales rates and decreasing
inefficiencies. In this chapter, I introduce the concept of business-centric data science,
discuss how it differs from traditional business intelligence, and talk about how you
can use data-derived business insights to increase your business’ bottom line.
Incorporating Data-Driven Insights into the Business
The modern business world is absolutely deluged with data. That’s because every line
of business, every electronic system, every desktop computer, every laptop, every
company-owned cellphone, and every employee is continually creating new business-related data as a natural and organic output of their work. This data is structured or
unstructured, some of it is big and some of it is small, fast or slow; maybe it’s tabular
data, or video data, or spatial data, or data that no one has come up with a name for yet.
But while the datasets produced come in many varieties and variations, the
challenge is always the same: to extract data insights that add value to the
organization when acted upon. In the following sections, I walk you through the
challenges involved in deriving value from actionable insights that are generated from
raw business data.
Benefiting from business-centric data science
Business is complex. Data science is complex. At times, it’s easy to get so caught up
looking at the trees that you forget to look for a way out of the forest. That’s why, in all
areas of business, it’s extremely important to stay focused on the end goal. Ultimately,
no matter what line of business you’re in, true north is always the same — business
profit growth. Whether you achieve that by creating greater efficiencies or by
increasing sales rates and customer loyalty, the end goal is to create a more stable, solid
profit-growth rate for your business. The following is a list of some of the ways that
you can use business-centric data science and business intelligence to help increase business profits:
Decrease financial risks. A business-centric data scientist can decrease financial
risk in an e-commerce business by using time series anomaly detection methods for
real-time fraud detection — to decrease Card-Not-Present fraud and to decrease the
incidence of account takeovers, to take two examples.
Increase the efficiencies of systems and processes. This is a business systems
optimization function that’s performed by both the business-centric data scientist
and the business analyst. Both use analytics to optimize business processes,
structures, and systems, but their methods and data sources differ. The end goal
here should be to decrease needless resource expenditures and to increase return on
investment for justified expenditures.
Increase sales rates. To increase sales rates for your offerings, you can employ a
business-centric data scientist to help you find the best ways to upsell and cross-sell, increase customer loyalty, increase conversions in each layer of the funnel, and
exact-target your advertising and discounts. It’s likely that your business is already
employing many of these tactics, but a business-centric data scientist can look at all
data related to the business and, from that, derive insights that supercharge these tactics.
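The fraud-detection bullet above mentions time series anomaly detection; here's a minimal sketch of one such method, a rolling z-score test, run over hypothetical card charges (the window size and threshold are illustrative choices, not a recommendation):

```python
from statistics import mean, stdev

def zscore_anomalies(amounts, window=5, threshold=3.0):
    """Flag the index of any value sitting more than `threshold`
    standard deviations from the mean of the preceding `window`
    values; a crude but serviceable real-time fraud screen."""
    flagged = []
    for i in range(window, len(amounts)):
        history = amounts[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(amounts[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged

# Ordinary card charges, then one suspicious spike (all hypothetical):
charges = [20.0, 22.5, 19.0, 21.0, 20.5, 23.0, 950.0, 21.5]
print(zscore_anomalies(charges))  # flags index 6, the $950 charge
```

Because each new charge is scored against only a short trailing window, the check runs as the transactions stream in, which is what makes it usable for real-time screening.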
Deploying analytics and data wrangling to convert
raw data into actionable insights
Turning your raw data into actionable insights is the first step in the progression from
the data you’ve collected to something that actually benefits you. Business-centric data
scientists use data analytics to generate insights from raw data.
Identifying the types of analytics
Listed below, in order of increasing complexity, are the four types of data analytics
you’ll most likely encounter:
Descriptive analytics: This type of analytics answers the question, “What
happened?” Descriptive analytics are based on historical and current data. A
business analyst or a business-centric data scientist bases modern-day business
intelligence on descriptive analytics.
Diagnostic analytics: You use this type of analytics to find answers to the
question, “Why did this particular something happen?” or “What went wrong?”
Diagnostic analytics are useful for deducing and inferring the success or failure of
sub-components of any data-driven initiative.
Predictive analytics: Although this type of analytics is based on historical and
current data, predictive analytics go one step further than descriptive analytics.
Predictive analytics involve complex model-building and analysis in order to
predict a future event or trend. In a business context, these analyses would be
performed by the business-centric data scientist.
Prescriptive analytics: This type of analytics aims to optimize processes,
structures, and systems through informed action that’s based on predictive analytics
— essentially telling you what you should do based on an informed estimation of
what will happen. Both business analysts and business-centric data scientists can
generate prescriptive analytics, but their methods and data sources differ.
Ideally, a business should engage in all four types of data analytics, but
prescriptive analytics is the most direct and effective means by which to generate
value from data insights.
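To see how the first and third categories differ in practice, here's a small sketch with hypothetical monthly sales figures: descriptive analytics summarizes what happened, while predictive analytics fits a least-squares trend line and projects it one step forward.

```python
from statistics import mean

# Monthly sales (hypothetical figures).
sales = [100, 110, 125, 135, 150, 160]

# Descriptive analytics: what happened? Plain summaries of history.
average = mean(sales)          # 130.0
growth = sales[-1] - sales[0]  # 60

# Predictive analytics: project the trend with a least-squares line.
xs = list(range(len(sales)))
x_bar, y_bar = mean(xs), mean(sales)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, sales)) / \
        sum((x - x_bar) ** 2 for x in xs)
intercept = y_bar - slope * x_bar
next_month = slope * len(sales) + intercept  # forecast for month 7
print(average, round(next_month, 1))
```

Diagnostic and prescriptive analytics build on these same numbers: diagnostics would ask why the trend looks the way it does, and prescriptive work would recommend an action based on the forecast.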
Identifying common challenges in analytics
Analytics commonly pose at least two challenges in the business enterprise. First,
organizations often have a very hard time finding new hires with specific skill sets that
include analytics. Second, even skilled analysts often have difficulty communicating
complex insights in a way that’s understandable to management decision makers.
To overcome these challenges, the organization must create and nurture a culture that
values and accepts analytics products. The business must work to educate all levels of
the organization, so that management has a basic concept of analytics and the success
that can be achieved by implementing them. Conversely, business-centric data
scientists must have a very solid working knowledge about business in general and, in
particular, a solid understanding of the business at hand. A strong business knowledge
is one of the three main requirements of any business-centric data scientist — the other
two being a strong coding acumen and strong quantitative analysis skills via math and statistics.
Wrangling raw data into actionable insights
Data wrangling is another important portion of the work that’s required to convert data
to insights. To build analytics from raw data, you’ll almost always need to use data
wrangling — the processes and procedures that you use to clean and convert data from
one format and structure to another so that the data is accurate and in the format
analytics tools and scripts require for consumption. The following list highlights a few
of the practices and issues I consider most relevant to data wrangling:
Data extraction: The business-centric data scientist must first identify what
datasets are relevant to the problem at hand, and then extract sufficient quantities of
the data that’s required to solve the problem. (This extraction process is commonly
referred to as data mining.)
Data munging: Data munging involves cleaning the raw data extracted through
data mining, then converting it into a format that allows for a more convenient
consumption of the data. (Mung began life as a destructive process, where you
would convert something recognizable into something that was unrecognizable,
thus the phrase Mash Until No Good, or MUNG.)
Data governance: Data governance standards are standards that are used as a
quality control measure to ensure that manual and automated data sources conform
to the data standards of the model at hand. Data governance standards must be
applied so that the data is at the right granularity when it's stored and made ready for use.
Granularity is a measure of a dataset’s level of detail. Data granularity is
determined by the relative size of the sub-groupings into which the data is divided.
Data architecture: IT architecture is key. If your data is isolated in separate, fixed
repositories — those infamous data silos everybody complains about — then it’s
available to only a few people within a particular line of business. Siloed data
structures result in scenarios where a majority of an organization’s data is simply
unavailable for use by the organization at large. (Needless to say, siloed data
structures are incredibly wasteful and inefficient.)
If your goal is to derive the most value and insight from your organization’s
business data, then you should ensure that the data is stored in a central data
warehouse and not in separate silos. (I discuss data warehouses in the “Looking at
technologies and skillsets that are useful in business intelligence” section, later in this chapter.)
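The extraction, munging, and governance steps above can be sketched in a few lines of Python. The records, field names, and the "amount is mandatory" governance rule here are all hypothetical; the point is the shape of the work, not a particular schema.

```python
from datetime import datetime

# Raw extracted records: inconsistent case, stray whitespace, mixed
# date formats, and a missing amount (all values hypothetical).
raw = [
    {"customer": "  ACME Corp ", "date": "2015-03-01", "amount": "250.00"},
    {"customer": "acme corp",    "date": "03/02/2015", "amount": "125.50"},
    {"customer": "Widget Co",    "date": "2015-03-02", "amount": ""},
]

def parse_date(text):
    """Standardize the mixed date formats found in the raw extract."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            pass
    return None

def munge(records):
    """Normalize names, standardize dates, and drop rows that lack
    fields the downstream analytics scripts require."""
    clean = []
    for r in records:
        if not r["amount"]:
            continue  # governance rule: amount is mandatory
        clean.append({
            "customer": " ".join(r["customer"].split()).title(),
            "date": parse_date(r["date"]),
            "amount": float(r["amount"]),
        })
    return clean

print(munge(raw))  # two clean rows; the incomplete record is dropped
```

Notice how the two "ACME" spellings collapse into one customer after cleaning; that kind of normalization is exactly what lets later analyses join records that a raw extract would treat as unrelated.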
Taking action on business insights
After wrangling your data down to actionable insights, the second step in the
progression from raw data to value-added is to take decisive actions based on those
insights. In business, the only justifiable reason to spend time deriving insights
from raw data is so that the resulting actions lead to an increase in business profits. Failure
to take action on data-driven insights results in a complete and total loss of the
resources that were spent deriving them, at no benefit whatsoever to the organization.
An organization absolutely must be ready and equipped to change, evolve, and
progress when new business insights become available.
What I like to call the insight-to-action arc — the process of taking decisive
actions based on data insights — should be formalized in a written action plan and
then rigorously exercised to affect continuous and iterative improvements to your
organization — iterative because these improvements involve a successive round
of deployments and testing to optimize all areas of business based on actionable
insights that are generated from organizational data. This action plan is not
something that should just be tacked loosely on the side of your organization, and
then never looked at again.
To best prepare your organization to take action on insights derived from business data,
make sure you have the following people and systems in place and ready to go:
Right data, right time, right place: This part isn’t complicated: You just have to
have the right data, collected and made available at the right places and the right
times, when it’s needed the most.
Business-centric data scientists and business analysts: Have business-centric
data scientists and business analysts in place and ready to tackle problems when they arise.
Educated and enthusiastic management: Educate and encourage your
organization’s leaders so that you have a management team that understands,
values, and makes effective use of business insights gleaned from analytics.
Informed and enthusiastic organizational culture: If the culture of your
organization reflects a naivety or lack of understanding about the value of data,
begin fostering a corporate culture that values data insights and analytics. Consider
using training, workshops, and events.
Written procedures with clearly designated chains of responsibility: Have
documented processes in place and interwoven into your organization so that when
the time comes, the organization is prepared to respond. New insights are generated
all the time, but growth is achieved only through iterative adjustments and actions
based on constantly evolving data insights. The organization needs to have clearly
defined procedures ready to accommodate these changes as necessary.
Advancement in technology: Your enterprise absolutely must keep up-to-date
with rapidly changing technological developments. The analytics space is changing
fast — very fast! There are many ways to keep up. If you keep in-house experts,
you can assign them the ongoing responsibility of monitoring industry
advancements and then suggesting changes that are needed to keep your
organization current. An alternative way to keep current is to purchase cloud-based
Software-as-a-Service (SaaS) subscriptions and then rely on SaaS platform
upgrades to keep you up to speed on the most innovative and cutting-edge technologies.
When relying on SaaS platforms to keep you current, you’re taking a leap of
faith that the vendor is working hard to keep on top of industry advancements and
not just letting things slide. Ensure that the vendor has a long-standing history of
maintaining up-to-date and reliable services over time. Although you could try to
follow the industry yourself and then check back with the vendor on updates as
new technologies emerge, that is putting a lot of onus on you. Unless you’re a data
technology expert with a lot of free time to research and inquire about
advancements in industry standards, it’s better to just choose a reliable vendor that
has an excellent reputation for delivering up-to-date, cutting-edge technologies to its customers.
Distinguishing Business Intelligence and Business-Centric Data Science
Business-centric data scientists and business analysts who do business intelligence are
like cousins. They both use data to work towards the same business goal, but their
approach, technology, and function differ by measurable degrees. In the following
sections, I define, compare, and distinguish between business intelligence and
business-centric data science.
Defining business intelligence
The purpose of business intelligence is to convert raw data into business insights that
business leaders and managers can use to make data-informed decisions. Business
analysts use business intelligence tools to create decision-support products for business
management decision making. If you want to build decision-support dashboards,
visualizations, or reports from complete medium-sized sets of structured business data,
then you can use business intelligence tools and methods to help you.
Business intelligence (BI) is composed of
Mostly internal datasets: By internal, I mean business data and information that’s
supplied by your organization’s own managers and stakeholders.
Tools, technologies, and skillsets: Examples here would include online analytical
processing, ETL (extracting, transforming, and loading data from one database into
another), data warehousing, and information technology for business applications.
Looking at the kinds of data used in business intelligence
Insights that are generated in business intelligence (BI) are derived from standard-sized
sets of structured business data. BI solutions are mostly built off of transactional data
— data that’s generated during the course of a transaction event, like data generated
during a sale, or during a money transfer between bank accounts, for example.
Transactional data is a natural byproduct of business activities that occur across an
organization, and all sorts of inferences can be derived from it. You can use BI to
derive the following types of insights:
Customer service data: Possibly answering the question, “What areas of business
are causing the largest customer wait times?”
Sales and marketing data: Possibly answering the question, “Which marketing
tactics are most effective, and why?”
Operational data: Possibly answering the questions, “How efficiently is the help
desk operating? Are there any immediate actions that must be taken to remedy a problem?”
Employee performance data: Possibly answering the questions, “Which
employees are the most productive? Which are the least?”
Looking at technologies and skillsets that are useful in business intelligence
To streamline BI functions, make sure that your data is organized for optimal ease of
access and presentation. You can use multidimensional databases to help you. Unlike
relational, or flat, databases, multidimensional databases organize data into cubes that
are stored as multi-dimensional arrays. If you want your BI staff to be able to work
with source data as quickly and easily as possible, you can use multidimensional
databases to store data in a cube, instead of storing the data across several relational
databases that may or may not be compatible with one another.
This cubic data structure enables Online Analytical Processing (OLAP) — a
technology through which you can quickly and easily access and use your data for all
sorts of different operations and analyses. To illustrate the concept of OLAP, imagine
that you have a cube of sales data that has three dimensions — time, region, and
business unit. You can slice the data to view only one rectangle — to view one sales
region, for instance. You can dice the data to view a smaller cube made up of some
subset of time, region(s), and business unit(s). You can drill down or up to view either
highly detailed or highly summarized data, respectively. And you can roll up, or total,
the numbers along one dimension — to total business unit numbers, for example, or to
view sales across time and region only.
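The slice and roll-up operations just described can be sketched against a toy cube held in a Python dict keyed by (time, region, business unit). The dimensions match the example above, the figures are hypothetical, and a real OLAP engine would precompute and index these aggregates rather than scan them.

```python
# A tiny sales cube keyed by (time, region, unit); figures hypothetical.
cube = {
    ("Q1", "East", "Retail"): 100, ("Q1", "East", "Online"): 40,
    ("Q1", "West", "Retail"): 80,  ("Q1", "West", "Online"): 60,
    ("Q2", "East", "Retail"): 120, ("Q2", "East", "Online"): 50,
    ("Q2", "West", "Retail"): 90,  ("Q2", "West", "Online"): 70,
}

def slice_cube(cube, region):
    """Slice: fix one dimension (region here) to a single value."""
    return {key: value for key, value in cube.items() if key[1] == region}

def roll_up(cube, dim):
    """Roll up: total the measure along one dimension
    (0 = time, 1 = region, 2 = business unit)."""
    totals = {}
    for key, value in cube.items():
        reduced = key[:dim] + key[dim + 1:]
        totals[reduced] = totals.get(reduced, 0) + value
    return totals

east = slice_cube(cube, "East")       # the four East-region cells
by_time_region = roll_up(cube, 2)     # units totaled away
print(by_time_region[("Q1", "East")]) # 140 (Retail 100 + Online 40)
```

Dicing and drilling work the same way: dicing filters several dimensions at once, and drilling down just means keeping more of the key instead of rolling it up.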
OLAP is just one type of data warehousing system — a centralized data repository that
you can use to store and access your data. A more traditional data warehouse system
commonly employed in business intelligence solutions is a data mart — a data storage
system that you can use to store one particular focus area of data, belonging to only one
line of business in the enterprise. Extract, transform, and load (ETL) is the process that
you’d use to extract data, transform it, and load it into your database or data warehouse.
Business analysts generally have strong backgrounds and training in business and
information technology. As a discipline, BI relies on traditional IT technologies and skillsets.
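The extract, transform, load sequence can be sketched end-to-end with Python's built-in sqlite3 module. The source and warehouse schemas here are hypothetical stand-ins for a line-of-business system and a data warehouse.

```python
import sqlite3

# Extract: pull rows from a source system (an in-memory table here,
# standing in for a hypothetical line-of-business database).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (customer TEXT, amount_cents INTEGER)")
source.executemany("INSERT INTO orders VALUES (?, ?)",
                   [("ann", 2500), ("bo", 1050), ("ann", 4000)])
rows = source.execute("SELECT customer, amount_cents FROM orders").fetchall()

# Transform: reshape and clean the rows to match the warehouse schema
# (normalize names, convert cents to dollars).
transformed = [(name.title(), cents / 100.0) for name, cents in rows]

# Load: write the cleaned rows into the warehouse table and query it.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?)", transformed)
total = warehouse.execute(
    "SELECT SUM(amount) FROM sales WHERE customer = 'Ann'").fetchone()[0]
print(total)  # 65.0
```

Production ETL tools add scheduling, incremental loads, and error handling, but the three-phase shape is the same.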
Defining business-centric data science
Within the business enterprise, data science serves the same purpose that business
intelligence does — to convert raw data into business insights that business leaders and
managers can use to make data-informed decisions. If you have large sets of structured
and unstructured data sources that may or may not be complete and you want to
convert those sources into valuable insights for decision support across the enterprise,
call on a data scientist. Business-centric data science is multi-disciplinary and
incorporates the following elements:
Quantitative analysis: Can be in the form of mathematical modeling, multivariate
statistical analysis, forecasting, and/or simulations.
The term multivariate refers to more than one variable. A multivariate
statistical analysis is a simultaneous statistical analysis of more than one variable at a time.
Programming skills: You need the necessary programming skills to both analyze
raw data and make this data accessible to business users.
Business knowledge: You need knowledge of the business and its environment so
that you can better understand the relevancy of your findings.
Data science is a pioneering discipline. Data scientists often employ the scientific
method for data exploration, hypotheses formation, and hypothesis testing (through
simulation and statistical modeling). Business-centric data scientists generate valuable
data insights, oftentimes by exploring patterns and anomalies in business data. Data
science in a business context is commonly composed of
Internal and external datasets: Data science is flexible. You can create business
data mash-ups from internal and external sources of structured and unstructured
data fairly easily. (A data mash-up is a combination of two or more data sources that
are then analyzed together in order to provide users with a more complete view of
the situation at hand.)
Tools, technologies, and skillsets: Examples here could involve using cloud-based
platforms, statistical and mathematical programming, machine learning, data
analysis using Python and R, and advanced data visualization.
Like business analysts, business-centric data scientists produce decision-support
products for business managers and organizational leaders to use. These products
include analytics dashboards and data visualizations, but generally not tabular data
reports and tables.
Looking at the kinds of data that are useful in business-centric data science
You can use data science to derive business insights from standard-sized sets of
structured business data (just like BI) or from structured, semi-structured, and
unstructured sets of big data. Data science solutions are not confined to transactional
data that sits in a relational database; you can use data science to create valuable
insights from all available data sources. These data sources include
Transactional business data: A tried-and-true data source, transactional business
data is the type of structured data used in traditional BI and includes management
data, customer service data, sales and marketing data, operational data, and
employee performance data.
Social data related to the brand or business: A more recent phenomenon, the
data covered by this rubric includes the unstructured data generated through emails,
instant messaging, and social networks such as Twitter, Facebook, LinkedIn,
Pinterest, and Instagram.
Machine data from business operations: Machines automatically generate this
unstructured data; examples include SCADA data and sensor data.
The acronym SCADA refers to Supervisory Control and Data Acquisition.
SCADA systems are used to control remotely operating mechanical systems and
equipment. They generate data that is used to monitor the operations of machines.
Audio, video, image, and PDF file data: These well-established formats are all
sources of unstructured data.
Looking at the technologies and skillsets that are useful in business-centric data science
Since the products of data science are often generated from big data, cloud-based data
platform solutions are common in the field. Data that’s used in data science is often
derived from data-engineered big data solutions, like Hadoop, MapReduce, and
Massively Parallel Processing. (For more on these technologies, check out Chapter 2.)
Data scientists are innovative forward-thinkers who must often think outside the box
in order to find solutions to the problems they tackle. Many data scientists tend toward
open-source solutions when available. From a cost perspective, this approach benefits
the organizations that employ these scientists.
Business-centric data scientists might use machine learning techniques to find patterns
in (and derive insights from) huge datasets that are related to a line of business or the
business at large. They’re skilled in math, statistics, and programming, and they
sometimes use these skills to generate predictive models. They generally know how to
program in Python or R. Most of them know how to use SQL to query relevant data
from structured databases. They are usually skilled at communicating data insights to
end users — in business-centric data science, end users are business managers and
organizational leaders. Data scientists must be skillful at using written, verbal, and visual
means to communicate valuable data insights.
Although business-centric data scientists serve a decision-support role in the
enterprise, they’re different from the business analyst in that they usually have
strong academic and professional backgrounds in math, science, engineering, or
all of the above. This said, business-centric data scientists also have a strong
substantive knowledge of business management.
Summarizing the main differences between BI and
business-centric data science
The similarities between BI and business-centric data science are glaringly obvious; it’s
the differences that most people have a hard time discerning. The purpose of both BI
and business-centric data science is to convert raw data into actionable insights that
managers and leaders can use for support when making business decisions.
BI and business-centric data science differ with respect to approach. Although BI can
use forward-looking methods like forecasting, these methods are generated by making
simple inferences from historical or current data. In this way, BI extrapolates from the
past and present to infer predictions about the future. It looks to present or past data for
relevant information to help monitor business operations and to aid managers in short- to medium-term decision making.
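The kind of simple inference BI makes can be shown in a couple of lines. In this toy sketch (the sales figures are invented), next month's sales are forecast as the average of the last three months, which is pure extrapolation from historical data:

```python
# Hypothetical sales for the last six months, in thousands of dollars.
monthly_sales = [120, 125, 123, 130, 128, 135]

# BI-style simple inference: forecast next month as the mean of the
# three most recent months.
forecast = sum(monthly_sales[-3:]) / 3
print(f"next-month forecast: {forecast:.1f}")  # 131.0
```

Contrast this with the data science approach described next, where a statistical model is fit to large volumes of data rather than a single rolling average.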
In contrast, business-centric data science practitioners seek to make new discoveries by
using advanced mathematical or statistical methods to analyze and generate predictions
from vast amounts of business data. These predictive insights are generally relevant to
the long-term future of the business. The business-centric data scientist attempts to
discover new paradigms and new ways of looking at the data to provide a new
perspective on the organization, its operations, and its relations with customers,
suppliers, and competitors. Therefore, the business-centric data scientist must know the
business and its environment. She must have business knowledge to determine how a
discovery is relevant to a line of business or to the organization at large.
Other prime differences between BI and business-centric data science are
Data sources: BI uses only structured data from relational databases, whereas
business-centric data science may use structured data and unstructured data, like
that generated by machines or in social media conversations.
Outputs: BI products include reports, data tables, and decision-support dashboards,
whereas business-centric data science products are generally dashboard analytics
or other types of advanced data visualization, and only rarely tabular data reports.
Data scientists generally communicate their findings through words or data
visualizations, but not tables and reports. That's because the source datasets
data scientists work from are generally more complex than a typical business
manager would be able to understand.
Technology: BI runs off of relational databases, data warehouses, OLAP, and ETL
technologies, whereas business-centric data science often runs off of data from
data-engineered systems that use Hadoop, MapReduce, or Massively Parallel
Processing.
Expertise: BI relies heavily on IT and business technology expertise, whereas
business-centric data science relies on expertise in statistics, math, programming,
and substantive knowledge of the business.
Knowing Who to Call to Get the Job Done
Since most business managers don't know how to do advanced data work themselves,
it's definitely beneficial to at least know what types of problems are best suited to a
business analyst and which problems should be handled by a data scientist instead.
If you want to use enterprise data insights to streamline your business so that its
processes function more efficiently and effectively, then bring in a business analyst.
Organizations employ business analysts so that they have someone to cover the
responsibilities associated with requirements management, business process analysis,
and improvement planning for business processes, IT systems, organizational
structures, and business strategies. Business analysts look at enterprise data and
identify what processes need improvement. They then create written specifications that
detail exactly what changes should be made for improved results. They produce
interactive dashboards and tabular data reports to supplement their recommendations
and to help business managers better understand what is happening in the business.
Ultimately, business analysts use business data to further the organization’s strategic
goals and to support them in providing guidance on any procedural improvements that
need to be made.
In contrast, if you want to obtain answers to very specific questions on your data, and
you can obtain those answers only via advanced analysis and modeling of business
data, then bring in a business-centric data scientist. Many times, a data scientist may
support the work of a business analyst. In such cases, the data scientist might be asked
to analyze very specific data-related problems and then report the results back to the
business analyst to support him in making recommendations. Business analysts can use
the findings of business-centric data scientists to help them determine how to best
fulfill a requirement or build a business solution.
Exploring Data Science in Business: A
Data-Driven Business Success Story
Southeast Telecommunications Company was losing many of its customers to customer
churn — the customers were simply moving to other telecom service providers.
Because it’s significantly more expensive to acquire new customers than it is to retain
existing customers, Southeast’s management wanted to find a way to decrease their
churn rates. So, Southeast Telecommunications engaged Analytic Solutions, Inc. (ASI),
a business-analysis company. ASI interviewed Southeast's employees: regional
managers, supervisors, front-line staff, and help-desk personnel. After
consulting with these personnel, ASI collected business data that was relevant to customer churn.
ASI began examining several years’ worth of Southeast’s customer data to develop a
better understanding of customer behavior and why some people left after years of
loyalty, while others continued to stay on. The customer datasets contained records for
the number of times a customer had contacted Southeast’s help desk, the number of
customer complaints, and the number of minutes and megabytes of data each customer
used per month. ASI also had demographic and personal data (credit score, age, and
region, for example) that was contextually relevant to the evaluation.
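Indicator calculations like the ones ASI derived from this data reduce to simple aggregation over customer records. Here is a minimal sketch using an invented five-customer dataset (the field names and values are hypothetical, not Southeast's actual data):

```python
# Invented customer records: whether the customer left, how many help-desk
# calls they placed in the nine months before the observation window closed,
# and whether their usage dropped sharply in the prior six months.
customers = [
    {"id": 1, "left": True,  "helpdesk_calls_9mo": 3, "usage_drop_6mo": True},
    {"id": 2, "left": True,  "helpdesk_calls_9mo": 0, "usage_drop_6mo": True},
    {"id": 3, "left": True,  "helpdesk_calls_9mo": 2, "usage_drop_6mo": False},
    {"id": 4, "left": False, "helpdesk_calls_9mo": 1, "usage_drop_6mo": False},
    {"id": 5, "left": False, "helpdesk_calls_9mo": 0, "usage_drop_6mo": False},
]

churned = [c for c in customers if c["left"]]

# Share of churned customers who placed two or more help-desk calls.
two_plus_calls = sum(c["helpdesk_calls_9mo"] >= 2 for c in churned) / len(churned)

# Share of churned customers whose usage dropped before they switched.
usage_drop = sum(c["usage_drop_6mo"] for c in churned) / len(churned)

print(f"{two_plus_calls:.0%} called 2+ times; {usage_drop:.0%} showed usage drops")
```

At Southeast's scale the same aggregation would run over years of records rather than five rows, but the logic of the indicators is identical.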
By looking at this customer data, ASI discovered the following insights about the
one-year interval before customers switched service providers:
Eighty-four percent of customers who left Southeast had placed two or more calls
into its help desk in the nine months before switching providers.
Sixty percent of customers who switched showed drastic usage drops in the six
months prior to switching.