Programming Skills for Data Science: Start Writing Code to Wrangle, Analyze, and Visualize Data with R, First Edition
Michael Freeman & Joel Ross
The Foundational Hands-On Skills You Need to Dive into Data Science
“Freeman and Ross have created the definitive resource for new and aspiring data scientists to learn foundational programming skills.”
–From the foreword by Jared Lander, series editor
Using data science techniques, you can transform raw data into actionable insights for domains ranging from urban planning to precision medicine. Programming Skills for Data Science brings together all the foundational skills you need to get started, even if you have no programming or data science experience.
Leading instructors Michael Freeman and Joel Ross guide you through installing and configuring the tools you need to solve professional-level data science problems, including the widely used R language and Git version-control system. They explain how to wrangle your data into a form where it can be easily used, analyzed, and visualized so others can see the patterns you've uncovered. Step by step, you'll master powerful R programming techniques and troubleshooting skills for probing data in new ways, and at larger scales.
Freeman and Ross teach through practical examples and exercises that can be combined into complete data science projects. Everything's focused on real-world application, so you can quickly start analyzing your own data and getting answers you can act upon. Learn to
- Install your complete data science environment, including R and RStudio
- Manage projects efficiently, from version tracking to documentation
- Host, manage, and collaborate on data science projects with GitHub
- Master R language fundamentals: syntax, programming concepts, and data structures
- Load, format, explore, and restructure data for successful analysis
- Interact with databases and web APIs
- Master key principles for visualizing data accurately and intuitively
- Produce engaging, interactive visualizations with ggplot and other R packages
- Transform analyses into sharable documents and sites with R Markdown
- Create interactive web data science applications with Shiny
- Collaborate smoothly as part of a data science team
Register your book for convenient access to downloads, updates, and/or corrections as they become available. See inside book for details.
12 Reshaping Data with tidyr

One of the most common data wrangling challenges is adjusting how exactly rows and columns are used to represent your data. Structuring (or restructuring) data frames to have the desired shape can be the most difficult part of creating a visualization, running a statistical model, or implementing a machine learning algorithm. This chapter describes how you can use the tidyr ("tidy-er") package to effectively transform your data into an appropriate shape for analysis and visualization.

12.1 What Is "Tidy" Data?

When wrangling data into a data frame for your analysis, you need to decide on the desired structure of that data frame. You need to determine what each row and column will represent, so that you can consistently and clearly manipulate that data (e.g., you know what you will be selecting and what you will be filtering). The tidyr package is used to structure and work with data frames that follow three principles of tidy data (as described by the package's documentation1):

- Each variable is in a column.
- Each observation is a row.
- Each value is a cell.

1tidyr: https://tidyr.tidyverse.org

Indeed, these principles lead to the data structure described in Chapter 9: rows represent observations, and columns represent features of that data. However, asking different questions of a data set may involve different interpretations of what constitutes an "observation." For example, Section 11.6 described working with the flights data set from the nycflights13 package, in which each observation is a flight. However, the analysis made comparisons between airlines, airports, and months. Each question worked with a different unit of analysis, implying a different data structure (e.g., what should be represented by each row).
While the example somewhat changed the nature of these rows by grouping and joining different data sets, having a more specific data structure in which each row represented a specific unit of analysis (e.g., an airline or a month) may have made much of the wrangling and analysis more straightforward. To use multiple different definitions of an "observation" when investigating your data, you will need to create multiple representations (i.e., data frames) of the same data set—each with its own configuration of rows and columns.

To demonstrate how you may need to adjust what each observation represents, consider the (fabricated) data set of music concert prices shown in Table 12.1. In this table, each observation (row) represents a city, with each city having features (columns) of the ticket price for a specific band.

Table 12.1 A "wide" data set of concert ticket prices in different cities. Each observation (i.e., unit of analysis) is a city, and each feature is the concert ticket price for a given band.

city         greensky_bluegrass  trampled_by_turtles  billy_strings  fruition
Seattle      40                  30                   15             30
Portland     40                  20                   25             50
Denver       20                  40                   25             40
Minneapolis  30                  100                  15             20

But consider if you wanted to analyze the ticket price across all concerts. You could not do this easily with the data in its current form, since the data is organized by city (not by concert)! You would prefer instead that all of the prices were listed in a single column, as a feature of a row representing a single concert (a city-and-band combination), as in Table 12.2.

Table 12.2 A "long" data set of concert ticket prices by city and band. Each observation (i.e., unit of analysis) is a city–band combination, and each has a single feature that is the ticket price.
city         band                 price
Seattle      greensky_bluegrass   40
Portland     greensky_bluegrass   40
Denver       greensky_bluegrass   20
Minneapolis  greensky_bluegrass   30
Seattle      trampled_by_turtles  30
Portland     trampled_by_turtles  20
Denver       trampled_by_turtles  40
Minneapolis  trampled_by_turtles  100
Seattle      billy_strings        15
Portland     billy_strings        25
Denver       billy_strings        25
Minneapolis  billy_strings        15
Seattle      fruition             30
Portland     fruition             50
Denver       fruition             40
Minneapolis  fruition             20

Both Table 12.1 and Table 12.2 represent the same set of data—they both have prices for 16 different concerts. But by representing that data in terms of different observations, they may better support different analyses. These data tables are said to be in different orientations: the price data in Table 12.1 is often referred to as being in wide format (because it is spread wide across multiple columns), while the price data in Table 12.2 is in long format (because it is in one long column). Note that the long format table includes some duplicated data (the names of the cities and bands are repeated), which is part of why the data might instead be stored in wide format in the first place!

12.2 From Columns to Rows: gather()

Sometimes you may want to change the structure of your data—how your data is organized in terms of observations and features. To help you do so, the tidyr package provides elegant functions for transforming between orientations. For example, to move from wide format (Table 12.1) to long format (Table 12.2), you need to gather all of the prices into a single column. You can do this using the gather() function, which collects data values stored across multiple columns into a single new feature (e.g., "price" in Table 12.2), along with an additional new column representing which feature that value was gathered from (e.g., "band" in Table 12.2). In effect, it creates two columns representing key–value pairs of the feature and its value from the original data frame.
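To follow along, you need the wide-format data frame itself. Below is a minimal sketch that recreates Table 12.1 in R; the variable name band_data_wide matches the one used in the gather() example, but the construction shown here is an illustration rather than the book's own setup code.

```r
library(tidyr)

# Recreate the wide-format data from Table 12.1:
# one row per city, one price column per band
band_data_wide <- data.frame(
  city = c("Seattle", "Portland", "Denver", "Minneapolis"),
  greensky_bluegrass = c(40, 40, 20, 30),
  trampled_by_turtles = c(30, 20, 40, 100),
  billy_strings = c(15, 25, 25, 15),
  fruition = c(30, 50, 40, 20),
  stringsAsFactors = FALSE
)
```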
```r
# Reshape by gathering prices into a single feature
band_data_long <- gather(
  band_data_wide, # data frame to gather from
  key = band,     # name for new column listing the gathered features
  value = price,  # name for new column listing the gathered values
  -city           # columns to gather data from, as in dplyr's `select()`
)
```

The gather() function takes in a number of arguments, starting with the data frame to gather from. It then takes in a key argument giving a name for a column that will contain as values the column names the data was gathered from—for example, a new band column that will contain the values "greensky_bluegrass", "trampled_by_turtles", and so on. The third argument is value, the name for the column that will contain the gathered values—for example, price to contain the price numbers. Finally, the function takes in arguments representing which columns to gather data from, using syntax similar to using dplyr to select() those columns (in the preceding example, -city indicates that it should gather from all columns except city). Again, any columns provided as this final set of arguments will have their names listed in the key column, and their values listed in the value column. This process is illustrated in Figure 12.1. The gather() function's syntax can be hard to intuit and remember; try tracing where each value "moves" in the table and diagram.

[image: A diagram explaining the gather() function.]

Figure 12.1 The gather() function takes values from multiple columns (greensky_bluegrass, trampled_by_turtles, etc.) and gathers them into a (new) single column (price). In doing so, it also creates a new column (band) that stores the names of the columns that were gathered (i.e., the column name in which each value was stored prior to gathering). The original data frame has five columns and four rows; the gathered result has three columns (city, band, and price) and 16 rows.
Note that once data is in long format, you can continue to analyze an individual feature (e.g., a specific band) by filtering for that value. For example, filter(band_data_long, band == "greensky_bluegrass") would produce just the prices for a single band.

12.3 From Rows to Columns: spread()

It is also possible to transform a data table from long format into wide format—that is, to spread out the prices into multiple columns. Thus, while the gather() function collects multiple features into two columns, the spread() function creates multiple features from two existing columns. For example, you can take the long format data shown in Table 12.2 and spread it out so that each observation is a band, as in Table 12.3:

```r
# Reshape long data (Table 12.2), spreading prices out among multiple features
price_by_band <- spread(
  band_data_long, # data frame to spread from
  key = city,     # column indicating where to get new feature names
  value = price   # column indicating where to get new feature values
)
```

Table 12.3 A "wide" data set of concert ticket prices for a set of bands. Each observation (i.e., unit of analysis) is a band, and each feature is the ticket price in a given city.

band                 Denver  Minneapolis  Portland  Seattle
billy_strings        25      15           25        15
fruition             40      20           50        30
greensky_bluegrass   20      30           40        40
trampled_by_turtles  40      100          20        30

The spread() function takes arguments similar to those passed to the gather() function, but applies them in the opposite direction.
In this case, the key and value arguments indicate where to get the new column names and values, respectively. The spread() function will create a new column for each unique value in the provided key column, with values taken from the value feature. In the preceding example, the new column names (e.g., "Denver", "Minneapolis") were taken from the city feature in the long format table, and the values for those columns were taken from the price feature. This process is illustrated in Figure 12.2.

[image: A diagram explaining the spread() function.]

Figure 12.2 The spread() function spreads out a single column into multiple columns. It creates a new column for each unique value in the provided key column (city). The values in each new column will be populated with the provided value column (price). The original data frame has three columns (city, band, and price) and 16 rows; the spread result has five columns (band, Denver, Minneapolis, Portland, and Seattle) and 4 rows.

By combining gather() and spread(), you can effectively change the "shape" of your data and what concept is represented by an observation.

Tip: Before spreading or gathering your data, you will often need to unite multiple columns into a single column, or to separate a single column into multiple columns. The tidyr functions unite()a and separate()b provide a specific syntax for these common data preparation tasks.
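As a quick illustration of uniting and separating columns, consider the following sketch. The data frame and its column names here are invented for demonstration; only the unite() and separate() calls reflect the actual tidyr functions.

```r
library(tidyr)

# Hypothetical data: separate month and year columns
event_data <- data.frame(
  month = c("Jan", "Feb"),
  year = c(2018, 2018),
  attendance = c(100, 150)
)

# unite() combines the two columns into a single `date` column (e.g., "Jan-2018")
united <- unite(event_data, date, month, year, sep = "-")

# separate() splits that single column back into two (as character columns)
separated <- separate(united, date, into = c("month", "year"), sep = "-")
```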
a unite(): https://tidyr.tidyverse.org/reference/unite.html
b separate(): https://tidyr.tidyverse.org/reference/separate.html

12.4 tidyr in Action: Exploring Educational Statistics

This section uses a real data set to demonstrate how reshaping your data with tidyr is an integral part of the data exploration process. The data in this example was downloaded from the World Bank Data Explorer,2 a collection of hundreds of indicators (measures) of different economic and social development factors. In particular, this example considers educational indicators3 that capture a relevant signal of a country's level of (or investment in) education—for example, government expenditure on education, literacy rates, school enrollment rates, and dozens of other measures of educational attainment. The imperfections of this data set (unnecessary rows at the top of the .csv file, a substantial amount of missing data, long column names with special characters) are representative of the challenges involved in working with real data sets. All graphics in this section were built using the ggplot2 package, which is described in Chapter 16. The complete code for this analysis is also available online in the book's code repository.4

2World Bank Data Explorer: https://data.worldbank.org
3World Bank education: http://datatopics.worldbank.org/education
4tidyr in Action: https://github.com/programming-for-data-science/in-action/tree/master/tidyr

After having downloaded the data, you will need to load it into your R environment:

```r
# Load data, skipping the unnecessary first 4 rows
wb_data <- read.csv(
  "data/world_bank_data.csv",
  stringsAsFactors = FALSE,
  skip = 4
)
```

When you first load the data, each observation (row) represents an indicator for a country, with features (columns) that are the values of that indicator in a given year (see Figure 12.3). Notice that many values, particularly for earlier years, are missing (NA).
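Missing values like these are easy to tally before reshaping. The following sketch uses a small invented data frame standing in for the World Bank data (the column names mimic its year columns) to show the idiom:

```r
# Toy stand-in for the loaded data: two year columns with gaps
df <- data.frame(
  X1960 = c(NA, 2.5, NA),
  X1961 = c(1.2, NA, 3.4)
)

# Count the number of missing (NA) values in each column
colSums(is.na(df)) # X1960 has 2 missing values, X1961 has 1
```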
Also, because R does not allow column names to be numbers, the read.csv() function has prepended an X to each column name (which is just a number in the raw .csv file).

[image: A data frame of the untransformed World Bank educational data, with columns Country.Name, Country.Code, Indicator.Name, Indicator.Code, X1960, X1961, and X1962.]

Figure 12.3 Untransformed World Bank educational data used in Section 12.4.

While in terms of the indicator this data is in long format, in terms of the indicator and year the data is in wide format—a single column contains all the values for a single year. This structure allows you to make comparisons between years for the indicators by filtering for the indicator of interest. For example, you could compare each country's educational expenditure in 1990 to its expenditure in 2014 as follows:

```r
# Visually compare expenditures for 1990 and 2014
# (geom_text_repel() is from the ggrepel package; percent from scales)

# Begin by filtering the rows for the indicator of interest
indicator <- "Government expenditure on education, total (% of GDP)"
expenditure_plot_data <- wb_data %>%
  filter(Indicator.Name == indicator)

# Plot the expenditure in 1990 against 2014 using the `ggplot2` package
# See Chapter 16 for details
expenditure_chart <- ggplot(data = expenditure_plot_data) +
  geom_text_repel(
    mapping = aes(x = X1990 / 100, y = X2014 / 100, label = Country.Code)
  ) +
  scale_x_continuous(labels = percent) +
  scale_y_continuous(labels = percent) +
  labs(title = indicator, x = "Expenditure 1990", y = "Expenditure 2014")
```

Figure 12.4 shows that the expenditure (relative to gross domestic product) is fairly correlated between the two time points: countries that spent more in 1990 also spent more in 2014 (specifically, the correlation—calculated in R using the cor() function—is .64).
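When computing a correlation like this on real data, missing values will propagate unless you tell cor() how to handle them; its use = "complete.obs" argument drops incomplete pairs first. A small self-contained sketch (with made-up vectors, not the World Bank data):

```r
x <- c(1, 2, 3, NA, 5)
y <- c(2, 4, 6, 8, 10)

cor(x, y)                       # returns NA: the missing value propagates
cor(x, y, use = "complete.obs") # drops the incomplete pair before computing
```

Here the complete pairs are perfectly linear (y = 2x), so the second call returns a correlation of 1.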
[image: A scatterplot comparing each country's education expenditure in 1990 (horizontal axis) and 2014 (vertical axis), with 37 countries labeled by code (e.g., SWE, FIN).]

Figure 12.4 A comparison of each country's education expenditures in 1990 and 2014.

However, if you want to extend your analysis to visually compare how the expenditure across all years varies for a given country, you would need to reshape the data. Instead of having each observation be an indicator for a country, you want each observation to be an indicator for a country for a year—thereby having all of the values for all of the years in a single column and making the data long(er) format. To do this, you can gather() the year columns together:

```r
# Reshape the data to create a new column for the `year`
long_year_data <- wb_data %>%
  gather(
    key = year,    # `year` will be the new key column
    value = value, # `value` will be the new value column
    X1960:X        # all columns between `X1960` and `X` will be gathered
  )
```

As shown in Figure 12.5, this gather() statement creates a year column, so each observation (row) represents the value of an indicator in a particular country in a given year.
The expenditure for each year is stored in the value column created (coincidentally, this column is given the name "value").

[image: A data frame of the reshaped educational data, with columns Country.Name, Country.Code, Indicator.Name, year, and value.]

Figure 12.5 Reshaped educational data (long format by year). This structure allows you to more easily create visualizations across multiple years.

This structure will now allow you to compare fluctuations in an indicator's value over time (across all years):

```r
# Filter the rows for the indicator and country of interest
indicator <- "Government expenditure on education, total (% of GDP)"
spain_plot_data <- long_year_data %>%
  filter(
    Indicator.Name == indicator,
    Country.Code == "ESP" # Spain
  ) %>%
  mutate(year = as.numeric(substr(year, 2, 5))) # remove "X" before each year

# Show the educational expenditure over time
chart_title <- paste(indicator, " in Spain")
spain_chart <- ggplot(data = spain_plot_data) +
  geom_line(mapping = aes(x = year, y = value / 100)) +
  scale_y_continuous(labels = percent) +
  labs(title = chart_title, x = "Year", y = "Percent of GDP Expenditure")
```

The resulting chart, shown in Figure 12.6, uses the available data to show a timeline of the fluctuations in government expenditures on education in Spain. This produces a more complete picture of the history of educational investment, and draws attention to major changes as well as the absence of data in particular years.

[image: A line chart with gaps, showing government expenditure on education in Spain over time.]

Figure 12.6 Education expenditures over time in Spain.
You may also want to compare two indicators to each other. For example, you may want to assess the relationship between each country's literacy rate (a first indicator) and its unemployment rate (a second indicator). To do this, you would need to reshape the data again so that each observation is a particular country and each column is an indicator. Since indicators are currently in one column, you need to spread them out using the spread() function:

```r
# Reshape the data to create columns for each indicator
wide_data <- long_year_data %>%
  select(-Indicator.Code) %>% # do not include the `Indicator.Code` column
  spread(
    key = Indicator.Name, # new column names are `Indicator.Name` values
    value = value         # populate new columns with values from `value`
  )
```

This wide format data shape allows for comparisons between two different indicators. For example, you can explore the relationship between female unemployment and female literacy rates, as shown in Figure 12.7.

```r
# Prepare data and filter for year of interest
x_var <- "Literacy rate, adult female (% of females ages 15 and above)"
y_var <- "Unemployment, female (% of female labor force) (modeled ILO estimate)"
lit_plot_data <- wide_data %>%
  mutate(
    lit_percent_2014 = wide_data[, x_var] / 100,
    employ_percent_2014 = wide_data[, y_var] / 100
  ) %>%
  filter(year == "X2014")

# Show the literacy vs. employment rates
lit_chart <- ggplot(data = lit_plot_data) +
  geom_point(mapping = aes(x = lit_percent_2014, y = employ_percent_2014)) +
  scale_x_continuous(labels = percent) +
  scale_y_continuous(labels = percent) +
  labs(
    x = x_var,
    y = "Unemployment, female (% of female labor force)",
    title = "Female Literacy Rate versus Female Unemployment Rate"
  )
```

[image: A scatterplot of female literacy rate versus female unemployment rate in 2014.]

Figure 12.7 Female literacy rate versus unemployment rate in 2014.

Each comparison in this analysis—between two time points, over a full time series, and between indicators—required a different representation of the data set. Mastering the tidyr functions will allow you to quickly transform the shape of your data set, allowing for rapid and effective data analysis.
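Note that newer versions of tidyr (1.0 and later) provide pivot_longer() and pivot_wider() as more intuitive successors to gather() and spread(); the older functions still work but are no longer under active development. A sketch of the equivalent calls, using an abbreviated version of the concert-price data:

```r
library(tidyr)

# Abbreviated wide-format concert prices (as in Table 12.1)
prices <- data.frame(
  city = c("Seattle", "Portland"),
  greensky_bluegrass = c(40, 40),
  fruition = c(30, 50)
)

# pivot_longer() plays the role of gather():
# gather every column except `city` into band/price pairs
long <- pivot_longer(prices, -city, names_to = "band", values_to = "price")

# pivot_wider() plays the role of spread():
# spread the band/price pairs back out into one column per band
wide <- pivot_wider(long, names_from = band, values_from = price)
```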
For practice reshaping data with the tidyr package, see the set of accompanying book exercises.5

5tidyr exercises: https://github.com/programming-for-data-science/chapter-12-exercises

I Getting Started

The first part of this book is designed to help you install necessary software for doing data science (Chapter 1), and to introduce you to the syntax needed to provide text-based instructions to your computer using the command line (Chapter 2). Note that all of the software that you will download is free, and instructions are included for both Mac and Windows operating systems.

15 Designing Data Visualizations

Data visualization, when done well, allows you to reveal patterns in your data and communicate insights to your audience. This chapter describes the conceptual and design skills necessary to craft effective and expressive visual representations of your data. In doing so, it introduces skills for each of the following steps in the visualization process:

- Understanding the purpose of visualization
- Selecting a visual layout based on your question and data type
- Choosing optimal graphical encodings for your variables
- Identifying visualizations that are able to express your data
- Improving the aesthetics (i.e., making it readable and informative)

15.1 The Purpose of Visualization

"The purpose of visualization is insight, not pictures."1

1Card, S. K., Mackinlay, J. D., & Shneiderman, B. (1999). Readings in information visualization: Using vision to think. Burlington, MA: Morgan Kaufmann.

Generating visual displays of your data is a key step in the analytical process. While you should strive to design aesthetically pleasing visuals, it's important to remember that visualization is a means to an end.
Devising appropriate renderings of your data can help expose underlying patterns that were previously unseen, or that were undetectable by other tests. To demonstrate how visualization makes a distinct contribution to the data analysis process (beyond statistical tests), consider the canonical data set Anscombe's Quartet (which is included with the R software as the data set anscombe). This data set consists of four pairs of x and y data: (x1, y1), (x2, y2), and so on. The data set is shown in Table 15.1.

Table 15.1 Anscombe's Quartet: four data sets with two features each

x1     y1     x2     y2     x3     y3     x4     y4
10.00  8.04   10.00  9.14   10.00  7.46   8.00   6.58
8.00   6.95   8.00   8.14   8.00   6.77   8.00   5.76
13.00  7.58   13.00  8.74   13.00  12.74  8.00   7.71
9.00   8.81   9.00   8.77   9.00   7.11   8.00   8.84
11.00  8.33   11.00  9.26   11.00  7.81   8.00   8.47
14.00  9.96   14.00  8.10   14.00  8.84   8.00   7.04
6.00   7.24   6.00   6.13   6.00   6.08   8.00   5.25
4.00   4.26   4.00   3.10   4.00   5.39   19.00  12.50
12.00  10.84  12.00  9.13   12.00  8.15   8.00   5.56
7.00   4.82   7.00   7.26   7.00   6.42   8.00   7.91
5.00   5.68   5.00   4.74   5.00   5.73   8.00   6.89

The challenge of Anscombe's Quartet is to identify differences between the four pairs of columns. For example, how does the (x1, y1) pair differ from the (x2, y2) pair? Using a nonvisual approach to answer this question, you could compute a variety of descriptive statistics for each set, as shown in Table 15.2. Given these six statistical assessments, these four data sets appear to be identical. However, if you graphically represent the relationship between each x and y pair, as in Figure 15.1, you reveal the distinct nature of their relationships.

Table 15.2 Anscombe's Quartet: the (x, y) pairs share identical summary statistics

Set  Mean X  Std. Deviation X  Mean Y  Std. Deviation Y  Correlation  Linear Fit
1    9.00    3.32              7.50    2.03              0.82         y = 3 + 0.5x
2    9.00    3.32              7.50    2.03              0.82         y = 3 + 0.5x
3    9.00    3.32              7.50    2.03              0.82         y = 3 + 0.5x
4    9.00    3.32              7.50    2.03              0.82         y = 3 + 0.5x

[image: Scatterplots of the four Anscombe's Quartet data sets.]

Figure 15.1 Anscombe's Quartet: scatterplots reveal four different (x, y) relationships that are not detectable using descriptive statistics. The first set shows a scattered, positive correlation; the second shows values that fall after a steady ascent; the third shows a direct, positive correlation with a single outlier; and the fourth shows a vertical column of points with one outlier.

While computing summary statistics is an important part of the data exploration process, it is only through visual representations that differences across these sets emerge. The simple graphics in Figure 15.1 expose variations in the distributions of x and y values, as well as in the relationships between them. Thus the choice of representation becomes paramount when analyzing and presenting data. The following sections introduce basic principles for making that choice.

15.2 Selecting Visual Layouts

The challenge of visualization, like many design challenges, is to identify an optimal solution (i.e., a visual layout) given a set of constraints. In visualization design, the primary constraints are:

- The specific question of interest you are attempting to answer in your domain
- The type of data you have available for answering that question
- The limitations of the human visual processing system
- The spatial limitations in the medium you are using (pixels on the screen, inches on the page, etc.)

This section focuses on the second of these constraints (data type); the last two constraints are addressed in Section 15.3 and Section 15.4.
The first constraint (the question of interest) is closely tied to Chapter 10 on understanding data. Based on your domain, you need to home in on a question of interest, and identify a data set that is well suited for answering your question. This section will expand upon the same data set and question from Chapter 10: "What is the worst disease in the United States?" As with the Anscombe's Quartet example, most basic exploratory data questions can be reduced to investigating how a variable is distributed or how variables are related to one another.

Once you have mapped from your question of interest to a specific data set, your visualization type will largely depend on the data type of your variables. The data type of each column—nominal, ordinal, or continuous—will dictate how the information can be represented. The following sections describe techniques for visually exploring each variable, as well as making comparisons across variables.

15.2.1 Visualizing a Single Variable

Before assessing relationships across variables, it is important to understand how each individual variable (i.e., column or feature) is distributed. The primary question of interest is often: what does this variable look like? The specific visual layout you choose when answering this question will depend on whether the variable is categorical or continuous.

To use the disease burden data set as an example, you may want to know the range of the number of deaths attributable to each disease. For continuous variables, a histogram will allow you to see the distribution and range of values, as shown in Figure 15.2. Alternatively, you can use a box plot or a violin plot, both of which are shown in Figure 15.3. Note that outliers (extreme values) in the data set have been removed to better express the information in the charts.

[image: Histogram of the distribution of the number of deaths for each cause in the United States.]
Figure 15.2 The distribution of the number of deaths attributable to each disease in the United States (a continuous variable) using a histogram. Some outliers have been removed for demonstration. The frequency peaks around 1–2k deaths and gradually decreases beyond the 10k mark.

[image: Violin plot and box plot of the distribution of the number of deaths.]

Figure 15.3 Alternative visualizations for showing distributions of the number of deaths in the United States: violin plot (left) and box plot (right). Some outliers have been removed for demonstration. The box plot's box spans roughly 0 to 10k, with the median near 2k and outliers in the 25k–40k range.

While these visualizations display information about the distribution of the number of deaths by cause, they all leave an obvious question unanswered: what are the names of these diseases? Figure 15.4 uses a bar chart to label the top 10 causes of death, but due to the constraint of the page size, this display is able to express just a small subset of the data. In other words, bar charts don't easily scale to hundreds or thousands of observations because they are inefficient to scan, or won't fit in a given medium.

[image: A bar chart of the top causes of death in the United States.]
Figure 15.4 Top causes of death in the United States as shown in a bar chart. A horizontal bar chart shows the top 10 causes of death in the United States in the year 2016. The diseases are (in order of highest to lowest deaths): Ischemic heart disease; Alzheimer disease and other dementias; Tracheal, bronchus, and lung cancer; Cerebrovascular disease; Chronic obstructive pulmonary disease; Lower respiratory infections; Chronic kidney disease; Colon and rectum cancer; Diabetes mellitus; and Breast cancer. Ischemic heart disease caused the greatest number of deaths (over 500k). Alzheimer disease and other dementias led to a little over 200k deaths. Lung cancer, Cerebrovascular disease, and Chronic obstructive pulmonary disease caused over 100k deaths, while the remaining diseases caused less than 100k deaths in 2016 alone.
15.2.1.1 Proportional Representations
Depending on the data stored in a given column, you may be interested in showing each value relative to the total of the column. For example, using the disease burden data set, you may want to express each value proportional to the total number of deaths. This allows you to answer the question, Of all deaths, what percentage is attributable to each disease? To do this, you can transform the data to percentages, or use a representation that more clearly expresses parts of a whole. Figure 15.5 shows the use of a stacked bar chart and a pie chart, both of which more intuitively express proportionality. You can also use a treemap, as shown later in Figure 15.14, though the true benefit of a treemap is expressing hierarchical data (more on this later in the chapter). Later sections explore the trade-offs in perceptual accuracy associated with each of these representations. [image: Top 10 causes of death in the US - represented using a stacked bar chart and a pie chart.]
Figure 15.5 Proportional representations of the top causes of death in the United States: stacked bar chart (top) and pie chart (bottom). A figure graphically depicts the top 10 causes of deaths in the United States using a stacked bar chart (top) and a pie chart (bottom). Both graphs show the same data in different representations. They convey the following: Ischemic heart disease caused the greatest number of deaths (over 500k). Alzheimer disease and other dementias led to a little over 200k deaths. Lung cancer, Cerebrovascular disease, and Chronic obstructive pulmonary disease caused over 100k deaths, while the remaining diseases caused less than 100k deaths in 2016 alone. Note: the numbers are represented in the bar chart, while the pie chart represents only the proportions. If your variable of interest is a categorical variable, you will need to aggregate your data (e.g., count the number of occurrences of different categories) to ask similar questions about the distribution. Having done so, you can use similar techniques to show the data (e.g., bar chart, pie chart, treemap). For example, the diseases in this data set are categorized into three types of diseases: non-communicable diseases, such as heart disease or lung cancer; communicable diseases, such as tuberculosis or whooping cough; and injuries, such as road traffic accidents or self harm. To understand how this categorical variable (disease type) is distributed, you can count the number of rows for each category, then display those quantitative values, as in Figure 15.6. [image: A bar chart shows the number of cases for 3 groups: non-communicable diseases, communicable diseases, and injuries.] Figure 15.6 A visual representation of the number of causes in each disease category: noncommunicable diseases, communicable diseases, and injuries. A horizontal bar chart shows the number of causes for non-communicable diseases, communicable diseases, and injuries.
The horizontal axis represents the number of causes (ranging from 0 to 80, in increments of 20) and the vertical axis shows the cause group (non-communicable, communicable, and injuries). The chart shows the highest number of causes (around 80) for non-communicable diseases, around 50 for communicable diseases, and the lowest (around 20) for injuries.
15.2.2 Visualizing Multiple Variables
Once you have explored each variable independently, you will likely want to assess relationships between or across variables. The type of visual layout necessary for making these comparisons will (again) depend largely on the type of data you have for each variable. For comparing relationships between two continuous variables, the best choice is a scatterplot. The visual processing system is quite good at estimating the linearity in a field of points created by a scatterplot, allowing you to describe how two variables are related. For example, using the disease burden data set, you can compare different metrics for measuring health loss. Figure 15.7 compares the disease burden as measured by the number of deaths due to each cause to the number of years of life lost (a metric that accounts for the age at death for each individual). [image: Deaths versus years of life lost.] Figure 15.7 Using a scatterplot to compare two continuous variables: the number of deaths versus the years of life lost for each disease in the United States. A scatter plot compares the number of years of life lost versus the number of deaths in the United States. The concentration of points is highest when the number of deaths is less than 40k (roughly). In this region, the number of years of life lost varies around 5-10 million. There is one outlier: at over 500k deaths, the number of years of life lost is around 75 million. You can extend this approach to multiple continuous variables by creating a scatterplot matrix of all continuous features in the data set.
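A scatterplot like the one in Figure 15.7 can be sketched with the ggplot2 package (covered in Chapter 16). The data frame and its column names below are hypothetical stand-ins for the disease burden data set, with illustrative values only:

```r
# Load the ggplot2 package for plotting
library(ggplot2)

# Hypothetical stand-in for the disease burden data (illustrative values only)
burden <- data.frame(
  cause  = c("Ischemic heart disease", "Lung cancer", "Self harm", "Falls"),
  deaths = c(545000, 180000, 45000, 34000),
  ylls   = c(9e6, 3.5e6, 1.9e6, 5e5)  # years of life lost
)

# Position encoding of two continuous variables: one point per cause
ggplot(burden, aes(x = deaths, y = ylls)) +
  geom_point() +
  labs(x = "Number of Deaths", y = "Years of Life Lost (YLLs)")
```

Because position along a common scale is the most effective encoding, even small differences between causes remain legible in this layout.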
Figure 15.8 compares all pairs of metrics of disease burden, including number of deaths, years of life lost (YLLs), years lived with disability (YLDs, a measure of the disability experienced by the population), and disability-adjusted life years (DALYs, a combined measure of life lost and disability). [image: A four cross four matrix of various graphs comparing multiple continuous measurements of disease burden.] Figure 15.8 Comparing multiple continuous measurements of disease burden using a scatterplot matrix. The columns of the matrix represent DALYs, Deaths, YLDs, and YLLs, from left to right. The rows of the matrix from top to bottom represent DALYs, Deaths, YLDs, and YLLs. The horizontal axis of the DALYs column ranges from 0 to 50M. The horizontal axis of the Deaths column ranges from 0 to 400k, in increments of 200k. The horizontal axis of the YLDs column ranges from 0 to 40M, in increments of 20M. The horizontal axis of the YLLs column ranges from 0 to 50M. The vertical axis of the DALYs row ranges from 0 to 400, in increments of 100. The vertical axis of the Deaths row ranges from 0 to 400k, in increments of 200k. The vertical axis of the YLDs row ranges from 0 to 50M, in increments of 10M. The vertical axis of the YLLs row ranges from 0 to 60M, in increments of 10M. For example, when a graph is plotted for YLDs against Deaths, the horizontal axis will range from 0 to 400k, in increments of 200k, and the vertical axis will range from 0 to 50M, in increments of 10M. The panels along the diagonal of the matrix show vertical bars. The panels above the diagonal show correlation values. The panels below the diagonal show scatterplots of points; the points are denser near the origin.
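A scatterplot matrix like Figure 15.8 can be produced with base R's pairs() function (community packages such as GGally offer richer versions, e.g., with correlation values above the diagonal as in the figure). The metric values below are simulated stand-ins, not the real GBD numbers:

```r
# Simulated stand-ins for the four burden metrics (illustrative values only)
set.seed(42)
n <- 50
metrics <- data.frame(
  deaths = rexp(n, rate = 1 / 20000),  # number of deaths per cause
  ylls   = rexp(n, rate = 1 / 2e6),    # years of life lost
  ylds   = rexp(n, rate = 1 / 1e6)     # years lived with disability
)
metrics$dalys <- metrics$ylls + metrics$ylds  # DALYs combine YLLs and YLDs

# Base R scatterplot matrix: one panel for each pair of continuous variables
pairs(metrics, labels = c("Deaths", "YLLs", "YLDs", "DALYs"))
```

Each off-diagonal panel is a scatterplot of one pair of variables, so the matrix extends the two-variable comparison of Figure 15.7 to every pair at once.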
When comparing relationships between one continuous variable and one categorical variable, you can compute summary statistics for each group (see Figure 15.6), use a violin plot to display distributions for each category (see Figure 15.9), or use faceting to show the distribution for each category (see Figure 15.10). [image: Distribution of the number of deaths by three causes - represented as violin plots.] Figure 15.9 A violin plot showing the continuous distributions of the number of deaths for each cause (by category). Some outliers have been removed for demonstration. A violin plot shows the distribution of the number of deaths for 3 cause categories - communicable diseases, injuries, and non-communicable diseases. The maximum values are: around 10k for communicable diseases and around 45k for injuries and non-communicable diseases. The bulk of the deaths, as indicated by the thickness of the violin plot, falls in the range 0-2k for communicable diseases, 0-10k for injuries, and 0-18k for non-communicable diseases. Note: the values are approximate. [image: Histograms showing the distribution of number of deaths for each cause.] Figure 15.10 A faceted layout of histograms showing the continuous distributions of the number of deaths for each cause (by category). Some outliers have been removed for demonstration. Three histograms show the continuous distribution of the number of deaths for each cause category (communicable, injuries, and non-communicable diseases). The frequency is highest for communicable diseases when the number of deaths is less than 1k. The total number of deaths is not higher than 10k. Injuries have lower frequencies (less than 5 on average) and the number of deaths is nearly nil beyond 10k deaths. For non-communicable diseases, the frequency lowers as the number of deaths increases. For assessing relationships between two categorical variables, you need a layout that enables you to assess the co-occurrences of nominal values (that is, whether an observation contains both values).
A great way to do this is to count the co-occurrences and show a heatmap. As an example, consider a broader set of population health data that evaluates the leading cause of death in each country (also from the Global Burden of Disease study). Figure 15.11 shows a subset of this data, including the disease type (communicable, non-communicable) for each disease, and the region where each country is found. [image: A table shows the leading cause of death in 10 countries.] Figure 15.11 The leading cause of death in each country. The category of each disease (communicable, non-communicable) is shown, as is the region in which each country is found. The table lists 10 countries along with the region, leading cause of death, and category of the disease. The data observed is as follows. Botswana - Southern Sub-Saharan Africa, HIV/AIDS, Communicable; Brazil - Tropical Latin America, Ischemic heart disease, Non-communicable; Brunei - High-income Asia Pacific, Ischemic heart disease, Non-communicable; Bulgaria - Central Europe, Ischemic heart disease, Non-communicable; Burkina Faso - Western Sub-Saharan Africa, Malaria, Communicable; Burundi - Eastern Sub-Saharan Africa, Diarrheal diseases, Communicable; Cambodia - Southeast Asia, Lower respiratory infections, Communicable; Cameroon - Western Sub-Saharan Africa, HIV/AIDS, Communicable; Canada - High-income North America, Ischemic heart disease, Non-communicable; and Cape Verde - Western Sub-Saharan Africa, Ischemic heart disease, Non-communicable. One question you may ask about this categorical data is: “In each region, how often is the leading cause of death a communicable disease versus a non-communicable disease?” To answer this question, you can aggregate the data by region, and count the number of times each disease category (communicable, non-communicable) appears as the category for the leading cause of death. This aggregated data (shown in Figure 15.12) can then be displayed as a heatmap, as in Figure 15.13.
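This aggregate-then-encode workflow can be sketched as follows; the data frame contains a few hypothetical rows mirroring Figure 15.11, and the counting is done with base R's table() function:

```r
library(ggplot2)

# A few rows mirroring Figure 15.11 (hypothetical subset of the data)
leading_causes <- data.frame(
  region   = c("Western Sub-Saharan Africa", "Western Sub-Saharan Africa",
               "Central Europe", "High-income North America"),
  category = c("Communicable", "Non-communicable",
               "Non-communicable", "Non-communicable")
)

# Count the co-occurrences of each (region, category) pair, as in Figure 15.12
counts <- as.data.frame(table(leading_causes$region, leading_causes$category))
names(counts) <- c("region", "category", "num_countries")

# Encode the counts with a color (fill) encoding in a heatmap, as in Figure 15.13
ggplot(counts, aes(x = category, y = region, fill = num_countries)) +
  geom_tile()
```

Note that table() produces a cell for every (region, category) combination, including zeros, which is exactly what a heatmap grid needs.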
[image: A table shows the number of countries in each region in which the leading cause of death is communicable/non-communicable disease.] Figure 15.12 Number of countries in each region in which the leading cause of death is communicable/non-communicable. The table lists 8 regions according to the category of leading cause (communicable/non-communicable) and the number of countries in the region. The data observed from the table is as follows. Andean Latin America - Communicable, 1; Andean Latin America - Non-Communicable, 2; Australasia - Non-Communicable, 2; Caribbean - Non-Communicable, 18; Central Asia - Non-Communicable, 9; Central Europe - Non-Communicable, 13; Central Latin America - Communicable, 1; and Central Latin America - Non-Communicable, 8. [image: A heatmap of the number of countries where the leading cause of death is communicable/non-communicable.] Figure 15.13 A heatmap of the number of countries in each region in which the leading cause of death is communicable/non-communicable. The horizontal axis represents the leading cause of death. The vertical axis represents the regions: Western Sub-Saharan Africa, Western Europe, Tropical Latin America, Southern Sub-Saharan Africa, Southern Latin America, Southeast Asia, South Asia, Oceania, North Africa and Middle East, High-income North America, Eastern Sub-Saharan Africa, Eastern Europe, East Asia, Central Sub-Saharan Africa, Central Latin America, Central Europe, Central Asia, Caribbean, Australasia, and Andean Latin America. The heatmap shows two columns of cells, representing communicable and non-communicable, with shading indicating the number of countries (from 1 to 20).
15.2.3 Visualizing Hierarchical Data
One distinct challenge is showing a hierarchy that exists in your data. If your data naturally has a nested structure in which each observation is a member of a group, visually expressing that hierarchy can be critical to your analysis.
Note that there may be multiple levels of nesting for each observation (observations may be part of a group, and that group may be part of a larger group). For example, in the disease burden data set, each country is found within a particular region, which can be further categorized into larger groupings called super-regions. Similarly, each cause of death (e.g., lung cancer) is a member of a family of causes (e.g., cancers), which can be further grouped into overarching categories (e.g., non-communicable diseases). Hierarchical data can be visualized using treemaps (Figure 15.14), circle packing (Figure 15.15), sunburst diagrams (Figure 15.16), or other layouts. Each of these visualizations uses an area encoding to represent a numeric value. These shapes (rectangles, circles, or arcs) are organized in a layout that clearly expresses the hierarchy of information. [image: A treemap showing the number of deaths in the U.S. for both sexes and all ages, in 2016.] Figure 15.14 A treemap of the number of deaths in the United States from each cause. Screenshot from GBD Compare, a visualization tool for the global burden of disease (https://vizhub.healthdata.org/gbd-compare/). The treemap shows three regions. The first region, filled in blue, includes IHD, stroke, lung c, liver c, stomach c, colorect C, breast c, oth neopla, leukemia, cervic C, lymphoma, prostate C, Esophag C, Lip oral C, Kidney C, Pancreas C, Ovary C, Bladder C, Uterus C, Myeloma, Melanoma, Skin C, Aort An, A fib, HTN HD, Oth Cardio, RHD, CMP, Endocar, PAD, Brain C, Drugs, Oth MSK, Diabetes CKD, Alzheimer, Urinary, Endocrine, Oth Neuro, Parkinson, MS, ALS, COPD, Asthma, Cirr HepC, Cirr Alc, Oth Cirr, Ileus, Oth Digest, Gall Bile, IBD, and Vasc Intest. The second region, filled in red, includes LRI and HIV. The third region, filled in green, includes Falls, Road Inj, Violence, Self Harm, Fire, and F body. [image: A circle pack layout represents the disease burden in the United States.] 
Figure 15.15 A re-creation of the treemap visualization (of disease burden in the United States) using a circle pack layout. Created using the d3.js library https://d3js.org. The circle pack layout encloses a bigger circle and two smaller circles. The various diseases visualized in the bigger circle with varying sizes of inner circles are (filled with blue) Breast C, COPD, Colorect C, CMP, Alzheimer's, Stroke, Drugs, Lung C, Prostate C, Oth Cardio, HTN HD, CKD, Diabetes, Pancreas C, and IHD. The diseases visualized in the two smaller circles are LRI (filled with red); Road Inj, Falls, and Self Harm (filled with green). [image: A sunburst chart represents the disease burden in the United States.] Figure 15.16 A re-creation of the treemap visualization (of disease burden in the United States) using a sunburst diagram. Created using the d3.js library https://d3js.org. The sunburst chart has a root node followed by two levels of concentric rings that are sliced to represent each category of diseases, the size of the sections corresponding to the value. Level 1 has three nodes, "Non-communicable" (covers the major portion, filled in blue) and two other parent nodes: "Injuries," that has an unnamed leaf node in the next level (both levels filled in green) and "Comm" parent node has its leaf node "LRI" in the next level (both levels filled in red). Level 2 has various nodes (segments): Alzheimer's, IHD, Stroke, COPD, Diabetes, CKD, Lung C, and Colorect C (all filled in blue). The benefit of visualizing the hierarchy of a data set, however, is not without its costs. As described in Section 15.3, it is quite difficult to visually decipher and compare values encoded in a treemap (especially with rectangles of different aspect ratios). However, these displays provide a great summary overview of hierarchies, which is an important starting point for visually exploring data. 
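A treemap in the spirit of Figure 15.14 (which was produced with the GBD Compare tool) can be sketched in R with the community treemapify package; the package choice is an assumption here (it is not used in this chapter), and the category, cause, and death values are illustrative:

```r
library(ggplot2)
library(treemapify)  # community package for treemap layouts (assumed installed)

# Illustrative two-level hierarchy: causes nested within disease categories
causes <- data.frame(
  category = c("Non-communicable", "Non-communicable",
               "Communicable", "Injuries"),
  cause    = c("IHD", "Stroke", "LRI", "Falls"),
  deaths   = c(545000, 140000, 51000, 34000)  # illustrative values
)

# Area encodes deaths; fill and subgroup borders express the hierarchy
ggplot(causes, aes(area = deaths, fill = category,
                   subgroup = category, label = cause)) +
  geom_treemap() +
  geom_treemap_subgroup_border() +
  geom_treemap_text()
```

As the chapter cautions, the area encoding here is harder to decode precisely than position, so a treemap is best used for overview rather than exact comparison.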
15.3 Choosing Effective Graphical Encodings
While the previously given guidelines for selecting visual layouts based on the data relationship to explore are a good place to start, there are often multiple ways to represent the same data set. Representing data in another format (e.g., visually) is called encoding that data. When you encode data, you use a particular “code” such as color or size to represent each value. These visual representations are then visually decoded by anyone trying to interpret the underlying values. Your task is thus to select the encodings that are most accurately decoded by users, answering the question: “What visual form best allows you to exploit the human visual system and available space to accurately display your data values?” In designing a visual layout, you should choose the graphical encodings that are most accurately visually decoded by your audience. This means that, for every value in your data, your user’s interpretation of that value should be as accurate as possible. The accuracy of these perceptions is referred to as the effectiveness of a graphical encoding. Academic research2 measuring the perceptiveness of different visual encodings has established a common set of possible encodings for quantitative information, listed here in order from most effective to least effective:
- Position: the horizontal or vertical position of an element along a common scale
- Length: the length of a segment, typically used in a stacked bar chart
- Area: the area of an element, such as a circle or a rectangle, typically used in a bubble chart (a scatterplot with differently sized markers) or a treemap
- Angle: the rotational angle of each marker, typically used in a circular layout like a pie chart
- Color: the color of each marker, usually along a continuous color scale
- Volume: the volume of a three-dimensional shape, typically used in a 3D bar chart
2Most notably, Cleveland, W. S., & McGill, R. (1984).
Graphical perception: Theory, experimentation, and application to the development of graphical methods. Journal of the American Statistical Association, 79(387), 531–554. https://doi.org/10.1080/01621459.1984.10478080 As an example, consider the very simple data set in Table 15.3. An effective visualization of this data set would enable you to easily distinguish between the values of each group (e.g., between the values 10 and 11). While this identification is simple for a position encoding, detecting this 10% difference is very difficult for other encodings. Comparisons between encodings of this data set are shown in Figure 15.17. Table 15.3 A simple data set to demonstrate the perceptiveness of different graphical encodings (shown in Figure 15.17). Users should be able to visually distinguish between these values.
group  value
a      1
b      10
c      11
d      7
e      8
[image: Different types of graphical encodings: Position Encoding, Length Encoding, Area Encoding, and Color Encoding are presented for the same dataset.] Figure 15.17 Different graphical encodings of the same data. Note the variation in perceptibility of differences between values! The horizontal axis for all the encoding types represents "group" that reads from 'a' to 'e'. The vertical axis for the position encoding represents "value" ranging from 0 to 9 in increments of 3. It is visualized as dots and it conveys the following data: (a, 1), (b, 10), (c, 11.5), (d, 7), and (e, 8.5). The vertical axis for the length encoding represents "value" ranging from 0 to 30 in increments of 10. It is visualized in the form of bands from 'a' to 'e' and it conveys the following data (from bottom to top): (band e, 0 to 8), (band d, 8 to 15), (band c, 15 to 25), (band b, 25 to 35), and (band a, 35 to 37). For area encoding, the vertical axis represents "value" that includes 0.0, 2.5, 5.0, 7.5, and 10.0. It is visualized as dots and the size of the dots increases as the value increases.
All the plots are marked at the middle and thus convey the data: (a, 0.0), (b, 7.5), (c, 7.5), (d, 5.0), and (e, 5.0). For color encoding, the vertical axis represents "value" that includes 0.0, 2.5, 5.0, 7.5, and 10.0. It is visualized as small circles and the shades of the circles change as the value increases. All the circles are marked at the middle and thus convey the data: (a, 0.0), (b, 9.6), (c, 10.0), (d, 6.5), and (e, 7.5). Note: All the values are marked approximately. Thus when a visualization designer makes a blanket claim like “You should always use a bar chart rather than a pie chart,” the designer is really saying, “A bar chart, which uses position encoding along a common scale, is more accurately visually decoded compared to a pie chart (which uses an angle encoding).” To design your visualization, you should begin by encoding the most important data features with the most accurately decoded visual features (position, then length, then area, and so on). This will provide you with guidance as you compare different chart options and begin to explore more creative layouts. While these guidelines may feel intuitive, the volume and distribution of your data often make this task more challenging. You may struggle to display all of your data, requiring you to also work to maximize the expressiveness of your visualizations (see Section 15.4).
15.3.1 Effective Colors
Color is one of the most prominent visual encodings, so it deserves special consideration. To describe how to use color effectively in visualizations, it is important to understand how color is measured.
While there are many different conceptualizations of color spaces, a useful one for visualization is the hue–saturation–lightness (HSL) model, which defines a color using three attributes:
- The hue of a color, which is likely how you think of describing a color (e.g., “green” or “blue”)
- The saturation or intensity of a color, which describes how “rich” the color is on a linear scale between gray (0%) and the full display of the hue (100%)
- The lightness of the color, which describes how “bright” the color is on a linear scale from black (0%) to white (100%)
This color model can be seen in Figure 15.18, which is an example of an interactive color selector3 that allows you to manipulate each attribute independently to pick a color. The HSL model provides a good foundation for color selection in data visualization. 3HSL Calculator by w3schools: https://www.w3schools.com/colors/colors_hsl.asp [image: A snapshot of the H S L Calculator.] Figure 15.18 An interactive hue–saturation–lightness color picker, from w3schools. The screen shows a thumbnail view of the selected color on the left. The text box on the right reads, hsl(153, 87%, 55%). Three fields along with sliders are shown at the bottom: H set to 153, S set to 87, and L set to 55. When selecting colors for visualization, the data type of your variable should drive your decisions. Depending on the data type (categorical or continuous), the purpose of your encoding will likely be different:
- For categorical variables, a color encoding is used to distinguish between groups. Therefore, you should select colors with different hues that are visually distinct and do not imply a rank ordering.
- For continuous variables, a color encoding is used to estimate values. Therefore, colors should be picked using a linear interpolation between color points (i.e., different lightness values).
Picking colors that most effectively satisfy these goals is trickier than it seems (and beyond the scope of this short section).
But as with any other challenge in data science, you can build upon the open source work of other people. One of the most popular tools for picking colors (especially for maps) is Cynthia Brewer’s ColorBrewer.4 This tool provides a wonderful set of color palettes that differ in hue for categorical data (e.g., “Set3”) and in lightness for continuous data (e.g., “Purples”); see Figure 15.19. Moreover, these palettes have been carefully designed to be viewable to people with certain forms of color blindness. These palettes are available in R through the RColorBrewer package; see Chapter 16 for details on how to use this package as part of your visualization process. 4ColorBrewer: http://colorbrewer2.org [image: All palettes in the RColorBrewer package are displayed.] Figure 15.19 All palettes made available by the RColorBrewer package in R. Run the display.brewer.all() function to see them in RStudio. The package includes the following: YlOrRd, YlOrBr, YlGnBu, YlGn, Reds, RdPu, Purples, PuRd, PuBuGn, PuBu, OrRd, Oranges, Greys, Greens, GnBu, BuPu, BuGn, Blues, Set3, Set2, Set1, Pastel2, Pastel1, Paired, Dark2, Accent, Spectral, RdYlGn, RdYlBu, RdGy, RdBu, PuOr, PRGn, PiYG, and BrBG. Selecting between different types of color palettes depends on the semantic meaning of the data. This choice is illustrated in Figure 15.20, which shows map visualizations of the population of each county in Washington state. The choice between different types of continuous color scales depends on the data:
- Sequential color scales are often best for displaying continuous values along a linear scale (e.g., for this population data).
- Diverging color scales are most appropriate when the divergence from a center value is meaningful (e.g., the midpoint is zero). For example, if you were showing changes in population over time, you could use a diverging scale to show increases in population using one hue, and decreases in population using another hue.
- Multi-hue color scales afford an increase in contrast between colors by providing a broader color range. While this allows for more precise interpretations than a (single hue) sequential color scale, the user may misinterpret or misjudge the differences in hue if the scale is not carefully chosen.
- Black and white color scales are equivalent to sequential color scales (just with a hue of gray!) and may be required for your medium (e.g., when printing in a book or newspaper).
[image: A set of four Washington maps showing the population data in four different ColorBrewer scales: Sequential, Greens; Diverging, Red/Blue; Multi-Hue, Green/Blue; and Black/White.] Figure 15.20 Population data in Washington represented with four ColorBrewer scales. The sequential and black/white scales accurately represent continuous data, while the diverging scale (inappropriately) implies divergence from a meaningful center point. Colors in the multi-hue scale may be misinterpreted as having different meanings. Overall, the choice of color will depend on the data. Your goal is to make sure that the color scale chosen enables the viewer to most effectively distinguish between the data’s values and meanings.
15.3.2 Leveraging Preattentive Attributes
You often want to draw attention to particular observations in your visualizations. This can help you drive the viewer’s focus toward specific instances that best convey the information or intended interpretation (to “tell a story” about the data). The most effective way to do this is to leverage the natural tendencies of the human visual processing system to direct a user’s attention. This class of natural tendencies is referred to as preattentive processing: the cognitive work that your brain does without you deliberately paying attention to something.
More specifically, these are the “[perceptual] tasks that can be performed on large multi-element displays in less than 200 to 250 milliseconds.”5 As detailed by Colin Ware,6 the visual processing system will automatically process certain stimuli without any conscious effort. As a visualization designer, you want to take advantage of visual attributes that are processed preattentively, making your graphics as rapidly understood as possible. 5Healey, C. G., & Enns, J. T. (2012). Attention and visual memory in visualization and computer graphics. IEEE Transactions on Visualization and Computer Graphics, 18(7), 1170–1188. https://doi.org/10.1109/TVCG.2011.127. Also at: https://www.csc2.ncsu.edu/faculty/healey/PP/ 6Ware, C. (2012). Information visualization: Perception for design. Philadelphia, PA: Elsevier. As an example, consider Figure 15.21, in which you are able to count the occurrences of the number 3 at dramatically different speeds in each graphic. This is possible because your brain naturally identifies elements of the same color (more specifically, opacity) without having to put forth any effort. This technique can be used to drive focus in a visualization, thereby helping people quickly identify pertinent information. [image: Two graphics on the left and right depict preattentive attributes.] Figure 15.21 Because opacity is processed preattentively, the visual processing system identifies elements of interest (the number 3) without effort in the right graphic, but not in the left graphic. The graphic on the left and right shows a big sequence of numbers with a text above it reading How many 3s are there? In the graph on the right, the opacity of number 3 is high, when compared to other numbers. In addition to color, you can use other visual attributes that help viewers preattentively distinguish observations from those around them, as illustrated in Figure 15.22. 
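A preattentive highlight like the color panel of Figure 15.22 can be sketched by mapping a Boolean "selected" column to the color encoding; the simulated points and the choice of which point to flag are arbitrary:

```r
library(ggplot2)

# Random points, with a single (arbitrarily chosen) observation flagged
set.seed(1)
points_df <- data.frame(x = runif(30, 0, 10), y = runif(30, 0, 10))
points_df$selected <- seq_len(nrow(points_df)) == 15

# Color is processed preattentively, so the flagged point "pops out"
ggplot(points_df, aes(x = x, y = y, color = selected)) +
  geom_point(size = 2) +
  scale_color_manual(values = c("FALSE" = "gray60", "TRUE" = "red"),
                     guide = "none")
```

The same pattern works for the other preattentive attributes: map the flag column to shape, size, or alpha instead of color.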
Notice how quickly you can identify the “selected” point—though this identification happens more rapidly with some encodings (i.e., color) than with others! [image: A set of five graphs illustrates the preattentive attributes.] Figure 15.22 Driving focus with preattentive attributes. The selected point is clear in each graph, but especially easy to detect using color. The horizontal (x) and vertical (y) axes of all the graphs range from 0.0 to 10.0, in increments of 2.5. The first graph titled color shows all points in black except one in red. The second graph titled shape shows all points as circles except one as a cross mark. The third graph titled enclosure shows all points, except one enclosed in a rectangle. The fourth graph titled opacity shows all points dimmed, except one, which is bold. The fifth graph titled size shows all points of the same size, except one which is larger than the rest. As you can see, color and opacity are two of the most powerful ways to grab attention. However, you may find that you are already using color and opacity to encode a feature of your data, and thus can’t also use these encodings to draw attention to particular observations. In that case, you can consider the remaining options (e.g., shape, size, enclosure) to direct attention to a specific set of observations.
15.4 Expressive Data Displays
The other principle you should use to guide your visualization design is to choose layouts that allow you to express as much data as possible. This goal was originally articulated as Mackinlay’s Expressiveness Criteria7 (clarifications added): 7Mackinlay, J. (1986). Automating the design of graphical presentations of relational information. ACM Transactions on Graphics, 5(2), 110–141. https://doi.org/10.1145/22949.22950. Restatement by Jeffrey Heer. A set of facts [data] is expressible in a language [visual layout] if that language contains a sentence [form] that encodes all the facts in the set, and encodes only the facts in the set.
The point of this expressiveness criterion is to devise visualizations that express all of (and only) the data in your data set. The most common barrier to expressiveness is occlusion (overlapping data points). As an example, consider Figure 15.23, which visualizes the distribution of the number of deaths attributable to different causes in the United States. This chart uses the most perceptually effective visual encoding (position), but fails to express all of the data due to the overlap in values. [image: A chart representing the distribution of the number of deaths in the United States.] Figure 15.23 Position encoding of the number of deaths from each cause in the United States. Notice how the overlapping points (occlusion) prevent this layout from expressing all of the data. Some outliers have been removed for demonstration. The horizontal band representing the number of deaths for each cause ranges from 0 to 40k, in increments of 10k. The chart shows solid circles marked along the axis representing the appropriate data. The circles from 0 to 20k are dense and overlap one another. The circles from 20k to 30k are fewer than in the previous range, yet still overlap. The circles from 30k to 40k are few in number and do not overlap. There are two common approaches to addressing the failure of expressiveness caused by overlapping data points: Adjust the opacity of each marker to reveal overlapping data. Break the data into different groupings or facets to alleviate the overlap (by showing only a subset of the data at a time). These two approaches are implemented in combination in Figure 15.24. [image: A set of three charts representing the distribution of the number of deaths from each cause in the United States.] Figure 15.24 Position encoding of the number of deaths from each cause in the United States, faceted by the category of each cause. The use of a lower opacity in conjunction with the faceting enhances the expressiveness of the plots.
Some outliers have been removed for demonstration. The horizontal band of each chart represents the number of deaths for each cause and ranges from 0 to 40k, in increments of 10k. The first chart, titled communicable, shows solid circles overlapping in the range 0 to 10k. The second chart, titled injuries, shows solid circles overlapping in the ranges 0 to 10k, near 20k, near 35k, and after 40k. The third chart, titled non-communicable, shows solid circles overlapping one another throughout the band; the circles are densest near the left end. Alternatively, you could consider changing the data that you are visualizing by aggregating it in an appropriate way. For example, you could group your data by values that have a similar number of deaths (putting each into a “bin”), and then use a position encoding to show the number of observations per bin. The result is the commonly used layout known as a histogram, as shown in Figure 15.25. While this visualization does communicate summary information to your audience, it is unable to express each individual observation in the data (which would communicate more information through the chart). [image: A histogram named "Distribution of the Number of Deaths for Each Cause" represents the number of deaths attributable to each cause.] Figure 15.25 Histogram of the number of deaths attributable to each cause. The horizontal axis of the histogram represents "Number of Deaths," ranging from 0 to 40k in increments of 10k. The vertical axis represents "Number of Causes," ranging from 0 to 40 in increments of 10. The histogram shows approximately the following (bin, count) pairs: (0, 47), (1k, 15), (2k, 10), (6k, 10), (9k, 7), (20k, 0), (39k, 1), and (44k, 1). At times, the expressiveness and effectiveness principles are at odds with one another. In an attempt to maximize expressiveness (and minimize the overlap of your symbols), you may have to choose a less effective encoding.
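The occlusion-handling techniques just discussed—lowering opacity, faceting, and binning into a histogram—can be sketched in ggplot2 as follows. The `death_data` frame and its column names (`deaths`, `category`) are hypothetical stand-ins for the book’s data set, simulated here so the example is self-contained:

```r
# Sketch: three ways to address occlusion, using simulated data that
# loosely mimics the skewed "deaths per cause" distribution.
library(ggplot2)

set.seed(1)
death_data <- data.frame(
  deaths = c(rexp(80, rate = 1 / 5000), 35000, 44000),
  category = sample(
    c("communicable", "injuries", "non-communicable"), 82, replace = TRUE
  )
)

# Approaches 1 and 2 combined (as in Figure 15.24): lower the opacity
# of each marker (alpha) and break the data into facets by category
occlusion_plot <- ggplot(death_data, aes(x = deaths, y = 0)) +
  geom_point(alpha = 0.3) +
  facet_wrap(~category, ncol = 1)

# Approach 3 (as in Figure 15.25): aggregate observations into bins
histogram_plot <- ggplot(death_data, aes(x = deaths)) +
  geom_histogram(binwidth = 1000)
```

Note the trade-off: the faceted dot plot expresses every observation, while `geom_histogram()` expresses only the per-bin counts.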
While there are multiple strategies for this—for example, breaking the data into multiple plots, aggregating the data, and changing the opacity of your symbols—the most appropriate choice will depend on the distribution and volume of your data, as well as the specific question you wish to answer. 15.5 Enhancing Aesthetics Following the principles described in this chapter will go a long way in helping you devise informative visualizations. But to gain trust and attention from your potential audiences, you will also want to spend time investing in the aesthetics (i.e., beauty) of your graphics. Tip Making beautiful charts is a practice of removing clutter, not adding design. One of the most renowned data visualization theorists, Edward Tufte, frames this idea in terms of the data–ink ratio.8 Tufte argues that in every chart, you should maximize the ink dedicated to displaying the data (and in turn, minimize the non-data ink). This can translate into a number of actions: 8Tufte, E. R. (1986). The visual display of quantitative information. Cheshire, CT: Graphics Press. Remove unnecessary encodings. For example, if you have a bar chart, the bars should differ in color only if that color encodes information that isn’t otherwise expressed. Avoid visual effects. Any 3D effects, unnecessary shading, or other distracting formatting should be avoided. Tufte refers to such elements as “chartjunk.” Include chart and axis labels. Provide a title for your chart, as well as meaningful labels for your axes. Lighten legends/labels. Reduce the size or opacity of axis labels. Avoid using striking colors. It’s easy to look at a chart such as the one on the left side of Figure 15.26 and claim that it looks unpleasant. However, describing why it looks distracting and how to improve it can be more challenging. If you follow the tips in this section and strive for simplicity, you can remove unnecessary elements and drive focus to the data (as shown on the right-hand side of Figure 15.26).
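As a hedged before-and-after illustration of these clutter-reduction actions, the following ggplot2 sketch removes a redundant color encoding and adds informative labels. The `group_data` frame is hypothetical, invented to stand in for the data behind a chart like Figure 15.26:

```r
# Sketch: improving the data-ink ratio of a simple bar chart.
# The data values here are made up for illustration.
library(ggplot2)

group_data <- data.frame(
  group = c("a", "b", "c", "d", "e"),
  size = c(4, 9, 7, 3, 6)
)

# Cluttered: a redundant fill encoding (group is already shown on the
# x-axis, so the colors add ink without adding information)
cluttered_plot <- ggplot(group_data, aes(x = group, y = size, fill = group)) +
  geom_col()

# Cleaner: a single fill color, meaningful labels, and a minimal theme
clean_plot <- ggplot(group_data, aes(x = group, y = size)) +
  geom_col(fill = "steelblue") +
  labs(title = "Group Sizes", x = "group", y = "size") +
  theme_minimal()
```

Here `theme_minimal()` strips the gray background and heavy grid lines, and `labs()` supplies the chart and axis labels the checklist calls for.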
[image: Two bar charts, on the left and right, illustrate enhancing the visualization and the addition of informative labels.] Figure 15.26 Removing distracting and uninformative visual features (left) and adding informative labels to create a cleaner chart (right). The chart on the left shows a horizontal axis marked with the values a, b, c, d, and e, and a vertical axis ranging from 0 to 12 in increments of 2; its five bars are each a different color. The chart on the right shows a horizontal axis labeled group, marked with the values a, b, c, d, and e, and a vertical axis labeled size, ranging from 0 to 9 in increments of 3; it represents the group size data. Luckily, many of these optimal choices are built into the default R packages for visualization, or are otherwise readily implemented. That being said, you may have to adhere to the aesthetics of your organization (or your own preferences!), so choosing an easily configurable visualization package (such as ggplot2, described in Chapter 16) is crucial. As you begin to design and build visualizations, remember the following guidelines: Dedicate each visualization to answering a specific question of interest. Select a visual layout based on your data type. Choose optimal graphical encodings based on how well they are visually decoded. Ensure that your layout is able to express your data. Enhance the aesthetics by removing visual effects and by including clear labels. These guidelines will be a helpful start, and don’t forget that visualizations are about insights, not pictures.