Secondary Data Analysis (Pocket Guides to Social Work Research Methods)

In recent decades, social work and other social science research disciplines have become increasingly reliant on large secondary data sets, which have increased in both number and accessibility. When starting a new research project, how does one determine whether to use a secondary data set? Which of the thousands available should be used? This invaluable and expertly written guide provides an in-depth introduction to 29 of the most widely used data sets in social work, such as the Early Childhood Longitudinal Study, the National Health and Nutrition Examination Survey, and the U.S. Census. This book also examines the years covered by these cross-sectional and longitudinal data sets, the units of analysis, and the sample sizes. Readers will learn where to find the data and the key variables contained within, and how to use them in SAS and Stata. Screen shots guide researchers through data sets in a step-by-step process: how to download the data, how to merge it with other data sets, and how to program it when necessary. Each section also profiles studies that have used the respective data sets, giving researchers a clear feel for the depth and range of questions that a given data source can be used to answer, like the use of government data to explore issues ranging from pathways out of poverty to the relationship between marital dissolution and women's health and well-being. Exceptionally well calibrated and filled with real-world examples, this pocket guide will give beginning and advanced researchers a comprehensive understanding of these data sets that they can use in their research on clinical, policy, and other types of studies.
Year: 2010
Edition: 1
Publisher: Oxford University Press, USA
Language: english
Pages: 225
ISBN 13: 9780195388817
ISBN: 019538881X
Secondary Data Analysis

Series Editor
Tony Tripodi, DSW
Professor Emeritus, Ohio State University

Determining Sample Size:
Balancing Power, Precision, and
Patrick Dattalo
Preparing Research Articles
Bruce A. Thyer
Systematic Reviews and
Julia H. Littell, Jacqueline
Corcoran, and Vijayan Pillai
Historical Research
Elizabeth Ann Danto
Confirmatory Factor Analysis
Donna Harrington
Randomized Controlled Trials:
Design and Implementation for
Community-Based Psychosocial
Phyllis Solomon, Mary M.
Cavanaugh, and Jeffrey Draine

Intervention Research:
Developing Social Programs
Mark W. Fraser, Jack M.
Richman, Maeda J. Galinsky, and
Steven H. Day
Developing and Validating Rapid
Assessment Instruments
Neil Abell, David W. Springer,
and Akihito Kamata
Clinical Data-Mining:
Integrating Practice and Research
Irwin Epstein
Strategies to Approximate Random
Sampling and Assignment
Patrick Dattalo
Analyzing Single System
Design Data
William R. Nugent
Survival Analysis
Shenyang Guo

Needs Assessment
David Royse, Michele
Staton-Tindall, Karen Badger, and
J. Matthew Webster

The Dissertation:
From Beginning to End
Peter Lyons and Howard J.

Multiple Regression with Discrete
Dependent Variables
John G. Orme and Terri

Cross-Cultural Research
Jorge Delva, Paula Allen-Meares,
and Sandra L. Momper

Developing Cross-Cultural
Thanh V. Tran

Secondary Data Analysis
Thomas P. Vartanian


Oxford University Press, Inc., publishes works that further
Oxford University’s objective of excellence
in research, scholarship, and education.
Oxford New York
Auckland Cape Town Dar es Salaam Hong Kong Karachi
Kuala Lumpur Madrid Melbourne Mexico City Nairobi
New Delhi Shanghai Taipei Toronto
With offices in
Argentina Austria Brazil Chile Czech Republic France Greece
Guatemala Hungary Italy Japan Poland Portugal Singapore
South Korea Switzerland Thailand Turkey Ukraine Vietnam

Copyright © 2011 by Oxford University Press, Inc.
Published by Oxford University Press, Inc.
198 Madison Avenue, New York, New York 10016
Oxford is a registered trademark of Oxford University Press
All rights reserved. No part of this publication may be reproduced,
stored in a retrieval system, or transmitted, in any form or by any means,
electronic, mechanical, photocopying, recording, or otherwise,
without the prior permission of Oxford University Press.
Library of Congress Cataloging-in-Publication Data
Vartanian, Thomas P.
Secondary data analysis / Thomas P. Vartanian.
p. cm. — (Pocket guides to social work research methods)
Includes bibliographical references and index.
ISBN 978-0-19-538881-7
1. Social service—Research. I. Title.
HV11.V347 2011
Printed in the United States of America
on acid-free paper


I would like to thank Barb Toews, Marie Guldin, and Molly Graepel for
their research assistance on this book, and Philip Gleason and Linda
Houser for their helpful comments during the writing of this book.
Funding for this research was provided by Bryn Mawr College.


1 Introduction
2 What is a Secondary Data Set?
3 Advantages, Disadvantages, Feasibility, and Appropriateness of Using Secondary Data
Advantages of Secondary Data
Disadvantages to Secondary Data
Determining the Feasibility and Appropriateness of Using Secondary Data
4 Secondary Datasets


Adoption and Foster Care Analysis and Reporting System
Child Neglect: Cross Sector Service Paths and Outcomes
Common Core of Data
Continuing Survey of Food Intake by Individuals
Current Population Survey
Developmental Victimization Survey
Early Childhood Longitudinal Survey
Fragile Families and Child Wellbeing Study
General Social Survey
Health and Retirement Study
Longitudinal Studies of Child Abuse and Neglect
National Child Abuse and Neglect Data System
National Educational Longitudinal Survey
National Health and Nutrition Examination Survey
The National Longitudinal Study of Adolescent Health
National Longitudinal Surveys
National Medical Expenditure Survey/Medical Expenditure
Panel Survey
National Survey of America’s Families
National Survey of Child and Adolescent Well-Being
NICHD Study of Early Child Care and Youth Development
The Panel Study of Income Dynamics
Panel Study of Income Dynamics, Child Development Supplement
Project on Human Development in Chicago Neighborhoods
Public-Use Microdata Samples
School Data Direct
Survey of Income and Program Participation
Survey of Program Dynamics
United States Census
Welfare, Children, and Families: A Three City Study

Appendix Tables


References
Index


Introduction

Social work research has become increasingly reliant on large secondary data sets. These data sets, generally collected by governments,
research institutions, and, in some cases, agencies, provide researchers
with readily available resources to examine characteristics of populations
or particular hypotheses. These data differ from primary data in that
primary data sets are collected by the researcher who will also examine
that data. Researchers collect primary data directly through interviews,
questionnaires, focus groups, observation, the examination of primary
sources such as writings or speeches, or a variety of other such collection
methods. While collecting data is often the best way to obtain the information necessary to analyze particular hypotheses, it is not always economically or practically feasible. Using large secondary data sets provides
an alternative to the collection of primary data, often giving the researcher
access to more information than would be available in primary data sets.
Secondary data can include any data that are examined to answer a
research question other than the question(s) for which the data were initially collected. Large institutions are able to obtain far larger samples
and often are able to ask more questions than researchers who are in
smaller settings (such as individual or small-group researchers). Over
time, data sets have become richer, as researchers refine the types of
questions asked in surveys. That being said, many secondary data sets
available were primary data sets when they first started, and they grew to
become the larger data sets that they now are. Also, many of the secondary data sets available today look to primary researchers, often qualitative
researchers, for the questions that they ask their sample members.
Without such fieldwork by primary researchers, larger secondary data
sets would not be as rich as they are.
To give an example of how research has changed over time, I examine Social Service Review (SSR) during 1980 and 2007 (I randomly picked
these two years) to determine the data sources for articles published in
those years. In 1980, approximately 32 articles were published in SSR as
main articles or notes, and, of these, six used some form of secondary
data, either administrative or survey data; nine used primary data
sources; and 17 of the articles used no data. Things changed dramatically
by 2007, when, of the 22 articles published that year, 18 used some form
of large data set, primarily secondary survey data; one used primary data;
and three used no data. Although this is only a “snapshot” of a trend
based on one elite social work journal, it would appear that the use of
secondary data is becoming increasingly important.
In this book, I will briefly discuss what secondary data sets look like,
as well as some of the advantages and disadvantages of collecting and
using primary and secondary data in a research study. I will take you
through a series of questions to help you decide which type of data will
work best for your research question. I then turn my attention to the
central topic of the book, the use of secondary data sets.
This book examines the types of secondary data sets available to
researchers and how these have been used in the past and may be used in
the future. While thousands of secondary data sets are available from
private sources, universities, federal and other government agencies, and
other public sources, I focus on data sets that are often used by social
work and other social science researchers. Many of these data sets cover
a wide variety of topics; others focus on particular topics or populations.
Some are longitudinal, while others are cross-sectional, and they can
cover either short or long periods of time. Some data sets use monthly
information; others use annual or biannual information. Some data sets
are nationally representative; others cover only specific populations.
I examine the costs and benefits of using these different types of data sets
and the reasons for using particular data sets given particular types of
research questions or populations. I also describe how to get the data,
where to find the codebooks that describe the variables contained in
the data, the costs involved in the use of the data, and, in some instances,
how to best use the data for analyses. Some of these data are relatively
easy to access, generally by downloading them from websites, while
others require contracts and cost a good deal of money to obtain.
I describe surveys that cover a wide range of populations and topics.
I first do this by going over some of the details, characteristics, populations, and variable types of 29 popular and informative surveys. I also
examine which of these data sets are best used for particular types of
research questions. For example, if you are interested in the study of the
effect of childhood factors on particular adult outcomes, a number of
longitudinal data sets span a great number of years and include children
who become adults over the years of the survey.
Below, I give an overview of some of the populations and topics I
cover in the book. Many of the data sets that I examine contain numerous populations and topic areas, and I include particular data sets in
several topic or population areas.
For childhood data sets, I look at several data sets in which children’s
mental, emotional, and physical health; bonds with parents; and economic well-being are examined. These include:
• The Panel Study of Income Dynamics, Child Development Supplement (CDS),
• The Project on Human Development in Chicago Neighborhoods (PHDCN),
• The National Institute of Child Health and Human Development (NICHD) Study of Early Child Care and Youth Development,
• The Fragile Families and Child Wellbeing Study (FFCWS),
• The National Survey of Child and Adolescent Well-Being,
• The National Survey of America’s Families (NSAF),
• The National Longitudinal Study of Adolescent Health (Add Health),
• The Early Childhood Longitudinal Survey (ECLS),
• The National Educational Longitudinal Survey (NELS),
• Welfare, Children and Families: A Three-City Study, and
• The Adoption and Foster Care Analysis and Reporting System.

I describe data for child populations that suffer from abuse and neglect,
such as:
• The National Child Abuse and Neglect Data System (NCANDS),
• Longitudinal Studies of Child Abuse and Neglect (LONGSCAN),
• Developmental Victimization Survey (DVS),
• Child Neglect: Cross Sector Service Paths and Outcomes (CN), and
• The Project on Human Development in Chicago Neighborhoods (PHDCN).
I also look at a number of surveys related to schooling, which include
information relating to grades completed, types of schools attended,
grades in particular subject areas, and level of education of populations:
• The National Educational Longitudinal Survey (NELS),
• School Data Direct (SDD),
• Common Core of Data (CCD),
• The Panel Study of Income Dynamics, Child Development Supplement (CDS),
• Fragile Families and Child Wellbeing Study (FFCWS),
• The Project on Human Development in Chicago Neighborhoods (PHDCN),
• The Early Childhood Longitudinal Survey (ECLS),
• The Survey of Income and Program Participation (SIPP),
• The U.S. Census, and
• Welfare, Children, and Families: A Three City Study.

I examine surveys that ask about adult and children’s health outcomes, including:
• The Panel Study of Income Dynamics, Child Development Supplement (CDS),
• Fragile Families and Child Wellbeing Study (FFCWS),
• The Health and Retirement Study,
• The Survey of Income and Program Participation (SIPP),
• The National Longitudinal Study of Adolescent Health (Add Health),
• The Early Childhood Longitudinal Survey (ECLS),
• The General Social Survey (GSS),
• The National Longitudinal Survey (NLS),
• The National Survey of America’s Families (NSAF),
• National Survey of Child and Adolescent Well-Being,
• The National Institute of Child Health and Human Development (NICHD) Study of Early Child Care and Youth Development, and
• National Medical Expenditure Survey (NMES)/Medical Expenditure Panel Survey (MEPS).
I also examine which data sets are best suited to studying those who use
government programs, such as Temporary Assistance for Needy Families
(TANF, formerly Aid to Families with Dependent Children), Supplemental Nutrition Assistance Program (SNAP, formerly the Food Stamp
Program), Supplemental Security Income (SSI), Social Security (SS),
Medicaid, Medicare, and other such programs. These data sets include:

• The Panel Study of Income Dynamics (PSID),
• The Child Development Supplement (CDS),
• Fragile Families and Child Wellbeing Study (FFCWS),
• The National Longitudinal Survey (NLS),
• Survey of Program Dynamics (SPD),
• Welfare, Children and Families: A Three-City Study,
• Early Childhood Longitudinal Survey (ECLS),
• Continuing Survey of Food Intake by Individuals (CSFII),
• National Survey of America’s Families (NSAF), and
• The Survey of Income and Program Participation (SIPP).

I also examine data sets that work best for studying those who are elderly or
going into retirement:
• The Health and Retirement Study (HRS),
• The National Longitudinal Survey of Older Men (NLS),
• The National Longitudinal Survey of Older Women (NLS),
• The Panel Study of Income Dynamics (PSID), and
• The Current Population Survey (CPS).

Other topic areas that can be examined with these data sets and will
be discussed in the book include child care, mental health, neighborhood
perceptions and characteristics, food insecurity, housing, income and
poverty, birth weight, sexual activity, sexually transmitted diseases,
physical activity, prescription and illegal drug use, dating and domestic
violence, home environment, and emotional and general well-being.


What is a Secondary Data Set?


A large secondary data set typically covers a broad sample of individuals or other entities (e.g., schools, hospitals) and is generally representative of some broader population, if not the entire U.S. population,
then for some subpopulation or region of the country. Most primary data
sets are not as comprehensive as many large secondary data sets in representing either the entire population or large segments of the population.
Some of the data sets covered in the book are nationally representative of
the U.S. population, such as the Panel Study of Income Dynamics (PSID)
(for the nonimmigrant U.S. population), the National Longitudinal
Survey (NLS), and the Survey of Income and Program Participation
(SIPP). Other samples are representative only of certain populations; for
example, the Project on Human Development in Chicago Neighborhoods
(PHDCN) is representative of all individuals in the city of Chicago; the
National Child Abuse and Neglect Data System (NCANDS) is representative of those children whose alleged victimization was reported to child
protective services; the National Educational Longitudinal Survey
(NELS) is representative of eighth graders in 1988; and the Developmental
Victimization Survey (DVS) is representative of children aged 2–17 years
old living in the contiguous United States. The data sets for the entire
population generally have smaller sample sizes for specific subgroups,
posing potential problems for researchers interested in those subgroups.
On the other hand, surveys for specific populations may have larger
samples than the more general surveys (such as the PSID), but may offer
researchers limited opportunities for comparison with other populations.
The data sets being examined in this book generally use sophisticated
sampling designs to obtain, at a reasonable cost, a sample that is both
fairly large and representative of either the broad population or the
specific population of their study. For example, the PSID, which started
in 1968, comprises two independent national samples: a cross-sectional
sample [called the Survey Research Center (SRC) sample], based on
stratified multistage selection of the civilian noninstitutional population
of the United States, and a sample of low-income families [called the
Survey of Economic Opportunities (SEO) sample]. Both of these are
probability samples (samples that use some form of random selection
within a known population). Some groups in the PSID, such as African
Americans and those living in large, urban areas, were oversampled in
order to obtain large sample sizes for these groups. Sampling weights are
then used to make the PSID representative of the U.S. population.
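To make the role of sampling weights concrete, here is a minimal Python sketch. It is an illustration only: the groups, shares, and incomes are invented (they are not PSID values), and the weight is simply each group's assumed population share divided by its share of the sample. Real surveys supply their own weight variables, which should be used instead.

```python
# Hypothetical illustration of sampling weights; all numbers are invented
# and are NOT taken from the PSID or any other survey.

def weighted_mean(values, weights):
    """Return the weighted mean of values, using one weight per observation."""
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight

# Assumed population: 10% of people are in group A and 90% are in group B.
population_share = {"A": 0.10, "B": 0.90}

# Toy sample in which group A is oversampled to make up half of the sample.
sample = (
    [{"group": "A", "income": 20_000}] * 50
    + [{"group": "B", "income": 50_000}] * 50
)
sample_share = {"A": 0.50, "B": 0.50}

# Weight = population share / sample share, so that oversampled cases count
# for less and undersampled cases count for more.
weights = [population_share[r["group"]] / sample_share[r["group"]] for r in sample]
incomes = [r["income"] for r in sample]

unweighted = sum(incomes) / len(incomes)   # overstates the influence of group A
weighted = weighted_mean(incomes, weights) # restores the population mix

print(round(unweighted), round(weighted))
```

In this toy case the unweighted mean treats the oversampled group as if it were half of the population, while the weighted mean reflects the assumed 10/90 split.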
Other data sets use similar types of methodology to make them
nationally representative, often oversampling particular groups of interest (such as welfare recipients or those who are food insecure), and providing sampling weights so that the data are representative of some
population. These types of sampling strategies are possible for data sets
such as the PSID because the PSID is funded by a large variety of government agencies, including the Office of Economic Opportunity of the U.S.
Department of Commerce, the National Science Foundation, the
National Institute on Aging, and the National Institute of Child Health
and Human Development. Several private foundations also help support
the PSID. The annual cost of running the PSID, including interviewing,
is somewhere in the area of $3.24 million (in 2009 dollars) (Duncan,
Other organizations that collect survey data use similarly complex
sampling strategies to obtain representative samples. The Fragile Families
and Child Wellbeing Study, for example, is administered through a joint
effort by the Center for Research on Child Wellbeing and the Center for
Health and Wellbeing at Princeton University, and the Columbia
Population Research Center and The National Center for Children and
Families at Columbia University. The Fragile Families sample is drawn
by randomly sampling births, from within randomly selected hospitals,
from large U.S. cities using stratified random selection. Prior to random
sampling, large U.S. cities were grouped (i.e., stratified) according to
their policy environments and labor market conditions in order to
ensure representation from a range of policy environments. The complexity of the sampling strategy demanded equally complex techniques
for weighting the data to be representative either of large U.S. cities with
populations of 200,000 or more or of the 20 sampled cities. Data were
collected by Mathematica Policy Research, a firm specializing in such
data collection. The resultant sample is distinguished by its utility for
studies of state and city policy environments, its focus on nonmarital
households, and its efforts to interview fathers, both at their children’s
birth and thereafter. The researchers were able to get information from
three-quarters of nonmarital fathers, making these data richer and more
complete than data from previous studies of single mothers.
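The general shape of a stratified, multistage design like the one just described can be sketched in a few lines of Python. This is a simplified illustration under invented assumptions: the city, hospital, and birth labels, the two strata, and the function name `stratified_multistage_sample` are all hypothetical, and the actual Fragile Families procedures and weighting are considerably more involved.

```python
import random

def stratified_multistage_sample(cities, cities_per_stratum, hospitals_per_city,
                                 births_per_hospital, seed=0):
    """Draw births by sampling cities within strata, then hospitals, then births."""
    rng = random.Random(seed)  # fixed seed so the illustration is reproducible

    # Group cities into strata (here, a made-up policy-environment label).
    strata = {}
    for city in cities:
        strata.setdefault(city["stratum"], []).append(city)

    sampled = []
    for stratum in sorted(strata):
        # Stage 1: randomly select cities within each stratum.
        for city in rng.sample(strata[stratum], cities_per_stratum):
            # Stage 2: randomly select hospitals within each sampled city.
            for hospital in rng.sample(city["hospitals"], hospitals_per_city):
                # Stage 3: randomly select births within each sampled hospital.
                for birth in rng.sample(hospital["births"], births_per_hospital):
                    sampled.append((stratum, city["name"], hospital["name"], birth))
    return sampled

# Invented sampling frame: 6 cities in 2 strata, 2 hospitals each, 10 births each.
cities = [
    {
        "name": f"city{c}",
        "stratum": "stratum 1" if c < 3 else "stratum 2",
        "hospitals": [
            {"name": f"city{c}-hosp{h}",
             "births": [f"city{c}-hosp{h}-birth{b}" for b in range(10)]}
            for h in range(2)
        ],
    }
    for c in range(6)
]

sample = stratified_multistage_sample(cities, cities_per_stratum=2,
                                      hospitals_per_city=1, births_per_hospital=3)
print(len(sample), "births sampled")
```

Because cities are drawn separately within each stratum, every stratum is guaranteed to appear in the sample, which is the point of stratifying before sampling.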
Some large data sets, such as the Survey of Income and Program
Participation (SIPP), the U.S. Census, the Public-Use Microdata Samples,
the Current Population Survey (CPS), and the National Educational
Longitudinal Survey, are conducted by the federal government. Each of
these data sets has complex survey methods and large sample sizes, and is
nationally representative (or a survey of the entire population) for either
specific populations (such as eighth graders for the NELS), or for the
entire population (such as the SIPP or the CPS).
Most of the data sets that will be examined in this book are not only
nationally or locally representative, but also cover a very broad range of
topics. For example, the PHDCN currently has three waves of data, and
it includes individual and family information, such as basic as well as
detailed demographics, family mental health, exposure to violence,
parental warmth and involvement, social support, community involvement, neighborhood structure, legal and health history, and family relationships. Because the focus of the study is on the very young, a vast
amount of information is available on infants’ temperament, physical
growth and development, cognition, and maternal pregnancy conditions. Roughly 27 instruments and scales are used in the first wave of
data collection (with as many or more instruments and scales in
subsequent waves), including the Kagan Mobile Task/Latency to Grasp,
the Infant Behavior Questionnaire, the Infant Behavior Rating, and the
Growth Assessment Form. Other instruments in the survey measure
emotional and physical home environment and child maltreatment. The
PHDCN also includes a community survey; systematic, in-person social
observations and videotape of physical, social and economic neighborhood characteristics; and neighborhood data from 343 neighborhood
clusters (where neighborhoods are determined by ecologically meaningful physical areas). Likewise, the SIPP includes a wide array of variables
that can be used to examine a wide range of topics. Generally, each variable is collected
for each family on a monthly basis over a 2½- to 4-year period. Special
modules are also included in the SIPP data so that those with particular types of disabilities can be identified, and wealth and school
financing can be examined (to give just a few examples of modules).
While most of the data sets described so far are longitudinal, many
data sets are cross-sectional, such as the CPS, administered by the Bureau
of the Census for the Bureau of Labor Statistics. In the CPS data, the
same questions are asked for each of the different months of the survey,
but new samples are drawn for each sampling period (with some overlap
of sample members from one month to the next). Other cross-sectional
data sets examined in this book include the Public Use Microdata Sample
(PUMS), the U.S. Census, and the Continuing Survey of Food Intake by
Individuals. Often, the sample sizes for these cross-sectional data sets are
larger than for the longitudinal data sets, allowing for greater precision in
estimation. A cross-sectional data set may cover a single
year or many years. Generally, when using a single cross-sectional
year of data, it is difficult to examine cause-and-effect relationships
because the cause must precede the effect (although it is sometimes possible when retrospective questions are asked). When using a data set such
as the CPS, which contains many cross-sectional years, we may be able to
examine how factors in previous years affect outcomes in subsequent years.
In this book, I refer to data sets as large, in terms of both the number
of observations and the number of discrete pieces of information about
each observation. Data collection may have occurred over relatively
short or long periods of time. These data sets generally contain hundreds or
thousands of questions.


Advantages, Disadvantages,
Feasibility, and Appropriateness
of Using Secondary Data


As noted earlier, there are some good reasons for using secondary
data, including access to large amounts of information, coverage of
a broad range of individuals or other entities (e.g., schools, hospitals),
and the facts that secondary data generally are representative of some
broader population and cover a broad range of topics. In this chapter,
I will briefly examine benefits and costs associated with the use of
such data and how these compare to the design, collection, organization, and
use of primary data. A number of questions will then be posed to help
readers determine the feasibility and appropriateness of using either
secondary or primary data.

Advantages of Secondary Data

As noted earlier, secondary data sets tend to be far less costly and take far
less time to organize (in terms of putting the data together in working
form for data analysis) relative to primary data sets. It can take a considerable amount of time to design, collect, and then organize the data in a
primary data set. Often, secondary data are available for no cost on the
Internet or through arrangements with the sponsoring organization or
government agency. Whereas 20 or 30 years ago the breadth and quality
of these data sets may have been in question, secondary data sets today
cover a broad array of topics, and the quality of these data sets, from
reputable organizations, is often high. Generally, the sample size and the
number of discrete units of data collected for each sample member are
much higher than what can be collected in a primary data set. Having
several hundred observations with a limited amount of information
from each of those observations is more the norm for primary data sets,
due to cost considerations. These limitations on primary data sets often
make it difficult to apply advanced analysis techniques. With large data
sets, researchers often can take advantage of advanced statistical techniques, such as fixed-effect modeling or hierarchical linear modeling.
Using existing data may allow for the prompt examination of current
policy issues. Because many existing data sets have been designed to
capture policy-relevant outcomes (such as income, food security, or
well-being), they have the potential to begin capturing policy effects as
soon as policy shifts. For example, welfare policy and food stamp policy
were changed in 1996, and a number of data sets (such as the National
Survey of America’s Families, Survey of Program Dynamics, and Welfare,
Children, and Families: A Three-City Study) were set up specifically to
capture immediate policy effects.
Large secondary data sets often span a great length of time, in years
or months. Some secondary longitudinal data sets, such as the Panel
Study of Income Dynamics and the National Longitudinal Survey, have
been collected for decades. This means that individuals or families can
be followed for a very long period. Thus, with this type of data, you are
able to capture intergenerational effects, factors that affect long-term
mobility, or long-term consequences of particular events.
Secondary data often come prepared for use with software (including
SAS, Stata, and SPSS) to assist in data organizing, coding, and analysis.
Thus, instead of having to code all of your variables, an often time-consuming process, you can sometimes go straight to your analyses, or
do minimal amounts of programming to get to your analyses. For example, when you download data from the PSID web page, the data come
in SAS, Stata, Excel, SPSS, Database File (DBS), or SAS transport file format.
The PSID also includes data books, for only the variables you have
chosen, that can be downloaded in PDF, XML, or HTML formats. The
Inter-university Consortium for Political and Social Research (ICPSR),
which contains a very large number of data sets, often provides users
with similar types of SAS, SPSS, and Stata files that go along with the data
to make for easier use. Some data sets come with programming code that
can be used to identify missing values for variables, which again saves
time for the end user.
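As a rough idea of what such missing-value code does, the following Python sketch recodes placeholder codes to a true missing value before computing a mean. The variable names and the codes (such as 9999999 for income) are hypothetical; actual codes differ by data set and by variable, so the codebook for the data you are using is the authority.

```python
# Hypothetical missing-value codes; real codes vary by data set and variable
# and must be taken from the codebook.
MISSING_CODES = {
    "income": {9999998, 9999999},  # e.g., "don't know" / "refused" (assumed)
    "age": {998, 999},
}

def recode_missing(records, missing_codes):
    """Return copies of the records with missing-value codes replaced by None."""
    cleaned = []
    for record in records:
        cleaned.append({
            var: (None if value in missing_codes.get(var, set()) else value)
            for var, value in record.items()
        })
    return cleaned

raw = [
    {"id": 1, "income": 42_000, "age": 35},
    {"id": 2, "income": 9999999, "age": 51},  # income not reported
    {"id": 3, "income": 18_000, "age": 999},  # age not reported
]

clean = recode_missing(raw, MISSING_CODES)

# Compute the mean only over cases with valid income.
valid_incomes = [r["income"] for r in clean if r["income"] is not None]
mean_income = sum(valid_incomes) / len(valid_incomes)
print(mean_income)
```

Skipping this step would treat the placeholder 9999999 as a real income and badly distort the mean, which is why ready-made recoding programs save both time and errors.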
Once users become familiar with one or several of the large data sets,
they often find that they can address a great variety of questions using
these data sets. Thus, while you may be interested in one set of questions
when first using the data, once you are familiar with the data, other questions
come to mind, and these questions can be answered in a fairly straightforward way. For example, in the PHDCN data, you may first be interested in how neighborhoods affect child behavioral outcomes. In working
with the data, you may find that other interesting variables are available
in the data, such as the potential effects of crime and violence, illegal drug
use, and sexual activity. You can then use these new variables as either
predictor or outcome variables.

Disadvantages to Secondary Data

While secondary data present many opportunities for researchers, there
are still good reasons for using primary data. One of the problems with
using secondary data is lack of control over the framing and wording of
survey items. This may mean that questions important to your study are
not included in the data. Also, subtleties often matter a great deal in
research, and secondary data may get to broader or related questions,
but not to the exact question being posed by your research. Thus, you
may be looking at particular definitions of concepts such as abuse,
depression, or intelligence that may differ greatly from the definitions of
such concepts in the survey data. Often, the survey may get to broader
conceptualizations, whereas you may be looking for more specific aspects
of a concept. For example, race may be limited to a few, mutually exclusive categories, whereas your study focuses on a nuanced understanding
of race, including the multiplicity of ways in which individuals understand their own racial identities. It is also possible that the questions you
desire may be asked, but of the wrong people. For example, many data
sets include information gathered from a single source, often whoever is
considered to be the head of household, about the entire household or
family. Thus, the questions may ask about how much a grandmother
takes care of the children or the amount of play time for children. While
the source of the information may have a good idea of these answers, it is
always possible that they do not and yet may feel a need to answer the
questions that are asked of them.
While sample sizes are usually larger for secondary data sources, this
may not always be the case. When researchers are examining specific
subpopulations, such as children with autism, single fathers, or cohabitating gay or lesbian families, large data sets, representative of broad or
even national populations, may have insufficient sample sizes to conduct
valid analyses for these groups. Some data sources may have only recently
started looking for or using questions to identify such groups. Thus,
it may be difficult to find an existing data source that will allow you to
study a topic like the long-term effects of autism.
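One way to act on this caution is to count the unweighted cases in each subgroup before planning any analysis. The following is a minimal sketch in Python, not code from any of the data sets discussed here: the records, the `family_type` field, and the minimum of 30 cases are all hypothetical, and the appropriate minimum depends on the analysis planned.

```python
from collections import Counter


def subgroup_sizes(records, key):
    """Count how many respondents fall into each category of `key`."""
    return Counter(r[key] for r in records)


def too_small(sizes, minimum):
    """Return the categories whose unweighted count is below `minimum`."""
    return sorted(group for group, n in sizes.items() if n < minimum)


# Hypothetical extract: one dict per respondent (made-up counts).
records = (
    [{"family_type": "married"}] * 900
    + [{"family_type": "single_mother"}] * 250
    + [{"family_type": "single_father"}] * 18
)

sizes = subgroup_sizes(records, "family_type")
small = too_small(sizes, minimum=30)  # 30 is an arbitrary illustrative cutoff

print(dict(sizes))
print("Subgroups below the cutoff:", small)
```

A check like this only flags small cells; it does not say whether a given analysis would be adequately powered.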
All large-scale data sets contain identifiers for families and individuals within the families, and they may contain household identifiers as
well. This doesn’t mean that you will be able to identify the individuals
in the data set, for example, to obtain additional information from
them outside of the data (nor should you be able to identify such
individuals, given standard research participation protections). Thus,
it is impossible to get additional or follow-up information from the
people who have participated in the survey. In contrast, the opportunity
to re-contact participants to request additional interviews or to follow up with specific questions may (or may not) be built into the collection
of primary data. With primary data, there is the possibility of obtaining
follow-up answers to additional questions; this is not the case with
secondary data. Of course, secondary longitudinal data may add questions that are of interest to the researcher in the next phase of the survey.
Depending on the size of the ongoing data-collection efforts, interested
researchers have at times advocated successfully for the inclusion of
additional questions in subsequent data collection waves.
Data sets collected in the past cannot answer questions about a just-implemented policy change. For these recent issues, one can wait for
questions relating to this topic to appear in an existing data set, or go out
and collect the data in a primary research sample.
In many ways, users of secondary data trade control over the conditions and quality of the data collection for accessibility, convenience, and
reduced costs in time, money, and inconvenience to participants. This
lack of researcher control manifests itself in several ways. For example,
you may know the questions comprising the survey, but you probably
will not know how they were asked. Were there great differences among
the interviewers? How many interviewers were there? Did respondents
understand the questions? Are there particular questions that may be
difficult for respondents to understand and that may have great variability in responses? Is there information about the response rate? If it is low,
what does this mean? Who are the sponsors of the data collection and do
they have an agenda to find particular outcomes? How has the data been
“cleaned” or, in other words, have missing data been imputed in some
way? If so, how? Is this a good method for imputation? Some of these
same questions could be asked of primary research, but, generally, primary researchers will have more control over these issues relative to
those using secondary data.
Many of the secondary data sets discussed in this volume are very
large and complex, and they may take researchers a long time to fully
understand. This may make starting to work on one of these data sets a
bit daunting. Before beginning, it is important to fully research types of
available data sets and understand which groups are included in the
survey, how the sample was constructed, and what kinds of sampling
weights to use if weighting of the data is necessary. Weighting the data in
your analyses can be a particularly difficult process. Do you use individual weights, family weights, or household weights, and for what years do
you use these weights? If you are examining a longitudinal data set, you
may wonder if you use weights from the first year or years or from the
last years.
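To see why the choice of weight matters, consider the following minimal Python sketch. It is an illustration only: the incomes and the two sets of weights are made up, and the names are hypothetical rather than variables from any data set in this book. It computes point estimates only and does not address variance estimation for complex survey designs.

```python
def weighted_mean(values, weights):
    """Weighted average of `values`, using one weight per observation."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical rows: (income, individual_weight, household_weight)
rows = [
    (20000, 1.0, 2.0),
    (40000, 1.0, 1.0),
    (60000, 2.0, 1.0),
]

incomes = [r[0] for r in rows]
unweighted = sum(incomes) / len(incomes)
by_individual = weighted_mean(incomes, [r[1] for r in rows])
by_household = weighted_mean(incomes, [r[2] for r in rows])

print(unweighted, by_individual, by_household)
```

The same three observations give three different estimates, which is why the documentation for each data set should be consulted to decide which weight matches the unit of analysis and the year being studied.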
Secondary data may subvert the research process by “driving the
question,” that is, by leading researchers to look only at questions that can be answered by the available data. Researchers need to keep in mind that this sort of approach
may be appropriate for doing exploratory work and for developing
hypotheses, but not for testing hypotheses.

This section will take readers through a list of questions that they should
ask themselves before starting a research project to determine whether
using primary or secondary data would best serve their research. It gives
some guidance as to whether using secondary data is appropriate, and
which data set to use if it is. The questions are intended to help readers
determine, first, which would be the most appropriate of a variety
of available data sets, and, second, whether the data set, once chosen,
contains key information.
1. Is the population from which the sample is drawn appropriate for
the planned research?
Researchers need to make sure that they are sampling from the
appropriate populations in order to be able to do their research.
If they fail to get an appropriate sample, they will not be able to
do the research they wish to do. For example, if you wish to
examine patients from mental health institutions, and such
patients are not included in the sampling frame, the data will not
be appropriate for such a study. Even if the sample includes
people from mental health hospitals, variables in the data must
contain information about whether the respondent was in a
mental health hospital. Often, sampling frames (or the
population surveyed) do not include those in institutional
settings or the military.
2. Is the dependent variable contained in the data?
This is one of the most important questions to ask. If the answer
is “no,” the question then becomes: Is there a variable that is
conceptually comparable to the variable that you would like to
use? If the answer to this is also “no,” then you will not use these
data. However, if the answer is yes, then you may want to
consider these data. You will need to determine how conceptually
far away this variable is from the variable you would truly like to
use, and the consequences of this. For example, suppose you are using
longitudinal data and hypothesize that childhood health affects
the type of neighborhood the individual will live in as an adult.
The longitudinal data may have information on average income
in the adult neighborhood or the respondent’s perception of the
adult neighborhood (Is it safe? Is it poor? Is it really poor?). Your
preferred variable may be the percentage of the population living
below the poverty line in the adult neighborhood. Are these
measures close enough for you to use these data for your work?
3. Are the necessary independent variables of interest available?
This is sometimes a trickier question than the dependent variable
question because there are oftentimes many variables that you
will need as both primary independent and control variables. You
first need to develop (either formally or informally) the
conceptual model at the basis of the issue you are analyzing,
which should suggest which are the main independent variables
you hypothesize will influence your dependent variable, as well as
the other variables that might be related to that dependent
variable and should be included as control variables. Once you’ve
done this, you could then check the literature to see if there are
alternative hypotheses that suggest additional variables to
include. Equally important is the specificity of the measurement.
For example, if you are examining the effects of childhood mental
health problems as a primary independent variable, you must ask
yourself whether having this as a yes/no question is sufficient for
testing your hypothesis. Do you instead need the degree of
impairment or specific type of mental health problem to better
examine the effects of mental health on some outcome? Who is
asked the question? Who has determined that the child has
such a diagnosis: a physician or specialist, or a parent or other
The same kinds of questions can be asked about other control
variables or variables that examine alternative hypotheses. First,
are variables available to test alternative hypotheses? Is it truly a
mental health problem that is affecting the outcome, or is it some
other variable that others have found to be important and that
you may need to control? If the data does not include variables
for these alternative hypotheses, you may get biased results.
4. If replicating a study, how do these data differ from those used
previously, and will this make a difference in running and
interpreting analyses?
It may make little difference that the variables available in your
data differ somewhat from the variables in the original study,
if you are concerned not with completely replicating the
previous work, but only with finding whether results differ when
variables are specified in a slightly different way. If you are mainly
concerned with using a different set of data to see if the results
from a previous study hold, however, then finding a data set with
the same variables will be important.
5. Does the available data have adequate identifiers for the target
groups for analysis (women with Alzheimer’s, adolescents with
eating disorders, children of gay couples)?
Without such identifiers, it will be impossible to conduct the
planned analyses for your study. If these identifiers are available,
you must then determine if the sample sizes for sub-groups are
large enough to run analyses. It’s often difficult to determine
what sample size is “big enough.” Obviously, having more
observations per included variable will help you find relationships
when they exist in the population. If you are running
nonparametric analyses, this need for larger sample sizes becomes
greater in order to adequately test your hypotheses.
6. Is it important to be able to generalize results to the
general population (of the United States, for example),
to specific populations (such as the elderly), or to a far
lower-level population (such as clients of a particular clinic)?
If you need to generalize to a more broadly defined population,
then using a secondary data set will likely be the way to go. If you
need only to generalize to a lower-level group, then using primary
data may be a more feasible and appropriate option. It is possible
that data have already been collected from some lower-level
group, such as a clinic, but this isn’t likely. Even if the data have
been collected, it’s unlikely that such data will contain the kinds
of information that you need, so collecting data will probably be
the best way to examine your hypotheses.
7. Does the data set of interest require special authorization to access?
Some data sets are available only with special contracts because of
the confidential nature of the data and risk that its specificity may
compromise individuals’ identities. Often, access requires
the researcher and his or her institution to sign contracts, with
fees attached, for use of the data. You will need to determine
whether your institution will be willing to incur the risks of your
use of such data. Generally, institutions with such data, such as
the federal government and universities, have become more
stringent in their requirements for granting data access. Doctoral
students often cannot receive such data without having the
doctoral chair or some other high-ranking person or fiduciary
agent accept the data for the doctoral student. Such data must be
kept off-line, and it cannot be kept on networks that back up the
data. Passwords for accessing the data and data encryption are
often necessary as conditions for receiving the data. Often,
contracts for specific periods of time must be signed, and the data
must be destroyed after the contract expires. Generally, the data
must be used only at the institution, not at home. The use of such
data sets sometimes costs money, as well. For example, for using
the restricted version of the PSID, $750 must be paid to the
Survey Research Center at the University of Michigan for each
project that uses the data. Often, these data sets will contain a
public-use component that does not include the potentially
identifying information, which is available to researchers without
special permission.
8. Do you, or someone you can hire, have the programming skills to
use the data?
Some data sets are quite complicated and require advanced skills
in programming in SAS, Stata, or SPSS. If you lack these skills,
you will not be able to use some of the data sets described in this
book, or you may need to hire a programmer to work with you.
On the other hand, many data sets are far less complicated than
they were only several years ago. For example, many data sets
available on the web include SAS, Stata, and SPSS programs that
allow you to input and format the data without having to write
your own code. This doesn’t mean that you don’t have code to
write. Eliminating missing values, combining and constructing
variables, and extracting data for particular years or
subpopulations will require some degree of programming skill.
In my own work, much of my day is spent programming, in SAS
or Stata, to get variables in the proper working condition. Once
the data have been “cleaned” (i.e., missing values imputed or
deleted and variables put in a usable format), the more difficult
programming begins. For example, you may want to look at
longitudinal data for children when they are between the ages of
0 and 4, examining their income, child care, health care, or other
variables, and different children may be ages 0 to 4 in different
waves of the data. To do this, you will need to use data loops and
arrays, with which you should be familiar before starting such a
project. Obviously, different software programs have different
programming languages (SAS, SPSS, Stata), and it will be helpful
to know how to program in at least one of the programming languages.
9. How quickly do you need results?
If you are examining a new policy and need to determine if it is
helping or hurting, secondary data, if it’s available, may aid in
getting quicker results. Of course, the speed with which you can
generate an analysis depends on how well put together the
secondary data are and on your programming skills, as noted
above. The use of secondary data can save time and resources for
the researcher. Even more important, however, using existing
data bypasses the need to ask for time and a certain degree of
trust from new research participants who, by the purview of the
social work profession, are often among the most vulnerable.
If the data are not available, and quick results are needed,
collecting primary data may be the best way to examine new policies.
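The loop-and-array task described under question 8 (children who are ages 0 to 4 in different waves of a longitudinal survey) can be sketched as follows. This is a hedged illustration in Python rather than SAS or Stata; the wave years, the record layout, and the function name are hypothetical, and age is approximated as survey year minus birth year.

```python
WAVE_YEARS = [1997, 1999, 2001, 2003]  # hypothetical survey waves

# Hypothetical wide-format records: one dict per child, with one
# family-income value per wave (None where the child was not observed).
children = [
    {"id": 1, "birth_year": 1996, "income": [21000, 23000, 25000, 27000]},
    {"id": 2, "birth_year": 2000, "income": [None, None, 30000, 32000]},
]


def mean_income_ages_0_to_4(child, wave_years=WAVE_YEARS):
    """Average income over the waves in which the child was aged 0 through 4."""
    values = []
    for year, income in zip(wave_years, child["income"]):
        age = year - child["birth_year"]  # approximate age in that wave
        if 0 <= age <= 4 and income is not None:
            values.append(income)
    return sum(values) / len(values) if values else None


results = {c["id"]: mean_income_ages_0_to_4(c) for c in children}
print(results)
```

The first child contributes the 1997 and 1999 waves and the second contributes the 2001 and 2003 waves, so the same age window is drawn from different waves for different children.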


Secondary Datasets

In this chapter, I will describe several social work, social science, and
related datasets, along with where and how to access them and their key
characteristics. Of course, there are thousands of secondary datasets
available in myriad places, and I will cover only a portion of the largest
and most widely used datasets. I will indicate where to find these data,
and, in some cases, the types of analyses that one can undertake using
them. I will also indicate whether datasets have public-use versions,
which generally strip the data of geographic identifiers or other potentially identifying information, and whether a restricted version that includes these identifiers is also available. The datasets will be described by
various features, including:
1. Cross-sectional or longitudinal design
2. Years covered
3. Unit of analysis
4. Sample size
5. Population(s) covered
6. Basic categories of information covered
I also summarize much of this information in Appendix Tables 1 and 2.
One of the best places to find many of the datasets that will be
described here is the Inter-University Consortium for Political and Social
Research (ICPSR), the largest archive of social science datasets (http://). There are not only original datasets in this collection, but also datasets that have been assembled by other researchers.
For example, I recently needed the Current Population Survey (CPS)
from the 1960s to the present. The CPS website (
cps/) contains data only from recent years. I went to the ICPSR and
found many of the supplements to the CPS (such as the Veterans
Supplement, the Food Security Supplement, and the Tobacco Use
Supplement), but also found the March, individual-level extracts from
1968 to 1992, which were put together by Robert Moffitt (Study Number
Thus, instead of having to put together each individual year of the CPS,
the ICPSR contained merged data that were ready to use. In looking for
the General Social Survey, I was able to find information about the
Japanese General Social Survey, the Chinese General Social Survey, and
the Polish General Social Survey, among many others. Many of these
datasets are ready for downloading, and many have been cleaned (e.g.,
putting in missing value codes for missing data, assembling data by
person, family, or household) by those putting them together.
DataFerrett from the U.S. Census Bureau, located at http://, is another key source for obtaining data. The
data available from this website include the American Community
Survey; the American Housing Survey; the Consumer Expenditure
Survey; County Business Patterns; the Current Population Survey; the
Decennial Census of the Population and Housing; Decennial Public
Use Microdata Samples; the Mortality Sample; the National Ambulatory
Medical Care Survey; the National Health Interview Survey; the National
Hospital Ambulatory Medical Care Survey; the National Survey of
Fishing, Hunting, and Wildlife Associated Recreation; the New York City
Housing and Vacancy Survey; Small Area Health Insurance Estimates;
Small Area Income and Poverty Estimates; the Social Security
Administration Survey of Income and Program Participation; and the
Survey of Program Dynamics. To get the DataFerrett program, go to the
web page given above, then click on “Launch DataFerrett.” You will
be asked for your e-mail address, which you will use when you log into
the system. Once you are in DataFerrett, you can choose any of the above
datasets and download the data in a variety of formats, create tables,
or create tables for downloading.

One more excellent site that I will often refer to in the book is the
Pennsylvania State Simple Online Data Archive for Population Studies
(SODA POP). There are hundreds of secondary datasets available at
this site (, many of them
accessible to those outside of the Penn State system (see http://sodapop. for gaining
access to the datasets on this site). A truly nice aspect of this site is that
you can search all of the datasets by keyword (see http://sodapop.pop. Thus, if
you are interested in mental health, you can type this phrase into their
search engine and it will show you all the variables and datasets that
contain that phrase (variables, labels, or anywhere else in the description
of the data/variables).
For educational datasets, an excellent archive is the International
Archive of Educational Data (
html). Here, you will find datasets and online tools to examine a wide
range of educational surveys.
Large secondary datasets not covered in this book are numerous, and
I will mention a few here that may be of interest to social work and other
social science researchers. One is the COMBINE study, which examines
treatment options for alcoholism; it included 1,383 alcohol-abstinent
volunteers and ran from 2001 to 2004 (Anton, 2006). Second, the
National Institute of Justice (NIJ) Data Resources Program is a repository of datasets collected through NIJ-funded grants. These datasets are
archived in the National Archive of Criminal Justice Data within the ICPSR.
A few of these datasets include the National Evaluation of the National
Institute of Justice Grants to Combat Violent Crimes Against Women on
Campus Program, 2000–2002; Arrestee Drug Abuse Monitoring (ADAM)
Program in the United States, 2003; The National Crime Victimization
Survey, through 2008; and the National Crime Victimization Survey:
School Crime Supplement, 2007. Also, see Boslaugh (2007) for more
information on health-related datasets.
Most of the rest of this book goes into greater detail about all of the
secondary datasets previously mentioned. For some of the datasets, I go
into where and how to access the data and, for some datasets, I show you
screen shots for using the data. I find that the screen shots are often helpful in seeing what the datasets look like to help users access or use the
data. Other datasets, in which less information is given on accessing or
using the data, are fairly straightforward (download the data and start
working), or are so complicated that they would take up too much space
to present such information here. When codebooks or descriptions of
the data are available online, I indicate where you can find those descriptions or codebooks. I sometimes give brief SAS code for using some of
the data as well, to give readers the feel for what to do when accessing and
using some of these data. I will often indicate which datasets are best for
particular types of research—such as the study of children, health, education, poverty, intergenerational studies, etc.—often by presenting what
kinds of studies have come from these datasets. For some datasets,
response rates are easily available, and I indicate these response rates
when discussing the dataset, while for others, response rates could not be
found. I present the datasets in alphabetical order.

The Adoption and Foster Care Analysis and Reporting System (AFCARS)
collects case-specific data on all finalized adoptions and foster care placements facilitated through state welfare agencies, or private organizations
contracted by public welfare agencies, in all 50 states, the District of
Columbia, and Puerto Rico. The children in the dataset range in age
from under 1 year through 20 years. Data is available from 1995 through
2005. Data collection became more efficient in 1998, meaning better
data was collected for more states. States submit data twice over the
course of a year that begins on October 1 and ends the following year on
September 30. The original purpose of AFCARS was to gather information for state and federal level policy- and program-management uses
and to research the nature of adoption and foster care. The information
is cross-sectional.
States are federally mandated to provide this data on an annual basis
under Title IV-B/E of the Social Security Act (Section 427). The dataset
includes all cases of foster care and adoption that occur through public
welfare agencies. The dataset represents only those types of arrangements. Agencies securing adoptions outside of state agencies provide
information voluntarily, and this information is not included in the
publicly available dataset.

The number of states reporting and number of observations vary
across years. In 2005, 52 states/districts reported data for both adoptions
and foster care, totaling 51,485 and 801,200 observations, respectively.
Efforts are made to remove duplicated adoption observations, but
beware of such duplication. The foster care data include only one observation per child; for those children with multiple observations, only the
most recent is included.
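The "most recent observation per child" rule, and a check for duplicated records, can be sketched in a few lines. This is a hedged Python illustration, not AFCARS code: the field names `child_id`, `report_date`, and `setting` are hypothetical and are not actual AFCARS variable names, and the records are made up.

```python
def latest_per_child(observations):
    """Keep one observation per child: the one with the latest report date."""
    latest = {}
    for obs in observations:
        current = latest.get(obs["child_id"])
        # ISO-formatted dates (YYYY-MM-DD) sort chronologically as strings.
        if current is None or obs["report_date"] > current["report_date"]:
            latest[obs["child_id"]] = obs
    return list(latest.values())


# Hypothetical records; child "A" appears twice.
observations = [
    {"child_id": "A", "report_date": "2004-03-31", "setting": "group home"},
    {"child_id": "A", "report_date": "2004-09-30", "setting": "relative foster home"},
    {"child_id": "B", "report_date": "2004-09-30", "setting": "pre-adoptive home"},
]

kept = latest_per_child(observations)
n_duplicates = len(observations) - len(kept)
print(len(kept), "children;", n_duplicates, "duplicate record(s) dropped")
```

Comparing the number of records before and after such a step is a quick way to see how much duplication a file contains.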
The dataset provides separate information on a variety of variables
for children who are adopted and those who are in foster care. The 37
adoption variables include child demographics, the presence of disabilities (e.g., physical, mental retardation, visual/hearing impairment,
emotional disturbance), birth parents’ dates of birth and dates of termination of rights, date of adoption, adoptive family structure, adoptive
parents’ demographics and any pre-adoptive relationship to the child,
and information about the agency placing the child, as well as whether
the adoptive family is receiving Title IV-E support and if so, the amount.
The 66 foster care variables include child demographics, the presence
of disabilities, information about previous removals from home and
adoptions, manner of removal (e.g., voluntary or court ordered), reason
for removal (e.g., abuse/neglect, alcohol/drug abuse by parent or child,
child disability or behavior problems, parent death or incarceration,
abandonment), placement setting (e.g., pre-adoptive home, relative or
nonrelative foster home, group home), case plan goal (e.g., reunification
with parent, live with relatives, adoption, long-term foster care, emancipation, guardianship), information about the principal caretaker and foster
caretaker as well as the family structure of each, reasons for discharge from
foster care (e.g., reunification, living with other relatives, adoption, emancipation), and use of state support. Foster care variables include whether
AFDC/TANF, SSI, and SS payments supported relative caretakers.
Researchers using AFCARS data can study case-level data at a state or
national level and examine descriptive data about children and their
birth, adoptive and foster parents; placement types, lengths and goals;
and the amount and impact of social assistance. Researchers also can
examine trends and predictors of a variety of foster care and adoption
outcomes. The dataset allows for policy analysis, given that one can study
and compare data across states. Each annual dataset includes state-specific notes that inform this type of analysis.



Annually, the U.S. Department of Health and Human Services
publishes The AFCARS Report, outlining descriptive information about
children in foster care, waiting for adoption and adopted. This report is
available through the Children’s
Bureau web page. Other researchers recently have used AFCARS to study
trends in kinship care (Vericker, Macomber & Geen, 2008), the factors
that influence foster care discharge rates after termination of parental
rights (Smith, 2003), predictors of reunification for foster children with
an incarcerated parent (Hayward & DePanfilis, 2007), and the relationship between child developmental and medical conditions and out-of-home placement (Rosenberg & Robinson, 2004). Another researcher used
AFCARS to examine the impact of subsidies on adoption (Hansen, 2007).
These data are available from the National Data Archive of Child Abuse
and Neglect (NDACAN). The codebook
for the 1995 to 1999 versions of the data is available at http://www.

Child Neglect: Cross Sector Service Path and Outcomes uses administrative
and census data to gather longitudinal, child-level information about
cross-sector service use and patterns, caregiver characteristics, and later
child and adolescent outcomes. The sample comprises those receiving
AFDC or TANF, with approximately half the sample including children
with reports to child welfare for maltreatment and half without such
reports. All children in the sample were born between 1982 and 1994,
were under 12 years old when data collection began, and lived with a
family that received AFDC. Data collection began in 1993 to 1994 and
followed the children through 2001. Samples were drawn and data collected for a Midwestern metropolitan area. It is unclear whether the data
are representative of children in this region or of any other groups.
Child Neglect uses a matched, two-group comparison design. The
researchers first created a sample of children who had been reported to
child welfare services for maltreatment in 1993 or 1994, and then
matched these children to AFDC records. These children constituted the
maltreatment/AFDC group. Maltreatment categories for this group
included neglect, physical abuse, sexual abuse, or mixed types; emotional
abuse and other forms of maltreatment were excluded. The comparison
group, randomly selected from the remaining children in the AFDC files
and matched by birth year and city or county of residence, includes children whose families were not reported to child welfare for maltreatment
in 1993 or 1994. Each group contains only one child per family, and the
AFDC-only group excludes those children who have a sibling, or who
live in a home with other children, with a report of maltreatment. The
total sample includes 10,187 children with 5,087 in the maltreatment/
AFDC group and 5,100 in the AFDC-only comparison group. Four age
cohorts were created in each sample and followed for the eight years of
data collection. This accelerated panel design allows for statistical analysis of 19 years’ worth of developmental information, not just eight.
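The arithmetic behind the accelerated panel design can be checked with a short sketch. It is an illustration under stated assumptions, not the study's own procedure: it uses the birth years (1982 to 1994) and observation window (1993 to 2001) given above, treats each birth year separately rather than grouping children into the four cohorts, and approximates age as calendar year minus birth year.

```python
def ages_covered(birth_years, first_year, last_year):
    """Return the (youngest, oldest) approximate ages observed across cohorts."""
    ages = set()
    for born in birth_years:
        start = max(first_year, born)  # a child cannot be observed before birth
        for year in range(start, last_year + 1):
            ages.add(year - born)
    return min(ages), max(ages)


# Birth years 1982 through 1994, followed from 1993 through 2001.
span = ages_covered(range(1982, 1995), first_year=1993, last_year=2001)
print(span)
```

Under these assumptions the youngest children are observed from birth and the oldest reach age 19 by 2001, so the pooled cohorts span about 19 years of development even though no single child is followed for more than about eight years.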
Child Neglect collects child-level data through administrative records
from education, health, juvenile and adult corrections, and social service
organizations. Some adult-level data are available on some variables.
Child abuse and neglect information includes age at time of the report,
reason for the report, relationship with the perpetrator, substantiation
status, type and severity of maltreatment, and occupation of the reporter.
Child welfare services data include information about foster care services
(e.g., age at entry, reasons for placement and exit, and type, frequency,
and length of placements) and in-home services, such as intensive family
preservation services and less intensive family services. Income-maintenance variables include information about spells on AFDC or
TANF (starting in 1997) and reasons for receipt. Data is gathered on the
admission date and offense type for those who spent time in juvenile
justice facilities. Adult corrections data provide information about incarceration in state facilities (not local or county jails), including admission
year, sentence length and offense (including property and financial
crimes or possession/selling of drugs). Medicaid billing information
includes data about inpatient, outpatient, and hospital care and problems at birth that may impact development later (collected from both the
mother’s and the child’s records). Educational data include disability
type and date of testing. Community-level data include 1990 residential
census tract data (e.g., population total, income, race, education, unemployment and mobility) and crime information.



Researchers can use Child Neglect: Cross Sector Service Path and
Outcomes to examine service utilization among children experiencing
abuse and neglect, the association between service utilization and later
outcomes, and the relationship between TANF use and child and adolescent outcomes. These results can be compared with those for children receiving
AFDC or TANF but without maltreatment reports. Researchers have
explored the relationship between criminal justice system involvement
and welfare (Jonson-Reid, 2002) and maltreatment (Bright, Jonson-Reid
& Williams, 2008). Research has also examined the association between
maltreatment and special education eligibility (Jonson-Reid, Drake,
Kim, Porterfield, & Han, 2004) and risk of death (Jonson-Reid, Chance,
& Drake, 2007). Researchers also have used the dataset to explore possibilities for improved technology to map referrals and services (Hovmand,
Jonson-Reid, & Drake, 2007).
These data are available through the National Data Archive of Child
Abuse and Neglect, or see http://www.
html. The codebook for the data is available at http://www.ndacan.

Common Core of Data, a database of public elementary and secondary
schools of the U.S. Department of Education’s National Center for
Education Statistics, annually collects fiscal and nonfiscal data about
students, staff, and characteristics of public schools; public school districts; and state education agencies in the United States. The dataset provides an official listing of elementary and secondary schools and school
districts nationally and provides basic information and statistics on
schools in general. Unlike many datasets, the CCD actually covers an
entire national population (that of public elementary and secondary
schools/districts) rather than just a sample that is representative of that
population. Often, the CCD is used as a sample frame from which
random samples of schools or districts are selected as part of the data
collection process in other surveys.
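To make the idea of a sample frame concrete, here is a minimal Python sketch of drawing a simple random sample of schools from a universe listing. The frame, the school identifiers, and the sample size are all hypothetical stand-ins, not CCD records, and real surveys usually use more elaborate stratified or clustered designs.

```python
import random

# Hypothetical sampling frame: one record per school, standing in for a
# universe listing such as the CCD Public School Universe file.
frame = [{"school_id": f"S{i:05d}", "state": "PA" if i % 2 else "NY"}
         for i in range(1, 1001)]

def draw_simple_random_sample(frame, n, seed=2010):
    """Draw n schools without replacement; each school is equally likely."""
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced
    return rng.sample(frame, n)

sample = draw_simple_random_sample(frame, 50)
sampling_fraction = len(sample) / len(frame)
print(len(sample), sampling_fraction)
```

Because the frame lists every unit in the population, the sampling fraction (here 50 of 1,000 schools) is known exactly, which is what allows sample results to be weighted back up to the population.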
The data are collected each year from a population of approximately
97,000 public elementary and secondary schools and approximately
18,000 public school districts. Data come from the 50 states, the District
of Columbia, Department of Defense schools, the Bureau of Indian
Affairs, and outlying areas, such as Puerto Rico, Guam, American Samoa,
and the U.S. Virgin Islands. The data is collected using five surveys sent
out to the state education departments and completed by agency officials. Most of the data are acquired through administrative records. The
data for schools and districts are meant to be comparable across states.
The CCD comprises five datasets: Public School Universe, Local
Education Agency (School District) Universe, state aggregate nonfiscal
data, state aggregate fiscal data, and school district fiscal data. The Public
School Universe includes information on the location and type of school,
enrollment by grade, student characteristics, and number of teachers.
The Local Education Agency (School District) Universe has information
on the number of current students and the number of high school graduates. The state aggregate nonfiscal dataset has information on students
and staff, such as the number of students per grade level and high school
graduates and completers. Both the state and school district aggregate
fiscal data include revenue and expenditures by function and average
daily attendance and enrollment, respectively.
The CCD includes variables pertaining to dropouts; the receipt of
diplomas and GEDs; guidance counselors and institutional aides; library
and library/media support; Individualized Education Programs for students with disabilities; alternative education schools, charter schools,
magnet schools and programs; kindergarten and pre-kindergarten;
educational agencies (state, federal and other); migrant students; shared-time schools; supervisory unions; and Title I eligible schools. The CCD
also includes information about schools’ participation in the Free Lunch
Program, Reduced-Price Lunch Program, and Head Start Program, and
geographical information, such as community size and whether the
school is located in a metropolitan or micropolitan statistical area.
The data are available through the National Center for Education
Statistics. Researchers can build tables, search
for public schools, and compare the data across states. Data on private
schools also can be accessed through the NCES Private School Universe
Survey.
The data can be assembled by state, county, school district, or
school (currently for 1987–1988 through 2006–2007), as shown on the
following page.




From here, you can examine a limited number of variables by grade
level. You can also examine the most requested tables, such as grade
completers by race and pupil-to-teacher ratios.
A second way to access the data on this site is by downloading entire
datasets, either as flat files with SAS or SPSS code or as a SAS dataset. Data documentation also is available for download (
versions.asp). The download page for the local education agency (school district) survey looks like the figure on the following page.
From here, click on the 2006–07 SAS zip file, and agree to the terms
of use. The file downloads in SAS format (instead of a flat file that would
require either SAS or SPSS code to transform the data into SAS or SPSS).
For this example, the data were put into the c:\SAS directory. Next, go into SAS,
and write the code:
libname in 'c:\SAS';            /* folder holding the downloaded SAS dataset */
data a;
  set in.ag061c;                /* the downloaded 2006-07 file */
run;
proc means data=a n mean std;   /* N, mean, and standard deviation */
run;

You then get the result as shown in Table 4.1.
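If you want to check what this kind of summary reports, the following Python sketch computes the same three statistics (N, mean, and standard deviation) for each variable, skipping missing values. The variable names and values are invented for illustration and are not CCD figures; the sketch assumes the standard deviation uses the n - 1 denominator.

```python
import statistics

# Toy records (NOT actual CCD values); None marks a missing value.
rows = [
    {"membership": 1200, "teachers_total": 80.0},
    {"membership": 450, "teachers_total": None},
    {"membership": 3000, "teachers_total": 210.5},
]

def means_table(rows):
    """Return {variable: (n, mean, std)} using nonmissing values only."""
    table = {}
    for var in rows[0]:
        values = [r[var] for r in rows if r[var] is not None]
        n = len(values)
        mean = statistics.mean(values)
        # Sample standard deviation (n - 1 denominator); undefined for n < 2.
        std = statistics.stdev(values) if n > 1 else float("nan")
        table[var] = (n, mean, std)
    return table

summary = means_table(rows)
for var, (n, mean, std) in summary.items():
    print(f"{var:15s} N={n} Mean={mean:.2f} StdDev={std:.2f}")
```

Note that N can differ from variable to variable, because each statistic is computed only over the records that report a value for that variable.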
A second dataset, the State Dropout and Completion Data
File: 2005–06, was again downloaded in SAS format, and the
sample sizes and mean values shown in Table 4.2 were obtained for a
variety of variables.
Similar downloads are available for Census 2000 School District
Demographics, Local Education Agency (school district) Universe
Survey Dropout and Completion Data, National Public Education
Financial Survey (State Fiscal), Public Elementary/Secondary School
Universe Survey, State Nonfiscal Public Elementary/Secondary Education
Survey, State-Level Public School Dropout Data, and Survey of Local
Government Finances, School Systems. Codebooks for many of these
datasets, as well as data downloads, are available from The Pennsylvania
State University.



Table 4.1 The SAS System: The MEANS Procedure
Columns: Label, N, Mean, Std Dev. Only the variable labels are legible:
Number of
Associated with
Aggregate FTE
Associated with
PK - 12 Students
Calculated Total Membership of the LEA
Migrant Students Served in a Summer Program
Special Education - Individualized Program (IEP)
English Language Learner Students
Teachers Prekindergarten
Teachers Kindergarten
Teachers Elementary
Teachers Secondary
Teachers Ungraded
Teachers - Total
Aides /
Coordinators and
Counselors Elementary
Counselors Secondary
Counselors Total
Librarians / Media Specialists
Librarians / Media Support
Support Staff
Support Staff
Student Support Services Staff
All Other Support

Table 4.2 Sample sizes and mean values, State Dropout and Completion Data File: 2005–06 (variable labels only)
Dropout Rate (Grades 9 through 12)
Dropout Rate (Grade 9)
Dropout Rate (Grade 10)
Dropout Rate (Grade 11)
Dropout Rate (Grade 12)
Dropout Rate (American Indian / Alaskan Native, Grades 9 through 12)
Dropout Rate (Asian / Pacific Islander, Grades 9 through 12)
Dropout Rate (Hispanic, Grades 9 through 12)
Dropout Rate (Black, non-Hispanic, Grades 9 through 12)
Dropout Rate (White, non-Hispanic, Grades 9 through 12)
Dropout Rate (Male, Grades 9 through 12)
Dropout Rate (Female, Grades 9 through 12)
Dropout Rate (Gender Unknown, Grades 9 through 12)
Dropout Rate Enrollment Base (Grades 9 through 12)
Dropout Rate Enrollment Base (Grade 9)
Dropout Rate Enrollment Base (American Indian / Alaskan Native, Grade 9, Male)
Dropout Rate Enrollment Base (American Indian / Alaskan Native, Grade 9, Female)
Dropout Rate Enrollment Base (American Indian / Alaskan Native, Grade 9, Gender
Dropout Rate Enrollment Base (Asian / Pacific Islander, Grade 9, Male)

The Continuing Survey of Food Intake by Individuals (CSFII), conducted by the U.S. Department of Agriculture, aims to study the
food-consumption patterns of the people of the United States, with a
particular emphasis on the effects of nutritional policies, exposure
to pesticides, and the influence of diet on various health problems.
The original study was conducted in 1985 and 1986, with continuations
and changes in 1989–1991, 1994–1996, and 1998. The samples are
nationally representative.
The 1985 and 1986 data included both an all-income sample and a
low-income sample (see
docid=7889). The all-income sample included 1,500 women aged 19 to
50 and their children, aged 1 to 5 (with around 550 children in each
sample year), and a sample of 1,100 men, aged 19 to 50, in the 1985
sample only. The low-income sample had 2,100 and 1,300 women and
1,300 and 800 children in 1985 and 1986, respectively. Food intake was
collected for six days, over a one-year period, for each year.
The 1989–1991 study was conducted as three separate one-year surveys with data collected over three consecutive days, with an aim to provide information on the usual intake of foods and nutrients. The first day
was an in-home interview in which respondents recalled the food consumed the previous day. The second and third days’ data were collected
through a self-administered dietary recall. Six weeks after the three-day
data collection period, respondents completed the Diet and Health
Knowledge Survey. Sample sizes are relatively large, with 15,192 individuals for the one-day dietary intake, and 11,912 for the three-day
dietary intake.
The 1994–1996 continuation, which again aimed to find the “usual
intake” of food per individual, collected one-day dietary intake data from
16,103 persons of all ages, 4,253 of whom were children aged 0 to 9 years
old. Respondents provided details about their food intake in the past
24 hours, reporting on the specific types of food consumed and the
amount of each food consumed. To improve efficiency and minimize
error, a computerized coding system (Survey Net) converted food intake
into its component foods and nutrients.
The 1998 continuation, called the Supplemental Children’s Survey,
CSFII 1998, collected data from a larger sample of children (N=5,559) to
estimate the exposure to pesticide residues in the diets of children. CSFII
1998 data can be combined with 1994–1996 data. The Department of
Agriculture, through the Food Quality Protection Act of 1996, required
the collection of this data.
The CSFII contains variables relating to health and diet such as nutrient, vitamin, and supplement intake; dietary fiber; niacin and calcium
equivalents; folate content; source frequency of food and beverage intake;
household food; identity of the main food preparer; breast-fed children;
school and employment of household members over 15 years of age;
family income, both absolute and as a percentage of the poverty level;
and region and size of the area (metropolitan vs. nonmetropolitan area)
of the respondent.




The CSFII data have been used to study many aspects of food intake.
Dietary studies involving CSFII data have looked at the relationship
between healthy diets and family income (Beydoun, Powell, & Wang,
2009), how demographics affect diet quality (Forshee & Storey, 2006),
and the correlation between self-assessed health status and diet intake
(Goodwin, Knol, Eddy, & Fitzhugh, 2006). The topic of food intake also
has been studied for specific ages, from preschool (Kranz, Hartman,
Siega-Riz, & Herring, 2006), to school-aged children (Suitor & Gleason,
2002) to the elderly (Sebastian, Cleveland, Goldman, & Moshfegh, 2007).
Other research has studied the relationship between gender and ethnicity
and diet (Beydoun & Wang, 2008); the links among nutrition, food security, and obesity (Beebout, 2006); the relationship between school meal
participation and nutrient intake (Gleason & Suitor, 2003); and calcium
requirements (Hunt & Johnson, 2007).
The 1989–1991 CSFII data are available from the National Technical
Information Service. The 1994–1996 and 1998
data and documentation are available at The Pennsylvania State
University through the Simple Online Data Archive for Population Studies,
and from the
Department of Agriculture (
Place/12355000/pdf/Csfii98.pdf) on CD-ROM.

The Current Population Survey has been conducted as a joint effort
between the Bureau of Labor Statistics and the Bureau of Census since
the 1940s. The purpose of the CPS is to provide information on characteristics of the labor force in the United States. It is often used to evaluate
the number and percentage of unemployed and to measure the potential
labor force. It also is used to calculate and analyze wage rates, hours of
work, and earning trends for different demographic groups. The study is
nationally representative of the civilian noninstitutionalized population.
The data can be used to look at the country as a whole and sometimes can
be used to examine states or other geographic areas, depending on the
sample size of those states.
The CPS collects cross-sectional monthly data from around 50,000 to
100,000 households. Each member of the household aged 16 and above
is interviewed. Data is collected by either in-person or telephone interviews. General labor force information is collected each month, and
other data on specialized topics are gathered through periodic additional
supplements. Each household is a participant in the survey for eight
survey periods. When it first enters the survey, it is involved in four consecutive monthly surveys, then it is absent from the next eight, and then
it partakes again in another four months.
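The rotation pattern just described (four months in, eight months out, four months in) can be sketched in Python as follows. The function names and the convention of counting months from 0 at a household's entry are assumptions made for illustration; they are not CPS variable names.

```python
def in_sample_months(entry_month):
    """Months (counted from the entry month) in which a household is
    interviewed: 4 months in, 8 months out, 4 months in."""
    first_spell = [entry_month + k for k in range(4)]
    second_spell = [entry_month + 12 + k for k in range(4)]
    return first_spell + second_spell

def month_in_sample(entry_month, survey_month):
    """Return the interview number (1-8), or None if the household is
    not interviewed in survey_month."""
    months = in_sample_months(entry_month)
    if survey_month in months:
        return months.index(survey_month) + 1
    return None

schedule = in_sample_months(0)
print(schedule)  # [0, 1, 2, 3, 12, 13, 14, 15]
```

One consequence of this design is that a household interviewed in a given calendar month is interviewed again in the same calendar month one year later, which makes year-over-year comparisons for the same households possible.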
Outside of employment numbers, the CPS provides data on demographics, displaced workers, computer and internet use, educational
attainment, industry, occupation, marital status, minimum wage work,
poverty, volunteering, women’s employment, and youth employment.
The survey was redesigned in 1994 to take into account changing patterns of life in the United States. The redesign was implemented to collect
more monthly data on earnings and hourly wages, child care problems,
and problems associated with being laid off.
The March Supplement to the CPS, the Annual Social and Economic
Supplement, gives data on income, poverty, and health insurance for the
country. Tables for this information are available at the U.S. Census web
page, and downloads of these data are available at http://www.bls.census.gov/cps_ftp.html#cpsmarch
for 1998 to 2008. Other data in this March
Supplement include region, state, principal city, city size, rental subsidies, receipt of food stamps and other government assistance, unearned
income, taxes paid, family type, income percentile, disability status
and disability income, reasons for missing work, type of worker, involvement in worker training, type of health insurance, health status, and reasons for receiving Social Security income or Supplemental Security
Income. All of this information is available at the household, family, and
individual levels. Note that the public CPS does not include tax information (and thus no information is given on the earned income tax credit
or taxes paid). It also does not include capital gains income. Also note
that roughly the top 6% of income is top coded, meaning that good
estimates cannot be made of the top income earners in the country.
Internal versions of the CPS that do not contain these top codes are
available for researchers willing to work on site (check with the U.S.
Census Bureau, Center for Economic Studies, for locations).
The CPS data are used by researchers in studies on a variety of subjects. Topics include labor, in which studies have been conducted on
union membership and coverage (Hirsch & Macpherson, 2003), job stability (Jaeger & Stevens, 1999), minimum wage (Burkhauser, Couch, &
Wittenberg, 2000), and the labor market skills of recent male immigrants
(Funkhouser & Trejo, 1995). The CPS also has been used to provide
estimates of adult cigarette smoking by state and region (Shopland,
Hartman, Gibson, Mueller, Kessler, & Lynn, 1996), income inequality
and health status (Burkhauser, Fend, & Larrimore, 2008; Mellor & Milyo,
2002), expectations of work for single mothers (Burkhauser, Daly,
Larrimore, & Kwok, 2008), and child support from maritally disrupted
men (Cherlin, Griffith, & McCarthy, 1983).
CPS data can be found at the National Bureau of Economic Research
(NBER) website and at The
Pennsylvania State University Simple Online Data Archive for Population
Studies (SODA POP) at
cps/dnd, where codebooks for the data also are available. Basic monthly
data and supplements are available in SAS, SPSS, and Stata data files.
Basic monthly data, however, are available only from 1976 to the present,
and supplemental data are available only for 1964 to the present.
Data also can be found at the ICPSR, and a number of people have
created datasets that include the March Supplement to the CPS from
the 1960s to the 1990s. Data Ferret also has all data from the CPS for
all months from January 1994 to the present. The variables are well-organized on the Data Ferret web site, so that you can download information on many topical areas, including food security (1995–2007),
fertility (1998–2008), Internet (1994–2007) and library use (2002), and
work schedules (1997–2004). The Data Ferret screen for the CPS looks
like the figure shown on the following page.
From here, you can click on any of the CPS datasets, view the variables for any of the years available, and then choose the variables for
downloading by putting them into your data basket. You would then go
to Step 2, at the top of the page, and either download them or make a
table that can be downloaded. If you are downloading the data (and not
the tables), you can do so in formats for all of the popular statistical packages.

The Developmental Victimization Survey (DVS) is a longitudinal study
that collects data about children’s experiences with victimization and
adversity, children’s mental health and delinquent behaviors, and child,
parent, and household characteristics and demographics. The Crimes
Against Children Research Center conducted the research with funding
from the U.S. Department of Justice’s Office of Juvenile Justice and
Delinquency Prevention. The dataset includes information about 2,030
children aged 2 to 17 years, collected in 2002–2003 and again in 2003–
2004 (N=1,467). The study uses the Juvenile Victimization Questionnaire
to collect victimization information as well as the Trauma Symptom
Checklist for Children (TSCC) and Trauma Symptom Checklist for
Young Children (TSCYC) for other variables of interest.
The data are nationally representative for children aged 2 to 17 living
in the contiguous United States. The data has been weighted to account
for the number of eligible children in each household and the undersampling of Black and Hispanic children, and to make the sample equal to
the national child population. These weights are based on July 2002
census data.
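To see why weighting for the number of eligible children matters, consider the following Python sketch. It covers only the within-household selection step, under the assumption that one child is chosen at random from the k eligible children in a household; the records are invented, and the actual DVS weights also adjust for the other factors described above. In practice you would use the weight variable supplied with the dataset.

```python
# Toy sample: one selected child per household (values are illustrative).
sample = [
    {"eligible_children": 1, "victimized": 1},
    {"eligible_children": 3, "victimized": 0},
    {"eligible_children": 2, "victimized": 1},
    {"eligible_children": 2, "victimized": 0},
]

def selection_weight(record):
    """One child is picked out of k eligible children, so the selection
    probability is 1/k and the inverse-probability weight is k."""
    return record["eligible_children"]

def weighted_proportion(sample, outcome):
    """Weighted share of children with outcome == 1."""
    total_weight = sum(selection_weight(r) for r in sample)
    weighted_hits = sum(selection_weight(r) * r[outcome] for r in sample)
    return weighted_hits / total_weight

unweighted = sum(r["victimized"] for r in sample) / len(sample)
weighted = weighted_proportion(sample, "victimized")
print(unweighted, weighted)  # 0.5 0.375
```

In this toy example the unweighted estimate overstates the rate because children from larger households, who each had a smaller chance of selection, are underrepresented until the weights are applied.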
Researchers collected data during telephone interviews, using list-assisted, random-digit dialing to select participants, with 70% of those
eligible agreeing to participate. Caregivers, usually the parents, provided
family demographic information. Researchers selected a sample child
from all children in the household by selecting the child with the most
recent birthday. If that child was under 10 years old, the researchers
interviewed the caregiver or parent most familiar with the child’s daily
routines, using a caregiver version of the survey. If the child was between
10 and 17 years of age, the researcher interviewed the child.
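The selection and respondent rules described above can be summarized in a short Python sketch. The field names, including days_since_birthday, are hypothetical conveniences for illustration and are not DVS variables.

```python
def select_sample_child(children):
    """Pick the child whose birthday occurred most recently."""
    return min(children, key=lambda c: c["days_since_birthday"])

def choose_respondent(child):
    """Children under 10 are reported on by a caregiver; children aged
    10 to 17 answer for themselves."""
    if not 2 <= child["age"] <= 17:
        raise ValueError("child is outside the 2-17 age range")
    return "caregiver" if child["age"] < 10 else "child"

household = [
    {"name": "A", "age": 12, "days_since_birthday": 200},
    {"name": "B", "age": 7, "days_since_birthday": 35},
]
picked = select_sample_child(household)
print(picked["name"], choose_respondent(picked))  # B caregiver
```

When analyzing the data, keep in mind that this rule means information on younger children comes from proxy reports, whereas information on older children is self-reported.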
The DVS contains a variety of variables related to child health,
victimization, child behavior, and parental, household and neighborhood characteristics. Child health variables include the presence and age
of diagnosis for posttraumatic stress disorder (PTSD), anxiety, attention
deficit disorder (ADD) or attention deficit hyperactivity disorder
(ADHD), oppositional/defiant disorder (ODD) or conduct disorder,
autism, developmental delay or retardation, depression, and learning
disorders. The survey also provides data on the presence of certain
emotions and related behaviors, such as sadness, fear, worry, anger, the
feeling that one is hated or disliked, aggressiveness, tantruming, crying,
daydreaming, and absentmindedness, in addition to such attitudes or
behaviors as calling someone bad, throwing things, wanting to hurt self
or wishing for death, and hurting, arguing, or yelling at others. Variables
also provide information about participation in counseling or therapy.
The DVS collects information about five types of victimization experienced by children using the Juvenile Victimization Questionnaire (JVQ).
The five categories and the suboffenses include (a) conventional crime
(e.g., robbery, personal theft, vandalism and assault); (b) child maltreatment (e.g., physical abuse, psychological/emotional abuse, neglect and
custodial interference/family abduction); (c) peer and sibling victimization (e.g., gang or group assault, assault, bullying, and dating violence);
(d) sexual victimization (e.g., sexual assault, rape, flashing/sexual
exposure, verbal sexual harassment, and statutory rape and sexual misconduct) and (e) witnessing violence and indirect victimization (e.g.,
witnessing domestic violence, assault, burglary, murder, and shooting).
Additional victimization variables include sexual or pornographic photos
of child (dropped from the revised JVQ), experiences and concern for
other adverse life events (e.g., kidnapping, natural disasters, bad accidents, incarcerated parents), and exposure to media violence and subsequent impact (e.g., exposure to the 9/11 attacks and the DC sniper
shootings). Behavioral variables include information about delinquency,
school, and leisure behaviors. The dataset includes data on 17 delinquency behaviors during the past year, including such variables as breaking things, hitting, stealing, cheating, skipping school, graffiti, loudness,
weapon possession, not paying for things, drinking, smoking and using
drugs. The DVS also collects information about a child’s school attitudes
and behaviors, such as whether the child likes school, how often she talks
with a parent about school, her grades, special services received at school,
involvement in sports or clubs, homework, after-school care, transportation home from school, and where she spends time when in school. If the
child attends day care, variables include the type of caregiver and hours
spent with someone other than the parent or relative in the home.
Parental variables include status of the parents in the household
relative to the child, such as adoptive father, stepfather, biological
mother, adoptive mother, stepmother, mother’s unmarried partner (not
a parent to child), father’s unmarried partner (not a parent to child),
and the age at which the child stopped living with her biological family
(if applicable).
Parental, household and neighborhood characteristics provide
information about the environments in which the child resides. Parental
characteristics provide data on their warmth, supervision and criticism
of their child, communication activities, and knowledge about the child’s
activities and friends. Household variables include income and public
assistance receipt (including TANF, Women Infants and Children [WIC]
program, SSI, and SS), exposure to illegal drugs, marital status of parental respondent, household roster, living arrangements of child (e.g., with
adoptive or biological family), caregiving information, and experiences
with such events as homelessness, removal from the home, and parental
incarceration. Neighborhood variables include the degree to which
school, neighborhood, and town/city violence is a problem.
The DVS dataset makes it possible to look at both risk factors and
outcomes of childhood victimization, adversity, and exposure to violence. Given the array of variables, researchers can control for child,
parent, or household characteristics that may contribute to similar health
and behavior outcomes to isolate the victimization experience. The data
enable the researcher to study specific individual victimizations or
multiple victimizations treated separately or together. The Juvenile
Victimization Questionnaire was designed to use the same offense
categories as the National Crime Victimization Survey so that DVS data
can be compared with other crime statistics.
Researchers have used the DVS to study polyvictimization (Finkelhor,
Ormrod, Turner, & Hamby, 2005) as well as the relationship between
victimization and delinquency (Cuevas, Finkelhor, Turner, & Ormrod,
2007), polyvictimization and trauma symptoms (Finkelhor, Ormrod, &
Turner, 2007) and family structure and victimization (Turner, Finkelhor,
& Ormrod, 2007). Other studies provide information about differences
across ages in number and forms of violence (Finkelhor, Ormrod, &
Turner, 2008; Finkelhor, Ormrod, Turner, & Hamby, 2005) and in the
nature and impact of peer and sibling violence (Finkelhor, Turner, &
Ormrod, 2006). Researchers have also studied revictimization risk patterns (Finkelhor, Ormrod, & Turner, 2007), sociodemographic variation
in exposure to violence (Turner, Finkelhor, & Ormrod, 2006) and victimization as a predictor of children’s mental health (Turner, Finkelhor,
& Ormrod, 2006).
Data from the 2002–2003 wave are available through the National
Data Archive on Child Abuse and Neglect (http://www.ndacan.). The codebook for the
DVS is also available online.

The Early Childhood Longitudinal Study (ECLS) consists of three longitudinal studies that gather data on children’s early life experiences;
cognitive, social, emotional, and physical development; home and school
environments; school readiness; and pre-school and school experiences.
The three studies include one birth and two overlapping school-age
cohorts: the birth cohort (ECLS-B), the kindergarten class of 1998–1999
(ECLS-K), and the kindergarten class of 2010–2011 (ECLS-K:2011).
Children, parents, child care providers, teachers, and school administrators provide data. Fathers also respond about their relationships with
their children. The ECLS allows researchers to study how family, school,
community, and individual factors affect school performance and how
early childhood experiences affect later developments.


All of the ECLS samples are nationally representative. The birth
cohort is representative of children born in 2001, whereas the
kindergarten cohorts of 1998–1999 and 2010–2011 are representative
of kindergarten children attending public and private, full- and partial-day kindergarten during the respective sampling year. In total, 19,173
children participated in the study, and 1,277 schools were asked to
participate. Of these 1,277 schools, 74% agreed to participate. There
was a 64% response rate from the children. The study used data from
children of varied socioeconomic and racial/ethnic backgrounds
and oversampled Asian and Pacific Islander, American Indian, Alaska
Native, and Chinese children; twins; and children with low and very low
birth weight.
Data are collected in a variety of ways. The information from the
ECLS-B and the ECLS-K was obtained through interviews of the sample
members’ parents and school administrators, teacher questionnaires,
and observation of children’s participation in one-on-one assessment
activities. An English proficiency screener was used for English as a
Second Language (ESL) students. The birth cohort also involved a socioemotional direct assessment, which was carried out by videotaping the
child with her parent.
The birth cohort followed children from birth through kindergarten
and includes 14,000 children. Data were collected when the children
were 9 months old (N=approximately 10,700), 2 years old (2003,
N=approximately 9,800), pre-school age (2005, N=approximately
8,900), and age-eligible for kindergarten (2006 and 2007). The birth
cohort dataset contains variables relating to the child’s health, such as
height, weight, and body mass index; general mental ability; fine and
gross motor skills; behaviors such as attentiveness and social engagement; attachment to parent; language and literacy development (and
related parental behaviors to promote such development); math skills;
color knowledge; and education during the years from birth to kindergarten. Community support variables include frequency of visiting with
neighbors and receiving community mental health services. Families
report receipt of public assistance, such as Food Stamps, WIC, and
TANF; they are also asked about periods of food insecurity. A variety
of self-reported mental and emotional health questions are asked of
parents. Different variables are collected at different ages.




The kindergarten class of 1998–1999 followed children from
kindergarten through Grade 8. Data collection points included kindergarten (1998–1999) and Grade 1 (1999–2000), and the springs of
Grades 3 (2002), 5 (2004) and 8 (2007). This survey collects data on
children’s cognitive, social, emotional, and physical development,
and home environment and school characteristics. Specific questions
are asked in the eighth-grade survey about the child’s height
and weight and the child’s feelings about these, amount of exercise,
physical education classes taken, availability of certain types of food
in school, and the purchasing of food in school, including soda and
vegetables. School administrators are asked about characteristics of
the school, including percent of students by race; participation in
school breakfast and lunch programs; academic options for students;
problems with drugs, crime, and racial tensions; and the involvement of
parents in academic and nonacademic activities. The kindergarten class
of 2010–2011 cohort will follow students from kindergarten through
Grade 5.
The two kindergarten cohorts are able to track children from kindergarten during different policy environments, to see how policy changes
affect outcomes. Thus, changes in policy, such as the No Child Left
Behind Act and expansions in school choice, will allow researchers to see
how these policy actions affect child-related outcomes (http://nces.ed.).
The data from the Early Childhood Longitudinal Study have been
used to study different health factors, such as overweight (Judge & Jahns,
2007), depression and mental health (Huang, Wong, Ronzio, & Yu,
2006), birth complications (Beaver & Wright, 2005), and nutrition
(Jacknowitz, Novillo, & Tiehen, 2007). Several research studies have
examined the effects of race, such as the effects of racial identification
(Brunsma, 2005), segregation, (Reardon, Yun, & Kurlaender, 2006), the
effects of racial and ethnic diversity on birthweight (Teitler, Reichman,
Nepomnyaschy, & Martinson, 2007), and the impact of computer technology on African American children (Judge, 2005). Other studies have
examined kindergarten variables, such as the effects of delayed starts
(Datar, 2006), the effects of class size (Milesi & Gamoran, 2006),
and effects of full-day versus half-day kindergarten (DeCica, 2007).
Also, research has been done using ECLS data pertaining to income,
particularly the relationship of material hardship and parenting and
child development (Gershoff, Aber, Raver, & Lennon, 2007), food insecurity and hunger and their effect on learning in the classroom (Winicki
& Jemison, 2003), religion (Bartkowski, Xu, & Levin, 2007), physical
activity (Carlson, Fulton, Lee, Maynard, Brown, Kohl, & Dietz, 2008;
Beets & Foley, 2008), inequality in cognitive skills and academic achievement (Downey, von Hippel, & Broh, 2004; Foster & Miller, 2007), and
special needs (Park, Hogan, & D’Ottavi, 2005).
Data and codebooks are available from the National Center for Education Statistics.
Some of these data are
also available at the International Archive of Education Data (IAED),
including the kindergarten class of 1998–99 and the original kindergarten class in third grade.
You can do simple analyses of these data at the IAED site. Some data are
available for public use, and other data require a data license. Some data,
primarily the birth cohort data, are available through Data Analysis
System (DAS) for statistical analysis online.
Below, I use the Survey Documentation and Analysis system to examine
some cross tabulations for the third-grade class, for the variables capturing “litter near the school” and “attentive teachers in school.”



Secondary Data Analysis

And get the following results.

As you can see from the first table, this system can produce more elaborate
statistics than those shown here, but the scope of these analyses is very
limited. By downloading the entire datasets, you can better control the
variables and use them for more sophisticated analyses. Below, the data for
the kindergarten base and kindergarten–third grade samples were downloaded,
and SAS was used to merge the two files. SAS variable definitions are
supplied with both downloads, but the user must supply the location and
filename of the data, as shown below. Both SAS programs are very large, so
only the top of the data step for each dataset is shown.
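data a;
infile 'c:\10326250\ICPSR_04075\DS0001\04075-0001-Data.txt'
   lrecl=5798 n=2 missover pad;
input
   @1 CHILDID $8.
   /* ... remaining variable definitions from the downloaded SAS file ... */
   ;
run;

data b;
infile 'c:\10326273\ICPSR_03676\DS0001\03676-0001-Data.txt'
   lrecl=5250 n=3;
input
   @1 CHILDID $8.
   /* ... remaining variable definitions from the downloaded SAS file ... */
   ;
run;

/* Match-merge the two waves on the child identifier. */
data c; merge a b; by childid; run;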
NOTE: There were 15305 observations read from the dataset WORK.A.
NOTE: There were 17212 observations read from the dataset WORK.B.
NOTE: The dataset WORK.C has 17707 observations and 8848 variables.

Data c gives you the merged dataset of the two waves of data.
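The three log notes are worth reconciling before any analysis. If CHILDID is unique within each file, the merged count implies that 15,305 + 17,212 - 17,707 = 14,810 children appear in both files, 495 appear only in dataset a, and 2,402 appear only in dataset b. The following is a minimal sketch of one way to confirm this, assuming datasets a and b have been read in as above. The flag names from_a and from_b are illustrative and are not part of the ECLS files.

```sas
/* MERGE ... BY expects both inputs sorted by the BY variable.   */
/* Sorting is harmless if the files are already in CHILDID order. */
proc sort data=a; by childid; run;
proc sort data=b; by childid; run;

data c;
  merge a(in=in_a) b(in=in_b);
  by childid;
  /* IN= variables are temporary, so copy them to keep them in c. */
  from_a = in_a;   /* 1 if the child has a record in dataset a */
  from_b = in_b;   /* 1 if the child has a record in dataset b */
run;

/* Cross-tabulate the flags to count matched and unmatched children. */
proc freq data=c;
  tables from_a*from_b / norow nocol nopercent;
run;
```

In general, when two files in a match-merge share a variable name other than the BY variable, the value from the dataset listed last in the MERGE statement (here b) replaces the value from the earlier one. Wave-specific variables that share a name would therefore need to be renamed before merging.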

The Fragile Families and Child Well-Being Study (FFCWS), a longitudinal study, gathers demographic, relationship, health, well-being, and
parental capacity data on non-marital couples, facilitating examination
of how these factors, as well as contextual and environmental conditions,
affect their children. Particular attention is paid to fathers. To date, the
study has followed couples and their children over a nine-year period,
with initial interviews occurring between February 1998 and September
2000. The initial interviews involved each parent and were conducted in
the hospital within 24 hours of the child’s birth. Subsequent interviews
occurred when the children were 1, 3, and 5 years old. A nine-year
follow-up was conducted between the summer of 2007 and the end of
2009, incorporating the core study, an in-home study, and a teacher
study. The survey includes data collected through parental interviews,
in-home interviews, and collaborative studies using administrative
records, in-depth qualitative interviews, and surveys.
Data were collected from 4,898 families, including 3,712 unmarried
couples and 1,186 married couples, from 20 U.S. cities with populations
over 200,000. The sample sizes for the follow-up years are: 4,270 mothers
of whom 1,029 were married and 3,241 were unmarried at the time of
birth (one year); 4,140 mothers of whom 1,012 were married and 3,128
were unmarried (three year); and 4,055 mothers of whom 975 were married and 3,080 were unmarried (five year). Approximately half the sample
is non-Hispanic Black and a third is Hispanic. National weights make the
data of 16 of the 20 cities representative of nonmarital births in U.S. cities
with populations over 200,000. City weights can be applied to make the
data representative of the sample cities, an option that may be of
particular benefit to those wishing to examine conditions in cities that
were strategically sampled so as to maximize variability in economic and
policy environments.
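The follow-up counts reported above can be turned into retention rates relative to the 4,898 baseline families. The sketch below is illustrative arithmetic only. The dataset and variable names (ffcws_n, wave, pct_retained, and so on) are made up for the example and do not correspond to FFCWS variables.

```sas
data ffcws_n;
  input wave $ mothers married unmarried;
  baseline = 4898;                          /* families interviewed at birth */
  check = (married + unmarried = mothers);  /* 1 if subgroups sum to total   */
  pct_retained = 100 * mothers / baseline;  /* share of baseline retained    */
  format pct_retained 5.1;
  datalines;
year1 4270 1029 3241
year3 4140 1012 3128
year5 4055 975 3080
;
run;

proc print data=ffcws_n noobs;
  var wave mothers pct_retained check;
run;
```

The married and unmarried counts sum to the reported total in every wave, and the implied retention is roughly 87.2 percent at year one, 84.5 percent at year three, and 82.8 percent at year five.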
Parental interviews gathered information on attitudes, relationships,
parenting behavior, demographic characteristics, mental and physical
health, economic and employment status, neighborhood characteristics,
and program participation. The mother’s questionnaire included far
more comprehensive birth father data than are available in most other
data sets, which allows for comparisons of fathers’ perceptions of their
roles and relationships and mothers’ perceptions of the same. The
in-home interview gathered information on children’s cognitive and
emotional development, health, and home environment. Studies developed in collaboration with the FFCWS provide additional information
on parents’ medical, employment, and incarceration histories; religion;
child care; and early childhood education.
The mother and father datasets contain 333 and 338 variables,
respectively. The FFCWS covers numerous variables including: parental
sexual activity; contact with Child Protective Services regarding sexual
abuse; incarceration; home environment; neighborhood information;
foster care; disability status of parent(s) and/or children; government
program participation; and health insurance coverage. The FFCWS also
takes into c