Frequently Asked Questions
What is the difference between “assessment” and “evaluation”?
The primary difference between evaluation and assessment is in the focus and purpose of examination. Assessment involves measurement of a variable of interest. Assessment provides data that are relevant only in context. Evaluation uses results of assessment(s) to facilitate an initiative’s development, implementation, and/or improvement by examining its processes and/or outcomes. Evaluation includes judgment about some aspect of the initiative.
As an example, an assessment may find that 30% of Grade 3 students are not reading at grade level. While results of this assessment provide important information for intervention, these results become more meaningful by comparing them, for example, to previous assessment results, to results of a comparable group, or to expected outcomes of a reading program in which students participated.
Evaluation contextualizes data and information, responding to the question – “So what?” In this example, if in the prior year 50% of the same group of students in Grade 2 were not reading at grade level, then 30% in Grade 3 represents an improvement for these students. Conversely, if among a comparable group of peers only 10% of students are not reading at grade level in Grade 3, these assessment results provide a different perspective that may be key to improving the intervention. And if the reading program in which the students participated “promised” that 80% of students would be reading at grade level by Grade 3, then the assessment results may provide critical information for making decisions about the reading program.
What is the difference between “evaluation” and “research”?
Evaluators use many of the same qualitative and quantitative methodologies used by researchers in other fields. The primary difference between research and evaluation is in the purposes they intend to serve. The purpose of evaluation is to provide information for decision making about particular programs, projects, or initiatives, not to advance more wide-ranging knowledge or theory. As Stufflebeam says in Evaluation Models: A New Direction for Evaluation (2001), “The purpose of evaluation is to improve, not prove.” For example, an evaluation might ask, “How effective or ineffective is a particular initiative?”
Evaluation also differs from research in that it is specific to a program or project, asking about progress toward and achievement of its goals. Research is designed to provide results that can be generalized to other populations, settings, or contexts. For example, evaluation findings often are limited to the population being served by a program within the program’s context. Research might ask whether results can be applied in other contexts.
A third difference may be in the types of questions evaluations and research studies address, though there can be substantive overlap between the types of questions an evaluator might pose during a formative evaluation and those typically posed by researchers.
What are the first steps for research or evaluation design and planning?
Planning and design are the first steps in any systematic inquiry, as they lay the foundation upon which to build your investigation. As noted by Light, Singer, and Willett (1990) in By Design: Planning Research on Higher Education, “You can’t fix by analysis what you’ve bungled by design,” so time spent in planning and design is a worthwhile investment.
Though the initial steps of planning and design may not be linear, I always start with determining the purpose of my inquiry. A research study or evaluation can serve different (and multiple) purposes related to the same object of inquiry (“X”). One study may set out to describe or document “X”, another to solve a problem related to “X”, while another may intend to develop deeper understanding of “X” from a particular perspective.
While clarifying the purpose of an inquiry, I generally find myself posing tentative questions that seem, at least in the moment, to be relevant to the object and purpose(s) of my study. I encourage spending sufficient time generating and refining the questions of your inquiry. You likely will have many questions and should distill them until you have a manageable set, given the anticipated scope, scale, and resources of your study. Good research and evaluation questions: (a) are related directly to the purposes of the inquiry, (b) can be “answered” via the methods and resources at your disposal, and, my personal favorite, (c) are interesting and potentially beneficial to yourself and others. (See the FINER criteria as discussed by Hulley et al., 2007, in Designing Clinical Research.)
Once questions have been developed, distilled, and refined, I generate a model of my study to enhance my own understanding (and that of others) of how I will progress from questions to the anticipated outcomes of my inquiry. Models may be simple and linear or more complex. For example, a logic model depicts the components of a study visually to show how they align. A logic model is a systematic way of linking planned work (resources/inputs, activities) with intended results (outputs and outcomes).
A more complex model, such as a Theory of Change, as discussed by Chen (1990) in Theory-Driven Evaluations, goes beyond a logic model to explain (or theorize) relationships among different aspects of a study and articulate underlying assumptions and rationales that lead to outcomes. Developing a logic model or a theory of change requires the researcher to explicitly determine the purposes of an inquiry and then back-map other relevant aspects of the inquiry.
Once the questions and purpose of an inquiry are determined, the next step is to select methods that are aligned with the questions and purpose. Too often, novice researchers select a method (e.g., quantitative, qualitative) based upon preference, expertise, or experience before considering which method will best position them to respond to their questions. At the most basic level, the selection of method includes determining whether your study will use quantitative, qualitative, or mixed methods to collect and analyze data. For example, if the study’s purpose is to describe “X”, consider whether quantitative or qualitative data (or both) are most appropriate for this purpose and think ahead to what types of analyses would be most suitable to respond to your questions.
How can I determine the validity and reliability of an instrument?
Validity and reliability are two important factors to consider when developing and testing any instrument (e.g., a content assessment test or questionnaire); attending to them helps ensure the quality of measurement, and of the data collected, for your study. Validity refers to the degree to which an instrument accurately measures what it is intended to measure, while reliability refers to the degree to which an instrument yields consistent results.
Three common types of validity for researchers and evaluators to consider are content, construct, and criterion validities.
- Content validity indicates the extent to which items adequately measure or represent the content of the property or trait that the researcher wishes to measure. Subject matter expert review is often a good first step in instrument development to assess content validity.
- Construct validity indicates the extent to which a measurement method accurately represents a construct (e.g., a latent variable or phenomenon that can’t be measured directly, such as a person’s attitude or belief) and produces observations distinct from those produced by a measure of another construct. Common methods for assessing construct validity include factor analysis, correlation tests, and item response theory models (including the Rasch model).
- Criterion-related validity indicates the extent to which the instrument’s scores correlate with an external criterion (usually another measurement from a different instrument), either at present (concurrent validity) or in the future (predictive validity). A common measure of this type of validity is the correlation coefficient between the two measures (see the brief sketch below).
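As an illustration only, the short Python sketch below estimates concurrent criterion-related validity by correlating scores from a hypothetical new instrument with scores from an established criterion measure; the arrays, their values, and the use of NumPy and SciPy are assumptions for the example, not part of the discussion above.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical scores from eight respondents who completed both a new
# instrument and an established criterion measure.
new_instrument = np.array([12, 15, 9, 20, 18, 11, 14, 17])
criterion = np.array([48, 55, 40, 70, 66, 43, 52, 60])

# Concurrent criterion-related validity reported as a correlation coefficient.
r, p_value = pearsonr(new_instrument, criterion)
print(f"criterion-related validity: r = {r:.2f} (p = {p_value:.3f})")
```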
Common measures of reliability include internal consistency, test-retest, and inter-rater reliabilities.
- Internal consistency reliability looks at the consistency of the score of an individual item on an instrument with the scores of a set of items, or subscale, which typically consists of several items that measure a single construct. Cronbach’s alpha is one of the most common methods for checking internal consistency reliability.
- Test-retest reliability measures the correlation between scores from one administration of an instrument and another, usually within an interval of 2 to 3 weeks. Unlike a pre-post test, no treatment occurs between the first and second administrations when assessing test-retest reliability. A similar type of reliability, called alternate forms, involves using slightly different forms or versions of an instrument to see whether the different versions yield consistent results.
- Inter-rater reliability checks the degree of agreement among raters (i.e., those completing items on an instrument). More than one rater is commonly involved when, for example, several people conduct classroom observations using an observation protocol, or score an open-ended test using a rubric or other standard protocol. Kappa statistics, correlation coefficients, and the intra-class correlation (ICC) coefficient are some of the most commonly reported measures of inter-rater reliability. (A brief sketch of computing these reliability measures follows this list.)
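The sketch below shows, with small hypothetical data sets, one way these three reliability measures might be computed in Python: Cronbach’s alpha from its standard formula, test-retest reliability as a Pearson correlation, and inter-rater agreement as Cohen’s kappa. The data values, and the use of NumPy, SciPy, and scikit-learn, are illustrative assumptions; in practice you would substitute your instrument’s actual item scores, administration scores, and ratings.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

# Internal consistency: Cronbach's alpha for a 4-item subscale
# (rows = respondents, columns = items; hypothetical ratings).
items = np.array([
    [3, 4, 3, 4],
    [2, 2, 3, 2],
    [4, 4, 5, 4],
    [3, 3, 3, 2],
    [5, 4, 4, 5],
])
k = items.shape[1]
item_vars = items.var(axis=0, ddof=1)      # variance of each item
total_var = items.sum(axis=1).var(ddof=1)  # variance of respondents' total scores
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(f"Cronbach's alpha: {alpha:.2f}")

# Test-retest: correlate scores from two administrations of the same instrument.
time1 = np.array([10, 14, 9, 17, 12, 15])
time2 = np.array([11, 13, 10, 18, 12, 14])
r, _ = pearsonr(time1, time2)
print(f"Test-retest reliability: r = {r:.2f}")

# Inter-rater: Cohen's kappa for two raters scoring the same responses with a rubric.
rater_a = [2, 3, 1, 2, 3, 2, 1, 3]
rater_b = [2, 3, 1, 3, 3, 2, 1, 2]
print(f"Cohen's kappa: {cohen_kappa_score(rater_a, rater_b):.2f}")
```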
How can I anticipate and avoid potential data analysis pitfalls?
Tip #1: Ensure that the analysis is necessary and accurate: Before conducting data analysis, be sure you understand the purpose of the analysis, and the characteristics and components of the data with which you are working.
- Always start with questions, not data or a technique. Take time to formulate questions or hypotheses about the measurable outcomes/impacts related to the data you are analyzing, to ensure that you are collecting appropriate data or to identify possible gaps in data already collected. Don’t get hung up on using a “favorite technique” or any other fancy statistical method if it is not appropriate for your data and inquiry; doing so could limit the scope of your work by leading you to focus only on certain evaluation questions or data.
- Be aware of data “vital signs.” During early stages of analysis, and before conducting analyses to respond to research or evaluation questions, you should check the data’s “vital signs” (e.g., frequencies, missing data). This is a critical step for “validating” your data and may help you detect problems with the data (see the sketch after this list).
- One slice of data vs. all. Consider “slicing” the data you are working with (e.g., filtering the data to define subgroups), particularly if you have a large data set, to find out whether the data reveal differences in results for affected subgroups (e.g., by gender, race, or grade level). Looking at a few slices of data for internal consistency gives you greater confidence that you are measuring what you intended to measure. Slicing also can save time: instead of working with millions of records, you can work with a smaller slice or sample (as in the sketch after this list) in order to test your coding, detect key relationships, and/or make decisions about the next steps of your analysis.
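To make these two tips concrete, here is a minimal pandas sketch of checking vital signs and slicing a data set; the file name (assessment_scores.csv) and column names (grade_level, gender, reading_score) are hypothetical placeholders for your own data.

```python
import pandas as pd

# Hypothetical data file; substitute your own source.
df = pd.read_csv("assessment_scores.csv")

# Data "vital signs": size, missing values, frequencies, summary statistics.
print(df.shape)                          # number of records and variables
print(df.isna().sum())                   # missing values per column
print(df["grade_level"].value_counts())  # frequencies for a categorical variable
print(df.describe())                     # summary statistics for numeric variables

# Slicing: compare results for subgroups of interest.
print(df.groupby("gender")["reading_score"].mean())

# A small random slice is often enough to test code and spot key relationships.
sample = df.sample(frac=0.10, random_state=42)
print(sample.shape)
```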
Tip #2: Ensure the analysis is appropriate: Once you have defined the purpose of your study, and you have a good understanding of the data you will be working with, you can determine the type of statistical methods to use.
- Check data distribution. Most often, findings from a data set are represented using summary statistics (e.g., mean, median, standard deviation). While these statistics can be accurate forms of measurement for a data set, you also should examine the data’s distribution using graphical displays such as histograms or box plots. These will allow you to see important or interesting features of the data, such as a significant class of outliers, skewness, or kurtosis (Field, 2000 & 2009; Gravetter & Wallnau, 2014; Trochim & Donnelly, 2006). (The first sketch after this list illustrates such a check.)
- Observe and explore the outliers. If appropriate, investigate any outliers in your data. Outliers can reveal fundamental problems with your analysis, particularly if the outliers contain patterns; if so, consider conducting exploratory analyses to find the reason for the patterns. Consider the purpose of your study, your research questions, and your sample size before deciding whether to exclude outliers from your data, or before modifying them or lumping them together into reporting categories (Morrow, 2016).
- Don’t stop at “p-value < .05.” Typically, people rely on the p-value and consider p < .05 the gold standard for “statistically significant” findings. This practice ignores the fact that, if the sample is large, nearly any difference, no matter how small or meaningless from a practical point of view, will be “statistically significant.” Consider reporting confidence intervals and effect sizes together with the p-value to convey the magnitude and relative importance of an effect and to reach a more rigorous conclusion (Helberg, 1996; Nuzzo, 2014). (The second sketch after this list illustrates this.)
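The first sketch below illustrates the distribution check described above: summary statistics plus skewness, kurtosis, and graphical displays. The simulated scores, and the use of NumPy, SciPy, and matplotlib, are placeholders for your own data and tooling.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Simulated scores stand in for your own variable of interest.
scores = np.random.default_rng(0).normal(loc=70, scale=10, size=500)

print(f"mean = {scores.mean():.1f}, median = {np.median(scores):.1f}, "
      f"sd = {scores.std(ddof=1):.1f}")
print(f"skewness = {stats.skew(scores):.2f}, kurtosis = {stats.kurtosis(scores):.2f}")

# Graphical displays reveal shape, outliers, skewness, and kurtosis at a glance.
fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].hist(scores, bins=30)
axes[0].set_title("Histogram")
axes[1].boxplot(scores, vert=False)
axes[1].set_title("Box plot")
plt.tight_layout()
plt.show()
```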
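The second sketch goes beyond the p-value by reporting Cohen's d and a 95% confidence interval for a difference in group means alongside a t-test. The two simulated groups are placeholders, and the pooled-standard-deviation formulation of Cohen's d is one common choice rather than the only option.

```python
import numpy as np
from scipy import stats

# Two simulated groups stand in for treatment and comparison outcomes.
rng = np.random.default_rng(1)
treatment = rng.normal(loc=72, scale=10, size=200)
control = rng.normal(loc=70, scale=10, size=200)

# The t-test p-value on its own says little about practical importance.
t_stat, p_value = stats.ttest_ind(treatment, control)

# Cohen's d (pooled standard deviation) conveys the magnitude of the difference.
n1, n2 = len(treatment), len(control)
pooled_sd = np.sqrt(((n1 - 1) * treatment.var(ddof=1) +
                     (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2))
cohens_d = (treatment.mean() - control.mean()) / pooled_sd

# 95% confidence interval for the difference in means.
diff = treatment.mean() - control.mean()
se = pooled_sd * np.sqrt(1 / n1 + 1 / n2)
ci_low, ci_high = stats.t.interval(0.95, df=n1 + n2 - 2, loc=diff, scale=se)

print(f"p = {p_value:.3f}, Cohen's d = {cohens_d:.2f}, "
      f"95% CI for the difference = ({ci_low:.2f}, {ci_high:.2f})")
```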
Tip #3: Ensure that you are explaining results properly: After examining your data or data sets, and applying appropriate statistical methods to analyze data, you'll then need to determine how to interpret and report your findings.
- Correlation is not causation. Correlation is a tool every data analyst uses frequently. The biggest caution in using and interpreting correlation analyses is that they should not be treated as evidence of causation (Helberg, 1996). If two events happen close to each other or around the same time, that does not necessarily mean that one causes the other. Random assignment, a time-order relationship, and covariation are required in order to establish a causal relationship.
- Provide interpretation for lay audiences. Increasingly, it is the statistician’s or data analyst’s responsibility to provide the context behind the numbers, particularly when presenting analyses and results to people who are not data experts. This may include defining statistical terms such as “confidence interval” or “correlation,” explaining what typical effect sizes look like (good and bad), and/or explaining why a particular statistical method is unreliable or unsuitable for a specific project.
- Converse and cross-check with colleagues. Sharing your analyses with colleagues before presenting and/or sharing findings with others is a very good way of cross-checking and validating your work. Colleagues can offer opinions and suggestions based on their own experiences and additional expertise, helping to detect possible inconsistencies, correct invalid values (i.e., values outside a variable’s valid range), or find conflicts between your findings and those of previous research.