
Mrs. Anderson shared the superintendent’s comments with the oversight team and evaluation subcommittee. Like the superintendent, team members felt conflicted by the choice between simpler logistics and a stronger evaluation design. Dr. Elm understood the dilemma all too well, but as an evaluator and an educator, she believed that a strong evaluation would result in improved program implementation and improved program outcomes.

Dr. Elm recognized that implementing the program in all classrooms in one grade level across the district would offer the weakest evaluation design and the least useful information but would likely be the simplest option logistically. Another option would be to start the program in all classrooms at two or three schools. In such a case, the other schools could be used as comparisons. For this reason, Dr. Elm explored the comparability of the six elementary schools in case the team decided to go that route. Five of the elementary schools had somewhat comparable state test scores in reading, while the sixth school had lower state test scores, and the difference was statistically significant. In addition, schools one through five had similar (and fairly homogeneous) populations, while school six had a student population of much lower socioeconomic status and a much higher percentage of English language learner (ELL) students. Because the district was interested in how the program worked with ELL students, the team knew that the evaluation needed to include school six. However, if school six were used in a three-school implementation, the team would not have a comparable school against which to benchmark its results.
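The kind of comparability check Dr. Elm performed can be sketched with a two-sample t-test on mean reading scores. This is a minimal illustration, not the district's actual analysis: the score values below are hypothetical, and the text does not specify which significance test was used.

```python
# A minimal sketch of a school-comparability check: Welch's two-sample
# t statistic comparing mean reading scores. All scores are hypothetical.
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t statistic for two independent samples."""
    na, nb = len(sample_a), len(sample_b)
    va, vb = variance(sample_a), variance(sample_b)  # sample variances (n-1)
    return (mean(sample_a) - mean(sample_b)) / ((va / na + vb / nb) ** 0.5)

school_five = [72, 75, 78, 74, 71, 77, 73, 76]  # hypothetical reading scores
school_six = [61, 64, 59, 63, 60, 62, 65, 58]   # hypothetical reading scores

t = welch_t(school_five, school_six)
# A |t| well above ~2 with samples this size suggests the gap between
# schools is unlikely to be due to chance alone.
print(round(t, 2))  # 10.61
```

With real data, the same comparison would typically be run on all six schools' score distributions, and a significant result (as here) would flag school six as non-comparable.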


Though it was not the simplest path logistically, the oversight team decided that its best option would be to structure the program in such a way as to maximize the quality of the information from the evaluation. The team chose to build a strong evaluation into the READ program design to provide the formative information needed for program improvement and valid summative information for accountability.





Follow progress on your logic model indicators carefully along the way, so you continually know how your program is doing and where it should be modified. And when the time does come to examine results in terms of your long-term goals, your logic model is critical to explaining your findings. While you may not be able to rule out all competing explanations for your results, you can provide a plausible explanation based on your program’s logic that your program activities are theoretically related to your program findings.

Finally, as mentioned above, the strength of your evaluation design, or the design rigor, directly affects the degree to which your evaluation can provide valid ongoing information on implementation and credible summative information on the program's long-term goals. A strong evaluation design is one that is built to provide credible information for program improvement, as well as to rule out competing explanations for your summative findings. A strong evaluation design coupled with positive findings is what you might hope for, but even a strong evaluation that provides findings showing dismal results from a program provides valuable and important information. Evaluation results that help you to discontinue programs that do not work are just as valuable as findings that enable you to continue and build upon those programs that do improve student outcomes.

Evaluation Methods and Tools

You have almost completed your evaluation design. The most difficult part is over—you have defined your program and built your evaluation into your program’s logic model. Using your logic model as a road map, you have created evaluation questions and their related indicators. You have decided how your evaluation will be designed. Now, how will you collect your data? You may have thought about this during the discussion on creating indicators and setting targets. After reading through the following paragraphs on methods that you might use in your evaluation, revisit your indicators to clarify and refine the methods you will use to measure each indicator.





Based on the READ oversight team’s decision about how to structure the program, Dr. Elm and the E-Team drafted the following evaluation design. They presented the design at the next oversight team meeting. The oversight team voted to approve the design as follows:

Design: Multiple-group, experimental design (students randomly assigned to classrooms by the school prior to the start of the school year and classrooms randomly assigned to the READ program group or a non-READ comparison group)

Program group (READ): 40 classrooms (22 to 25 students per classroom)

Comparison group (non-READ): 40 classrooms (22 to 25 students per classroom)

Classrooms will be stratified by grade level within a school and randomly assigned to either the READ program group or a comparison group. The READ and non-READ groups will each include 14 third-grade classrooms, 14 fourth-grade classrooms, and 12 fifth-grade classrooms.
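The stratified random assignment described above can be sketched as follows. This is a simplified illustration: classrooms are stratified by grade only (the within-school layer is omitted for brevity), and the classroom labels are hypothetical placeholders, not district identifiers.

```python
# Sketch of stratified random assignment: within each grade level,
# classrooms are shuffled and split evenly between the READ program
# group and the non-READ comparison group.
import random

def assign_groups(classrooms_by_grade, seed=None):
    """Randomly split each grade's classrooms into two equal groups."""
    rng = random.Random(seed)  # seeded for a reproducible assignment
    read_group, comparison_group = [], []
    for grade, classrooms in classrooms_by_grade.items():
        shuffled = classrooms[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        read_group.extend(shuffled[:half])
        comparison_group.extend(shuffled[half:])
    return read_group, comparison_group

# 28 third-grade, 28 fourth-grade, and 24 fifth-grade classrooms,
# yielding the 14 + 14 + 12 = 40 classrooms per group described above.
classrooms = {
    3: [f"3-{i}" for i in range(28)],
    4: [f"4-{i}" for i in range(28)],
    5: [f"5-{i}" for i in range(24)],
}
read, comparison = assign_groups(classrooms, seed=1)
print(len(read), len(comparison))  # 40 40
```

Stratifying before randomizing guarantees that both groups contain the same number of classrooms at each grade level, so grade composition cannot become a competing explanation for group differences.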

Enriching the evaluation design: Program theory and logic modeling will be used to examine program implementation as well as short-term, intermediate, and long-term outcomes.



Although there are many evaluation methods, most are classified as qualitative, quantitative, or both. Qualitative methods rely primarily on noncategorical, free responses or narrative descriptions of a program, collected through methods such as open-ended survey items, interviews, or observations. Quantitative methods, on the other hand, rely primarily on discrete categories, such as counts, numbers, and multiple-choice responses. Qualitative and quantitative methods reinforce each other in an evaluation, as qualitative data can help to describe, illuminate, and provide a depth of understanding to quantitative findings. For this reason, you may want to choose an evaluation design that includes a combination of both qualitative and quantitative methods, commonly referred to as mixed-method. Some common evaluation methods are listed below and include assessments and tests; surveys and questionnaires; interviews and focus groups; observations; existing data; portfolios; and case studies. Rubrics are also included as an evaluation tool that is often used to score, categorize, or code interviews, observations, portfolios, qualitative assessments, and case studies.

Assessments and tests (typically quantitative but can include qualitative items) are often used prior to program implementation (pre) and again at program completion (post), or at various times during program implementation, to assess program progress and results. Results of assessments are usually objective, and multiple items can be used in combination to create a subscale, often providing a more reliable estimate than any single item. If your program is intended to improve learning outcomes, you will likely want to use either an existing state or district assessment or choose an assessment of your own to measure change in student learning. However, before using assessment or test data, you should be sure that the assessment adequately addresses what you hope your program achieves. You would not want the success or failure of your program to be determined by an assessment that does not validly measure what your program is intended to achieve.

Surveys and questionnaires (typically quantitative but can include qualitative items) are often used to collect information from large numbers of respondents. They can be administered online, on paper, in person, or over the phone. In order for surveys to provide useful information, the questions must be worded clearly and succinctly. Survey items can be open-ended or closed-ended.

Reliability and validity are important considerations when selecting and using instruments such as assessments and tests (as well as surveys and questionnaires).

Reliability is the consistency with which an instrument measures whatever it measures. Reliability may refer to any of the following elements:

The extent to which a respondent gives consistent responses to multiple items that are asking basically the same question in different ways (internal consistency reliability).

The extent to which individuals’ scores are consistent if given the same assessment a short time later (test-retest reliability).

The extent to which different raters give consistent scores for the same open-ended response or different observers using an observation protocol give consistent scores for the same observation (inter-rater reliability).
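One widely used estimate of the first element, internal consistency, is Cronbach's alpha. The sketch below computes it over hypothetical responses to a four-item scale; the formula is standard, but the data and the function name are illustrative only, not part of the READ evaluation.

```python
# Sketch of an internal-consistency estimate: Cronbach's alpha,
# alpha = k/(k-1) * (1 - sum(item variances) / variance of total scores),
# computed over hypothetical responses to a four-item, 1-5 rating scale.
from statistics import variance

def cronbach_alpha(responses):
    """Cronbach's alpha; rows are respondents, columns are items."""
    k = len(responses[0])                 # number of items
    items = list(zip(*responses))         # responses grouped by item
    item_vars = sum(variance(item) for item in items)
    total_var = variance([sum(person) for person in responses])
    return (k / (k - 1)) * (1 - item_vars / total_var)

responses = [  # hypothetical: each row is one respondent's four ratings
    [4, 4, 5, 4],
    [2, 3, 2, 2],
    [5, 5, 4, 5],
    [3, 3, 3, 4],
    [1, 2, 1, 2],
]
print(round(cronbach_alpha(responses), 2))  # 0.96
```

A value near 1 indicates that the items move together, that is, respondents answer the "same question asked in different ways" consistently; values below roughly 0.7 are often taken as a sign that the items do not form a coherent scale.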

(See below for information on validity.)


Open-ended survey items allow respondents to provide free-form responses to questions and are typically scored using a rubric. Closed-ended items give the respondent a choice of responses, often on a scale from 1 to 4 or 1 to 5. Surveys can be quickly administered, are usually easy to analyze, and can be adapted to fit specific situations.




Validity refers to how well an instrument measures what it is supposed to measure or claims to measure. An assessment is not simply valid or not valid but rather valid for a certain purpose with a certain population. In fact, the same assessment may be valid for one group but not for another. For example, a reading test administered in English may be valid for many students but not for those in the classroom who are ELL.

Traditional views of validity classify the validity of a data collection instrument into three types: content validity, construct validity, and criterion-related validity.

Content validity addresses whether an instrument asks questions that are relevant to what is being assessed.
