Avoiding a Rush to Judgment: Teacher Evaluation and Teacher Quality

Comprehensive methods of evaluating teachers that avoid the typical "drive-by" evaluations can promote improvements in teaching.

The troubled state of teacher evaluation is a glaring and largely neglected problem in public education, one with consequences that extend far beyond the current debate over performance pay. Because teacher evaluations are at the center of the educational enterprise — the quality of teaching in the nation's classrooms — they are a potentially powerful lever of teacher and school improvement. But that potential is being squandered throughout public education, an enterprise that spends $400 billion annually on salaries and benefits.

The task of building better evaluation systems is as difficult as it is important. Many hurdles stand in the way of rating teachers fairly on the basis of their students' achievement, the solution favored by many education experts today. And it's increasingly clear that it's not enough merely to create more defensible systems for rewarding or removing teachers. Teacher evaluations pay much larger dividends when they also play a role in improving teaching.

This article explores the causes and consequences of the crisis in teacher evaluation. And it examines a number of national, state, and local evaluation systems that point to a way out of the evaluation morass. Together, they demonstrate that it's possible to evaluate teachers in much more productive ways than most public schools do today.


It's hard to expect people to make a task a priority when the system they are working in signals that the task is unimportant. That's the case with teacher evaluation.

Public education defines teacher quality largely in terms of the credentials that teachers have earned, rather than on the basis of the quality of the work they do in their classrooms or the results their students achieve.

It's not surprising, then, that measuring how well teachers teach is a low priority in many states. The nonprofit National Council on Teacher Quality (NCTQ) reports that, despite many calls for performance pay coming from state capitals, only fourteen states require school systems to evaluate their public school teachers at least once a year, while some are much more lax than that. Tennessee, for example, requires evaluations of tenured teachers only twice a decade (NCTQ 2007a).

An NCTQ analysis of the teacher contracts in the nation's fifty largest districts (which enroll 17 percent of the nation's students) suggests that not much teacher evaluation is enshrined in local regulations, either. Teachers union contracts dictate the professional requirements for teachers in most school districts. But the NCTQ study found that only two-thirds of them require teachers to be evaluated at least once a year and a quarter of them require evaluations only every three years (NCTQ 2007b).

The evaluations themselves are typically of little value — a single, fleeting classroom visit by a principal or other building administrator untrained in evaluation wielding a checklist of classroom conditions and teacher behaviors that often don't even focus directly on the quality of teacher instruction. "It's typically a couple of dozen items on a list: 'Is presentably dressed,' 'Starts on time,' 'Room is safe,' 'The lesson occupies students,'" says Michigan State University professor Mary Kennedy, author of Inside Teaching: How Classroom Life Undermines Reform, who has studied teacher evaluation extensively. "In most instances, it's nothing more than marking 'satisfactory' or 'unsatisfactory.'"

It's easy for teachers to earn high marks under these capricious rating systems, often called "drive-bys," regardless of whether their students learn. Raymond Pecheone, co-director of the School Redesign Network at Stanford University and an expert on teacher evaluation, suggests by way of example that a teacher might get a "satisfactory" check under "using visuals" by hanging up a mobile of the planets in the Earth's solar system, even though students could walk out of the class with no knowledge of the sun's role in the solar system or other key concepts. These simplistic evaluation systems also fail to be remotely sensitive to the challenges of teaching different subjects and different grade levels, adds Pecheone.

Unsurprisingly, the results of such evaluations are often dubious. Donald Medley of the University of Virginia and Homer Coker of Georgia State University reported in a comprehensive 1987 study, "The Accuracy of Principals' Judgments of Teacher Performance," that the research up to that point found the relationship between the average principal's ratings of teacher performance and achievement by the teachers' students to be "near zero."

Principals fared better in a recent study by Brian Jacob of Harvard's Kennedy School of Government and Lars Lefgren of Brigham Young University (2005) that compared teacher ratings to student gains on standardized tests. Principals were able to identify with some accuracy their best and worst teachers — the top 10 or so percent and the bottom 10 or so percent — when asked to rate their teachers' ability to raise math and reading scores.

But principals don't put even those minimal talents to use in most public school systems. A recent study of the Chicago school system by the nonprofit New Teacher Project (2007), for example, found that 87 percent of the city's 600 schools did not issue a single "unsatisfactory" teacher rating between 2003 and 2006. Among that group of schools were sixty-nine that the city declared to be failing educationally. Of all the teacher evaluations conducted during those years, only 0.3 percent produced "unsatisfactory" ratings, while 93 percent of the city's 25,000 teachers received top ratings of "excellent" or "superior."

And principals use evaluations to help teachers improve their performance as rarely as they give unsatisfactory ratings. They frequently don't even bother to discuss the results of their evaluations with teachers. "Principals are falling prey to fulfilling the letter of the law," says Dick Flannery, director of professional development for the National Association of Secondary School Principals, a principals' membership organization. "They are missing the opportunity to use the process as a tool to improve instruction and student achievement."

New models

A small number of local, state, and national initiatives have sought a different solution to drive-by evaluations — comprehensive evaluation systems that measure teachers' instruction in ways that promote improvement in teaching.

The Teacher Advancement Program (TAP) is a good example. Launched by the Milken Family Foundation in 1999 and now operated by the nonprofit, California-based National Institute for Excellence in Teaching, TAP is a comprehensive program to strengthen teaching through intensive instructional evaluations, coaching, career ladders, and performance-based compensation. It's now in 180 schools with 5,000 teachers and 60,000 students in five states and the District of Columbia.

Standards for Teaching

TAP measures teaching against standards in three major categories — designing and planning instruction, the learning environment, and instruction — and nineteen subgroups targeting things like how well lessons are choreographed, the frequency and quality of classroom questions, and ensuring that students are taught challenging skills like drawing conclusions.

Schools using TAP evaluate their teachers using a rubric that rates performance as "unsatisfactory," "proficient," or "exemplary." Standards and rubrics such as TAP's "create a common language about teaching" for educators, says Katie Gillespie, a fifth-grade teacher at DC Preparatory Academy, a District of Columbia charter school in its third year of using TAP. "That's crucial," says Gillespie.

Connecticut's Beginning Educator Support and Training Program (BEST), the nation's first — and, until recently, only — statewide evaluation system, draws heavily on the state's teachers in drafting standards.

The Connecticut Department of Education established BEST in 1989 to strengthen its teaching force by supplying new teachers with mentors and training and then requiring them in their second year to submit a portfolio chronicling a unit of instruction. The unit needs to involve at least five hours' worth of teaching, to capture how teachers develop students' understanding of a topic over time, something "drive-by" evaluations can't and don't do.

State-trained scorers evaluate the portfolios from four perspectives — instructional design, instructional implementation, assessment of learning, and teachers' ability to analyze teaching and learning — using four standards: conditional, competent, proficient, and advanced. The state established committees of top Connecticut teachers to draft the standards, which were circulated to hundreds of teachers, administrators, and higher-education faculty members for comment.

The nonprofit National Board for Professional Teaching Standards also has sponsored a large-scale system of teacher evaluations. It has conferred advanced certification in sixteen subjects on some 63,000 teachers nationwide since its inception in 1987, using a two-part evaluation: candidates submit a Connecticut-like portfolio and complete a series of half-hour online essays.

Teams of teachers from around the country draft standards in each certification area, and hundreds of teachers, administrators, and state and federal officials comment before the standards are finalized. The Educational Testing Service (ETS) manages the evaluation system under a contract with the National Board.

Multiple Measures

While traditional evaluations tend to be one-dimensional, relying exclusively on a single observation of a teacher in a classroom, the comprehensive models capture a much richer picture of a teacher's performance.

The National Board portfolios, for example, include lesson plans, instructional materials, student work, two twenty-minute videos of the candidate working with students in classrooms, teachers' written reflections on the two taped lessons, and evidence of work with parents and peers. That's on top of the six online exercises that National Board candidates take at one of 400 evaluation centers around the country to demonstrate expertise in the subjects they teach.

In total, National Board candidates spend between 200 and 400 hours demonstrating their proficiency in five areas: commitment to students' learning, knowledge of subject and of how to teach it, monitoring of student learning, ability to think systematically and strategically about instruction, and professional growth.

An advantage of portfolios is that, unlike standardized-test scores, they can be used to evaluate teachers in nearly every discipline. National Board certification is open to some 95 percent of elementary and secondary teachers.


Another way to counter the limited, subjective nature of many conventional evaluations is to subject teachers to multiple evaluations by multiple evaluators.

In schools using TAP, teachers are evaluated at least three times a year against TAP's teaching standards by teams of "master" and "mentor" teachers that TAP trains to use the organization's evaluation rubrics (master teachers are more senior and do less teaching than mentors). Schools combine the scores from the different evaluations and evaluators into an annual performance rating.

TAP evaluators must demonstrate an ability to rate teachers at TAP's three performance levels before TAP lets them do "live" teacher evaluations. Then TAP requires schools using the program to enter every evaluation into a TAP-run online Performance Appraisal Management System that produces charts and graphs of evaluation results, which are used to compare a school's evaluation scores to TAP evaluation trends nationally. And every year TAP ships videotaped lessons to evaluators that they must score accurately using TAP's performance levels as a prerequisite for continuing as TAP evaluators.

In Connecticut, every BEST portfolio is scored using the program's standards by three state-trained teacher-evaluators who teach the same subject as the candidate. Failing portfolios are rescored by a fourth evaluator. As in the TAP program, scorers must complete nearly a week's worth of training and demonstrate an ability to score portfolios accurately before participating in the program.

Not surprisingly, using evaluators with backgrounds in candidates' subject and grade levels, as TAP and BEST do, strengthens the quality of evaluations. "Good instruction doesn't look the same in chemistry as in elementary reading," says Mike Gass, executive director of secondary education in Eagle County, Colorado, where the district's fifteen schools use TAP.

Under traditional evaluations — done as they are by principals or assistant principals — it's rarely possible to use evaluators with backgrounds in the candidate's teaching area, especially at the middle and high school levels, where teachers typically teach only one subject. Many evaluations, as a result, focus on how teachers teach, at the expense of what they teach. Evaluators, writes Michigan State's Kennedy, "are rarely asked to evaluate the accuracy, importance, coherence, or relevance of the content that is actually taught or the clarity with which it is taught" (Kennedy 2007).

Subject-area and grade-level specialists, scoring rubrics, evaluator training, and recertification requirements like TAP's increase the "inter-rater reliability" of evaluations. They produce ratings that are more consistent from evaluator to evaluator and that teachers are more likely to trust.
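One common way to put a number on inter-rater reliability is the share of lessons on which two evaluators give the same rating, together with Cohen's kappa, which discounts the agreement expected by chance. The sketch below is illustrative only: the ratings are invented, the function names are hypothetical, and nothing here says which statistic TAP or BEST actually uses. The labels are borrowed from TAP's three performance levels.

```python
from collections import Counter

def percent_agreement(ratings_a, ratings_b):
    """Fraction of lessons on which two evaluators gave the same rating."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return matches / len(ratings_a)

def cohens_kappa(ratings_a, ratings_b):
    """Agreement between two evaluators, corrected for chance agreement."""
    n = len(ratings_a)
    observed = percent_agreement(ratings_a, ratings_b)
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    # Chance agreement: probability both evaluators pick the same label
    # if each rated independently at their own observed label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1:
        return 1.0  # both evaluators used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical ratings of the same ten lessons by two evaluators.
evaluator_1 = ["proficient", "proficient", "exemplary", "unsatisfactory",
               "proficient", "exemplary", "proficient", "proficient",
               "unsatisfactory", "exemplary"]
evaluator_2 = ["proficient", "exemplary", "exemplary", "unsatisfactory",
               "proficient", "exemplary", "proficient", "unsatisfactory",
               "unsatisfactory", "exemplary"]

agreement = percent_agreement(evaluator_1, evaluator_2)
kappa = cohens_kappa(evaluator_1, evaluator_2)
print(f"Percent agreement: {agreement:.0%}")
print(f"Cohen's kappa: {kappa:.2f}")
```

Raw agreement alone can flatter a system in which nearly everyone receives the same rating, which is why a chance-corrected figure is often reported alongside it.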

Places to Grow

Unlike traditional teacher evaluations, these systems are part of programs to improve teacher performance, not merely weed out bad apples. They are drive-in rather than drive-by evaluations. At a time when research is increasingly pointing to working conditions as being more important than higher pay in keeping good teachers in the classroom, the teachers in the comprehensive evaluation programs say that the combination of extensive evaluations and coaching that they receive helps make their working conditions more professional, and thus more attractive.

At DC Preparatory Academy, which serves 275 middle school students in northeastern Washington, D.C., using evaluations to strengthen teaching is part of the fabric of the school. The school opened in 2003 and brought on TAP in 2005. And in the TAP model, a key role of evaluations by master and mentor teachers is identifying the teachers' weaknesses that mentors will work on with teachers during the six weeks between evaluations.

"I felt I was a really good teacher before I got here," says Gillespie, in her second year at DC Prep after spending four years teaching in nearby Fairfax County, Virginia. "I got really high marks on my evaluations [in Fairfax]. But holy moly, I've learned under TAP that I've got a lot of places to grow." Some studies have suggested that teachers' performance plateaus after several years in the classroom. But few teachers in public education get the sort of sophisticated coaching that Gillespie receives under TAP; if more did, perhaps studies would reveal that their performance continued to improve.

"It makes a difference when people are constantly there to help you," adds Gillespie's colleague, seventh-grade English teacher Geoff Pecover. "The expectations are high. My principal last year in DCPS [the District of Columbia Public Schools, where Pecover taught for three years] showed up to evaluate my class with the evaluation form already filled out, and the post-conference was a waste of time. You didn't feel like you were learning anything."

To further strengthen the relationship between evaluation and instruction, TAP requires schools to have weekly, hour-long "cluster" meetings where master/mentor teachers work with teams of teachers of a particular subject or grade level.

Cost factors — time and money

Not surprisingly, comprehensive classroom evaluation systems are more time-consuming and more expensive than once-a-year principal evaluations or evaluations based only on student test scores.

In schools with complex models like TAP's, the administrative challenges of training and retraining evaluators, conducting classroom visits, and tying the evaluation system to teacher professional development activities are daunting. "We didn't realize how demanding it was," says Natalie Butler, DC Prep's principal. "You just have to make the investment."

TAP and other comprehensive evaluation models also are a lot more demanding on teachers under evaluation. The 400 hours that some candidates for National Board certification spend in that process suggest as much, and the demands are even greater on teachers facing multiple evaluations and follow-up work under programs like TAP. "The typical teacher evaluation process puts teachers in a passive role," says Catherine Fiske Natale, a Connecticut official with the state's BEST program. "This is different." But it is not unprecedented, at least by international standards. Researchers Shujie Liu of the University of Southern Mississippi and Charles Teddlie of Louisiana State University (2005) report in a study of Chinese teacher evaluation practices that Chinese teachers are expected to observe the classes of other teachers as many as fifteen times a semester and write a 1,500-word essay every semester on some aspect of their teaching experience.

At $1,000 per teacher, it would cost $3 billion a year to evaluate the nation's three million teachers using a Connecticut- or National Board-like portfolio or TAP's multiple-evaluations, multiple-evaluators model. By way of contrast, public education's price tag has surpassed $500 billion a year, including some $14 billion (about $240 per student) for teachers to take "professional development" courses and workshops that teachers themselves say don't improve their teaching in many instances.
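The arithmetic behind these figures can be checked in a few lines. The sketch below uses only the numbers quoted above; the variable and function names are illustrative, and the two percentage shares are derived here rather than stated in the article.

```python
# Figures quoted in the article (all dollar amounts are annual).
COST_PER_TEACHER = 1_000             # comprehensive evaluation, per teacher
NUM_TEACHERS = 3_000_000             # the nation's teachers
TOTAL_SPENDING = 500_000_000_000     # public education's overall price tag
PD_SPENDING = 14_000_000_000         # professional development spending

def evaluation_cost(cost_per_teacher: int, num_teachers: int) -> int:
    """Total annual cost of evaluating every teacher once at a flat rate."""
    return cost_per_teacher * num_teachers

total = evaluation_cost(COST_PER_TEACHER, NUM_TEACHERS)
share_of_spending = total / TOTAL_SPENDING   # derived, not from the article
share_of_pd = total / PD_SPENDING            # derived, not from the article

print(f"Evaluation cost: ${total / 1e9:.0f} billion")
print(f"Share of total spending: {share_of_spending:.1%}")
print(f"Relative to professional development spending: {share_of_pd:.0%}")
```

On these numbers, comprehensive evaluation would amount to well under 1 percent of total spending and roughly a fifth of what is already spent on professional development.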

Yet many school systems have been reluctant to use these resources on comprehensive evaluation systems such as TAP's. "It is really difficult to get them to use Title II monies," says Kristan Van Hook, TAP's senior vice president for public policy and development, referring to the section of NCLB that funnels some $3 billion in teacher-improvement grants to the nation's school systems. "They are very reluctant to change how they spend that money. It's tied up in things like salaries for reading tutors and class-size reduction."

Sending a message

Comprehensive evaluations — with standards and scoring rubrics and multiple classroom observations by multiple evaluators and a role for student work and teacher reflections — are valuable regardless of the degree to which they predict student achievement, and regardless of whether they're used to weed out a few bad teachers or a lot of them. They contribute much more to the improvement of teaching than today's drive-by evaluations or test scores alone. And they contribute to a much more professional atmosphere in schools.

As a result, they make public school teaching more attractive to the sort of talent that the occupation has struggled to recruit and retain. Capable people want to work in environments where they sense they matter, and using evaluation systems as engines of professional improvement signals that teaching is such an enterprise. Comprehensive evaluation systems send a message that teachers are professionals doing important work.

But superficial principal drive-bys will continue to pervade public education, and teacher evaluation's potential as a lever of teacher and school improvement will continue to be squandered, if school systems and teachers unions lack incentives to do things differently.

Ultimately, the single salary schedule may be the most stubborn barrier to better teacher evaluations. As Kate Walsh, president of the National Council on Teacher Quality and member-designate of the Maryland State Board of Education, says: "If there are no consequences for rating a teacher at the top, the middle, or the bottom, if everyone is getting paid the same, then why would a principal spend a lot of time doing a careful evaluation? I wouldn't bother." Many teachers unions, of course, argue that the failure of principals to take evaluations seriously requires a single salary schedule.

There's no simple solution to this Catch-22. But TAP, for one, has addressed it head-on by combining comprehensive evaluations that teachers trust with performance pay. The program's comprehensive classroom evaluations legitimize performance pay in teachers' minds, and its performance-pay component gives teachers and administrators alike a compelling reason to take evaluations seriously. Pay and evaluations become mutually reinforcing, rather than mutually exclusive.


References

Jacob, B. A., and L. Lefgren. 2005. "Principals as Agents: Subjective Performance Measurement in Education." Working Paper 11463. Cambridge, MA: National Bureau of Economic Research.

Kennedy, M. M. 2007. "Recognizing a Good Teacher When You See One." Unpublished paper. East Lansing, MI: Michigan State University (June).

Liu, S., and C. Teddlie. 2005. "A Follow-up Study on Teacher Evaluation in China: Historical Analysis and Latest Trends," Journal of Personnel Evaluation in Education 18, no. 5: 253-272.

Medley, D., and H. Coker. 1987. "The Accuracy of Principals' Judgments of Teacher Performance," Journal of Educational Research 80, no. 4.

National Council on Teacher Quality. 2007a. State Teacher Policy Yearbook, National Summary, 2007. Washington, DC: NCTQ.

National Council on Teacher Quality. 2007b. Teacher Rules, Roles, and Rights. Washington, DC: NCTQ. Available online at www.nctq.org/cb

New Teacher Project. 2007. Hiring, Assignment, and Transfer in Chicago Public Schools. New York: New Teacher Project.

Toch, T. and Rothman, R. (2008). Avoiding a Rush to Judgment: Teacher Evaluation and Teacher Quality. Voices in Urban Education, No. 20, Summer 2008.


TAP is subjective and ineffective. I know because in every school that I am aware of in our state, morale and test scores have declined.
