Student Evaluations of Teaching and Holistic Teaching Evaluations

Explore research-informed approaches to interpreting student evaluations of teaching, with strategies to recognize bias and strengthen your teaching narrative.

Student Evaluations of Teaching (SETs) remain among the most common and consequential tools for assessing teaching effectiveness in higher education. Despite their widespread use in hiring, promotion, tenure, salary, and awards decisions, a growing body of research demonstrates that SETs are shaped by systematic biases unrelated to actual teaching quality. Drawing on existing research and established best practices, this report outlines:

  • What SETs can and cannot tell us
  • Documented sources of bias in student evaluations
  • Recommendations for more responsible interpretation of SET data
  • Strategies for strengthening teaching narratives using multiple forms of evidence

Student Evaluations of Teaching (SETs)

What They Measure and What They Do Not

Student Evaluations of Teaching, sometimes referred to as student ratings of instruction, student experience questionnaires, or student satisfaction surveys, are among the most commonly used instruments for evaluating teaching effectiveness. While student feedback can offer insights into students’ perceptions of their learning experiences, SETs do not directly measure teaching effectiveness or student learning. Decades of research have raised concerns about their validity, reliability, and susceptibility to bias, particularly when results are overinterpreted or used in isolation (Spooren et al., 2017; Benton & Cashin, 2014; Clayson, 2020).

Overreliance on SETs is especially problematic for faculty from marginalized groups. Research consistently shows that women, faculty of color, and other minoritized instructors face a greater risk of negative evaluations that are unrelated to instructional quality. When used uncritically, SETs can contribute to inequitable outcomes in faculty retention and advancement (Kreitzer & Sweet-Cushman, 2021; MacNell et al., 2015; Smith & Johnson-Bailey, 2011). For this reason, most institutional policies emphasize that student evaluations should be considered one component of a comprehensive review process, alongside other forms of evidence such as self-reflection, peer review, observation, and professional development activity reports.

Sources of Bias in Student Evaluations

Bias in student evaluations arises from multiple interacting sources, including course, instructor, and student characteristics. Course-related factors such as whether a course is required or elective, undergraduate or graduate level, large or small enrollment, or focused on challenging or controversial topics can systematically influence how students rate instructors (Ho et al., 2009; Uttl & Smibert, 2017; Zabaleta, 2007). These effects often have little to do with instructional skill and more to do with students’ expectations or levels of resistance.

Instructor characteristics play a particularly significant role in shaping student judgments. Numerous studies document that students’ evaluations are influenced by instructors’ gender (Basow & Montgomery, 2005; Boring et al., 2016; El-Alayli et al., 2018), race and ethnicity (Aruguete et al., 2017; Bavishi et al., 2010; Smith & Johnson-Bailey, 2011), age (Joye & Wilson, 2015), attractiveness (Hamermesh & Parker, 2003), accents (Subtirelu, 2015), personality traits (Clayson & Sheffet, 2006; Patrick, 2011), and other aspects of social identity (Boring et al., 2016; Heffernan, 2022b). Women and instructors of color are more likely to receive comments focused on demeanor, tone, or appearance rather than pedagogy or course design (Wallace et al., 2019). Faculty who teach about race, equity, or other socially sensitive issues may also experience student resistance, which may manifest as lower ratings (Harlow, 2003; Littleford et al., 2010).

Student characteristics further compound these effects. Factors such as students’ grade expectations, motivation levels, and prior beliefs about who “looks like” an effective instructor can shape evaluation outcomes (Bavishi et al., 2010; Centra, 2003; Clayson & Sheffet, 2006; Harlow, 2003). Taken together, these dynamics help explain why SET data often reflect structural and interpersonal bias rather than instructional quality.

Double-Bind Experiences and Harmful Commentary

One particularly harmful consequence of biased evaluations is the “double-bind” experienced by many instructors, especially women and faculty from marginalized groups (Bavishi et al., 2010; Chávez & Mitchell, 2020). Students may hold conflicting expectations, for example, expecting instructors to be both authoritative and warm. When instructors conform to stereotypes, they may be penalized for lacking rigor or professionalism; when they defy stereotypes, they may be criticized for being unapproachable or harsh (Lazos, 2012). In either case, evaluations suffer.

In some instances, student evaluations also contain abusive or unprofessional comments targeting instructors’ identities, accents, appearances, or disabilities. Such comments are not only unconstructive but can also cause significant emotional distress, while offering no meaningful information about teaching quality (Heffernan, 2022a, 2022b). These patterns further underscore the need to interpret qualitative comments with caution and to question whether such feedback should be included in high-stakes evaluations at all.

Recommendations for Improving SET Use and Interpretation

Scholars recommend several strategies for improving the use and interpretation of student evaluations (Linse, 2017; Marshik et al., 2023; Stark & Freishtat, 2014). A recent review of the literature by Kreitzer & Sweet-Cushman (2021) examined over 100 articles on the topic and offered several recommendations for equitable use of SET data.

  • First, SETs should be framed explicitly as measures of student perceptions, not as objective indicators of teaching effectiveness. SET instruments should also be carefully reviewed and updated so students can share their experiences in the course, rather than evaluate teaching.
  • Second, results should be interpreted cautiously, with attention to response rates, types of courses, and patterns over time rather than isolated numbers. Comparing faculty to one another using SET averages is especially problematic, because averages are easily skewed by outliers and small sample sizes.
  • Third, when numerical data are used, focusing on medians, modes, and trends across semesters is generally more informative than relying solely on means. Variations of a few tenths of a point are common and should not be interpreted as evidence of effective or ineffective teaching. Similarly, global questions such as “Overall, this instructor is excellent” are susceptible to measurement error, offer limited insight, and should not be treated as definitive measures of effectiveness.
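
The point about means, medians, and modes can be illustrated with a small sketch. The ratings below are hypothetical (eight responses on an assumed 1-5 scale), and the function name summarize_ratings is introduced here only for illustration; the example uses Python's standard statistics module and is not tied to any particular SET instrument.

```python
from statistics import mean, median, multimode


def summarize_ratings(ratings):
    """Summarize one section's ratings with mean, median, and mode(s)."""
    return {
        "n": len(ratings),
        "mean": round(mean(ratings), 2),
        "median": median(ratings),
        "modes": multimode(ratings),
    }


# Hypothetical section: eight responses, one of them a very low rating.
with_outlier = [5, 5, 4, 5, 4, 5, 5, 1]

# The same responses with the single low rating left out.
without_outlier = [5, 5, 4, 5, 4, 5, 5]

summary_with = summarize_ratings(with_outlier)
summary_without = summarize_ratings(without_outlier)

print(summary_with)
print(summary_without)
```

In this made-up section, one low rating shifts the mean from 4.71 to 4.25, nearly half a point, while the median and mode remain at 5. With so few responses, a difference of that size says more about a single student than about the course as a whole.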

Many experts recommend limiting or omitting qualitative comments from summative evaluations because they are often biased and may not reflect the experiences of the broader student group (Heffernan, 2022b; Marshik et al., 2023). Qualitative comments are also difficult to synthesize, and even well-intentioned reviewers can be influenced by novelty bias (giving disproportionate weight to surprising remarks) and negativity bias (tending to remember negative information more readily than positive).

Broadening the Evidence Base for Teaching Evaluation

A more equitable and meaningful approach to teaching evaluation incorporates multiple forms of evidence. Formative methods such as midterm student feedback, classroom observations for improvement, and reflective teaching journals provide opportunities for teaching improvement. Summative evidence can include peer review of teaching and course materials, analysis of student work and learning outcomes, teaching portfolios, professional development engagement, and Scholarship of Teaching and Learning (SoTL) projects.

Formative data can also inform summative evaluation when used thoughtfully. For example, instructors might document how midterm feedback led to concrete instructional changes or how peer observations informed course redesign. These forms of evidence help shift the focus from isolated student opinions to sustained, reflective teaching practice.

Shaping a Strong Teaching Narrative

When student evaluation data are used in review processes, context is essential. Instructors are often expected to provide their own narrative when analyzing data, yet they may underestimate how unfamiliar their colleagues are with their teaching context. A strong teaching narrative explains the instructional context, the student population, course goals, and pedagogical choices. It highlights patterns rather than anomalies and pairs student feedback with concrete evidence of intentional, research-informed teaching.

By integrating multiple sources of evidence and explicitly acknowledging the limitations of SETs, instructors can present a more accurate and equitable account of their teaching effectiveness.

References

Aruguete, M. S., Slater, J., & Mwaikinda, S. R. (2017). The effects of professors’ race and clothing style on student evaluations. The Journal of Negro Education, 86(4), 494–502. https://doi.org/10.7709/jnegroeducation.86.4.0494

Basow, S. A., & Montgomery, S. (2005). Student ratings and professor self-ratings of college teaching: Effects of gender and divisional affiliation. Journal of Personnel Evaluation in Education, 18(2), 91–106. https://doi.org/10.1007/s11092-006-9001-8

Bavishi, A., Madera, J. M., & Hebl, M. R. (2010). The effect of professor ethnicity and gender on student evaluations: Judged before met. Journal of Diversity in Higher Education, 3(4), 245–256. https://doi.org/10.1037/a0020763

Benton, S. L., & Cashin, W. E. (2014). Student ratings of instruction in college and university courses. In M. B. Paulsen (Ed.), Higher education: Handbook of theory and research (Vol. 29, pp. 279–326). Dordrecht, The Netherlands: Springer.

Centra, J. A. (2003). Will teachers receive higher student evaluations by giving higher grades and less course work? Research in Higher Education, 44(5), 495–518. https://doi.org/10.1023/a:1025492407752

Chávez, K., & Mitchell, K. M. (2020). Exploring bias in student evaluations: Gender, race, and ethnicity. PS: Political Science & Politics, 53(2), 270–274. https://doi.org/10.1017/s1049096519001744

Clayson, D. E. (2020). A comprehensive critique of student evaluation of teaching: Critical perspectives on validity, reliability, and impartiality. Routledge.

Clayson, D. E., & Sheffet, M. J. (2006). Personality and the student evaluation of teaching. Journal of Marketing Education, 28(2), 149–160. https://doi.org/10.1177/0273475306288402

El-Alayli, A., Hansen-Brown, A. A., & Ceynar, M. (2018). Dancing backwards in high heels: Female professors experience more work demands and special favor requests, particularly from academically entitled students. Sex Roles, 79(3–4), 136–150. https://doi.org/10.1007/s11199-017-0872

Hamermesh, D. S., & Parker, A. M. (2003). Beauty in the classroom: Professors’ pulchritude and putative pedagogical productivity. The American Economist, 44, 17–29. https://doi.org/10.3386/w9853

Harlow, R. (2003). “Race doesn’t matter, but...”: The effect of race on professors’ experiences and emotion management in the undergraduate college classroom. Social Psychology Quarterly, 66(4), 348–363. https://doi.org/10.2307/1519834

Heffernan, T. (2022a). Sexism, racism, prejudice, and bias: A literature review and synthesis of research surrounding student evaluations of courses and teaching. Assessment & Evaluation in Higher Education, 47(1), 144–154. https://doi.org/10.1080/02602938.2021.1888075 

Heffernan, T. (2022b). Abusive comments in student evaluations of courses and teaching: The attacks women and marginalized academics endure. Higher Education, 85(1), 225–239. https://doi.org/10.1007/s10734-022-00831-x

Ho, A. K., Thomsen, L., & Sidanius, J. (2009). Perceived academic competence and overall job evaluations: Students’ evaluations of African American and European American professors. Journal of Applied Social Psychology, 39(2), 389–406. https://doi.org/10.1111/j.1559-1816.2008.00443.x 

Joye, S., & Wilson, J. H. (2015). Professor age and gender affect student perceptions and grades. Journal of the Scholarship of Teaching and Learning, 15(4), 126–138. https://doi.org/10.14434/josotl.v15i4.13466

Kreitzer, R. J., & Sweet-Cushman, J. (2021). Evaluating student evaluations of teaching: A review of measurement and equity bias in SETs and recommendations for ethical reform. Journal of Academic Ethics, 20(1), 73–84. https://doi.org/10.1007/s10805-021-09400-w


Written by Mayuko Nakamura, Assistant Director for Assessment and Equitable Pedagogy, Center for Integrated Professional Development. Last updated 10/24/2025
