When Seeing is Believing: Generalizability and Decision Studies for Observational Data in Evaluation and Research on Teaching

Thursday 1:10 pm – 1:35 pm PT / 2:10 pm – 2:35 pm MT / 3:10 pm – 3:35 pm CT / 4:10 pm – 4:35 pm ET Online
Concurrent Session

Tim Weston, University of Colorado at Boulder
Charles Hayward, University of Colorado at Boulder
Structured classroom observations are widely used in research and evaluation of STEM classrooms to characterize teaching and learning activities. Because conducting observations is typically resource intensive, it is important that inferences about change from pre/post and between-group comparisons can be made with confidence. While much attention to observational data focuses on interrater reliability (the agreement between independent raters), the reliability of a single-class measure over the course of a semester receives less attention. We examined the use and limitations of observational data for characterizing and evaluating teaching practices, asking how many observations are needed during a typical course to make confident inferences about those practices. We conducted two studies grounded in generalizability theory to calculate reliabilities given class-to-class variation in teaching over a semester. We used 177 in-class observations from 32 undergraduate mathematics courses, collected with the TAMI-OP observational protocol, which collects 2-minute observations of 11 student and 9 instructor behaviors. For our data, eleven observations of class periods over the length of a semester were needed to achieve a reliable measure, defined as a reliability coefficient of G = .8 or greater. Additionally, we found that activity codes differed in the level of rater agreement needed to achieve a reliable measure; codes for teacher and student questions showed the lowest rater agreement. Eleven observations is far more than the one to four class periods typically observed in the literature. Comparing the class-to-class variability that drives reliability in our results with that reported in other studies suggested that our data were slightly more variable, but that needing eleven observations was not an outlier.
Findings suggest practitioners may need to devote more resources than anticipated to achieve reliable measures and comparisons in research and evaluation that incorporates observational data.
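
To illustrate how a decision study projects reliability from the number of observed class periods, the following is a minimal sketch, not the authors' analysis. It assumes the simplest one-facet design (class observations nested within courses, relative error only), in which the projected coefficient is G(n) = var_course / (var_course + var_residual / n). The function names and the variance components are hypothetical and are not estimates from the study; the design actually analyzed may include additional facets such as raters.

```python
def g_coefficient(var_course, var_residual, n_obs):
    """Projected reliability of a course mean over n_obs observed class periods.

    Assumes a one-facet design: var_course is between-course (true score)
    variance, var_residual is class-to-class variance within a course.
    """
    if n_obs < 1:
        raise ValueError("n_obs must be at least 1")
    return var_course / (var_course + var_residual / n_obs)


def min_observations(var_course, var_residual, target=0.8, max_obs=200):
    """Smallest number of observations whose projected G reaches the target.

    Returns None if the target is not reached within max_obs observations.
    """
    for n_obs in range(1, max_obs + 1):
        if g_coefficient(var_course, var_residual, n_obs) >= target:
            return n_obs
    return None


# Hypothetical variance components for illustration only,
# NOT estimates reported in the study.
VAR_COURSE = 1.0
VAR_RESIDUAL = 1.6

if __name__ == "__main__":
    for n in (1, 4, 7):
        print(n, round(g_coefficient(VAR_COURSE, VAR_RESIDUAL, n), 3))
    print("observations needed:", min_observations(VAR_COURSE, VAR_RESIDUAL))
```

With these made-up components, class-to-class variance is 1.6 times the between-course variance, and seven observations are the fewest that reach G = .8. The larger the class-to-class variance relative to between-course variance, the more observations the same target requires.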