Exam Psychometrics
Video Transcription
This is the final module and installment in this series about writing and administering effective examinations, and in this final piece we're going to talk about something very important: providing quality assurance for your exams. Our main objective is to apply psychometric principles to evaluate the validity of your exam items, and we'll also talk a little bit about evaluating exam quality overall. Having been a professional educator for many years, I've had a lot of opportunities to write really good and not-so-good exams and exam items, and I'm going to share some of that with you today as we talk about working through the exam after the fact and looking at how it performed. Now that we've talked about the purpose of testing and how we construct the exam, we have to ask the question: did the exam do what we intended it to do? That's our main goal here.

As a little background, and we'll talk about some terminology here, there are various ways to measure things. When we think about very straightforward measurement, we measure mass, we measure the dimensions of something, we count something to measure its number. When we talk about psychometrics, we're talking about the science surrounding measurement of things that have to do with the human psyche, like intelligence, attitudes, beliefs, and personality traits. To frame this thinking, think about IQ tests, personality tests, and aptitude tests. There are a lot of different ways we can approach that sort of measurement and evaluation of those traits. We have written exams, which are the main point of our conversation here, but we also measure things through observation, such as in simulation-based education. In those cases, your intention is that you are watching a person's performance and evaluating the quality or effectiveness of that performance, but there's a lot of error built into that. Let's say that the saturation is dropping, and student number one notices the saturation dropping in the first five seconds, while student number two takes 10 or 15 seconds to notice it. There's a really big gray area there. Is one of those absolutely right and the other absolutely wrong? Is the second student unsafe, or is that a failed performance, just because they were a little bit different in the way they approached it? We also use inventories like the Myers-Briggs Type Indicator, which you have probably taken, and projective techniques like the Rorschach test. Think about all the variability that can exist in those types of measurements. So when we talk about psychometrics, the whole point is that it's not as straightforward as pulling out a ruler and saying this 2x4 is exactly 3.2 feet long, or 3 feet and 4 inches long, or whatever the measurement would be; that is a very discrete, very objective measurement. When we talk about measuring human abilities, knowledge, and performance, there's a lot of gray area, so it's not as straightforward as saying we're going to write the test questions, let the student take the exam, and then conclude that they are either 80 percent of the way to understanding this content or not, because we've got to assume that there's some
error built into all of that. Some of that error, as we'll talk about today, we have to assume came from the fact that we have written the exam questions in a way that they're not doing exactly what we thought they would do, which is to absolutely and objectively tell us this student knew the material and this student did not. We think of an exam as evaluating the student's performance: I'm going to write this exam, and it's going to tell me what the student knows or does not know. That is really predicated on the idea that the exam itself and the exam items are perfectly written, and as we talked about in the previous module on the conventions and approaches to writing exam questions, I'm sure it became very apparent to you how many different ways the writing of exam questions can go wrong. So when we think about quality assurance of our exams, we as educators need to use it to evaluate the performance of the exam itself, not just assume the exam is perfectly formulated to evaluate the student.

You will hear the term item response theory, and I'm going to talk very briefly about comparing classical test theory and item response theory, because some of the measurements we'll discuss come out of item response theory, so I want to lay this as groundwork. A lot of the psychometrics around these topics goes very deep into statistics, which is really beyond the scope of what we're trying to do in this series, so I'm going to try not to go too far into all of that. Classical test theory, which has been used for over a hundred years, simply assumes that the total score the student achieves on the exam is the sum of the true score, what they actually knew, and some error. There's an assumption that some error is built into the exam, and I think that's a really good fundamental framework for thinking about exams. Classical theory puts a bit more emphasis on overall exam performance as opposed to looking at each individual item, and it also assumes that the population sampled is fairly standardized, so that each population that takes a given exam is roughly similar. In item response theory, the viewpoint is a little different: the idea is that each item has a distinct probability curve describing the probability that a correct or incorrect response will be given based on the overall ability of the examinee. Item response theory uses something called a Rasch model that defines the difficulty of each item. I'll throw that in here because it's used in very sophisticated exam platforms, not typically the ones used in anesthesia programs, but for a certification exam, for example, the adaptive exam is absolutely dependent on that Rasch-model assignment of difficulty for each item, because as the examinee answers one item correctly or incorrectly, the computer needs to deliver the next item at either a higher or lower level of difficulty, which has already been determined and tagged to each of those questions, in order to lead the student through that adaptive exam model.
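To make the Rasch idea concrete, here is a minimal sketch, not part of the lecture itself, of the Rasch item characteristic function in Python; the function name and the example ability and difficulty values are illustrative assumptions.

```python
import math

def rasch_probability(theta, difficulty):
    """Rasch model: probability that an examinee with ability `theta` answers
    an item of the given `difficulty` correctly (both on the same logit scale)."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

# When ability equals item difficulty, the probability of a correct response is 0.5,
# which is the 50 percent point used to define item difficulty on these platforms.
print(rasch_probability(theta=0.0, difficulty=0.0))   # 0.5
print(rasch_probability(theta=1.0, difficulty=0.0))   # about 0.73 (item is easy for this examinee)
print(rasch_probability(theta=-1.0, difficulty=0.0))  # about 0.27 (item is hard for this examinee)
```

An adaptive engine can then choose the next item whose tagged difficulty sits near the examinee's current ability estimate.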
Okay, so let's talk about some measures of the individual exam items. The first we'll talk about is difficulty, and difficulty really has two aspects to it. When it's employed in a very complex testing platform, such as what's used for a national certification exam, there is a number assigned to each item that describes how difficult it is to achieve a 0.5 probability, a 50 percent probability, of a correct response for that specific item given the respondent's level of the latent variable, theta, as is used in those very complex statistical models. For us, speaking more generically and more familiarly in terms of what you could probably do in your anesthesia program, we can just use the p-value, and we use that as a proxy for the proportion of people who got an exam question correct versus incorrect when the exam was administered. So we can think of it as basically the probability that the exam question was answered correctly on our given exam. Now, there's a little caveat to that: it makes the assumption that the item is well constructed, and we'll talk about that quality piece in a minute, but let's just assume for now that the exam item is written correctly and doing the right thing. Then we can use this proxy to say that the more people who get it correct, the easier the question probably was, or the lower its difficulty level, and the fewer people who got it correct, the harder or more difficult we could consider that test item to be. Again, for very complex exams like a computer-adaptive exam there is an actual computation, in units called logits, that has to be tagged to each individual exam question; for us, difficulty is just going to be reflected in the proportion of people who got the item correct or incorrect.

I'll give you a quick rule of thumb as we look at the proportion of people who got a given question correct on our exam: ideally we would like the difficulty level to be pegged a little bit higher than halfway between the level attributable to chance and 100 percent of people getting the question correct. In other words, if we have, for example, a five-response multiple-choice question, the chance that someone would get it correct just by guessing would be one out of five, or 20 percent. That would be the lowest level we'd anticipate, assuming the question is written correctly. The highest level, of course, would be 100 percent, meaning everyone who looks at that test item gets it correct. Halfway in between is 60 percent, and we want the difficulty level to be set just a little higher than that, so somewhere between 60 and 70 percent would be an ideal difficulty level for a five-response multiple-choice question. When we administer this question on an exam, if we see that 70 percent got it correct, then just in terms of evaluating the difficulty level we'd say that's a pretty good exam question. If we had a more traditional, more typical four-response multiple-choice question, the ideal difficulty would be somewhere in the 65 to 75 percent range. We'll talk in a bit about how that ideal gets shifted a little with norm-referenced versus criterion-referenced exams.
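As a rough sketch of how you could compute these p-values yourself from a scored answer sheet, here is a short Python example; the data and the ideal-difficulty helper simply encode the rule of thumb above, and all names and values are illustrative, not from the lecture.

```python
def item_p_values(scores):
    """scores: list of per-examinee lists of 0/1 item scores (rows = examinees).
    Returns the proportion of examinees answering each item correctly."""
    n_examinees = len(scores)
    n_items = len(scores[0])
    return [sum(row[i] for row in scores) / n_examinees for i in range(n_items)]

def ideal_difficulty(n_choices):
    """Rule of thumb from the lecture: midway between the chance level
    (1 / number of choices) and 1.0, with 'a little higher than halfway' acceptable."""
    chance = 1.0 / n_choices
    return chance + (1.0 - chance) / 2

scores = [
    [1, 1, 0, 1],   # examinee 1
    [1, 0, 0, 1],   # examinee 2
    [1, 1, 1, 0],   # examinee 3
    [0, 1, 0, 1],   # examinee 4
]
print(item_p_values(scores))   # [0.75, 0.75, 0.25, 0.75]
print(ideal_difficulty(5))     # 0.6   (five-option item; lecture suggests roughly 60-70%)
print(ideal_difficulty(4))     # 0.625 (four-option item; lecture suggests roughly 65-75%)
```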
So the p-value, the proportion of examinees who answered the item correctly, is simply a proportion correct. The problem with going too far toward the extremes is this: you might say, gosh, only 75 percent of people got the question right, how is that a good thing? Well, if the p-value is greater than 0.95, then essentially that question is not providing much discriminatory ability. In other words, back to module 1, if you were with us, we talked about the purpose of testing in the first place. If we're using testing to determine some ranking of people, to say these people really knew this material very well, and to use the exam to give others feedback that they missed certain points or concepts, then if everyone is getting a given item correct, it's not really telling anyone in the class that they haven't done so well. That's not always a terrible thing, because in our programs there are things we certainly hope every single student gets, so it's not an absolute negative. But if you administer an exam and every single student gets every single question correct, you're probably not getting a whole lot of benefit out of that exam in terms of giving feedback to the students or evaluating student performance. A p-value between 0.6 and 0.95 for a given question, in other words 60 to 95 percent of people getting it correct, is a pretty good range for the typical types of questions being given. If it's less than 0.5, that just demonstrates it's a very difficult exam item, not necessarily a bad one; we'll talk about the other measures of quality, but that's a test item that was pretty difficult. Again, there's no absolute to this, and we're going to interpret it based on the expectations we have of the examinees. On the one hand, for people who are very new, very novice in a field, we might expect lower performance on an exam and lower proportions of people getting given items correct. On the other hand, for certain hard-and-fast, super important rules, like if someone has a history of MH, do not turn on a volatile anesthetic, we would probably be really happy if 100 percent of our students knew that point and got that question correct on an exam. So there's a difference in examinee expectations, and then there's also the distinction between criterion-referenced and norm-referenced exams. Criterion-referenced exams are ones where we evaluate performance simply based on how the examinees responded to each individual item against the content standard. On a norm-referenced exam, which you see more with the NCLEX, the bar exam, and the national certification exam, you're trying to get more of a spread. You're trying to have an exam with a generally higher difficulty level for each individual item, because you're trying to use the exam to separate people out: you want to see who has met the bar, literally, with a bar exam, who has met the pass point with the certification exam or the NCLEX, and who is below it. With a criterion-referenced exam, such as in our programs, if we teach something well and we feel like it's all very straightforward information, and maybe we have a very advanced student, we might really expect that they should know most all of this material.
So there are going to be differences; there's not going to be a real absolute measure or standard for the proportion of students who should get a given item correct.

Okay, let's look at some measures of the actual quality of the exam items themselves, of how well the exam items are doing what we intend them to do, which is essentially to tell us and tell the examinee, you're doing really well, or you're not doing very well, in terms of mastering the content. The discrimination index and the point-biserial are closely related measures of how well an exam item identifies higher-ability versus lower-ability candidates overall. We use that parlance in testing terminology: we don't say smarter kids versus not-smart kids, we say higher-ability or lower-ability candidates, just to get that piece of terminology straight. The discrimination index and the point-biserial both correlate scores on the individual item with scores on the exam overall. In other words, if I'm a higher performer, if I'm in the top 27 percent of the entire class on the exam overall, and I tend to get this question correct, that creates a higher point-biserial or discrimination index for that item. If I'm in the lower 27 percent of the cohort, or the lower proportion of the cohort in general, and I tend to get this question incorrect, that again makes for a higher correlation. That means this particular item is discriminating: it is telling me, the person administering the exam, who in the class is tending to do better or worse overall. The small subtlety between the two measures is that the point-biserial compares against all test takers, so it compares performance on this exam item with the entire class or cohort on the test overall, while the discrimination index compares only the top 27 percent against the bottom 27 percent. Both have a range of negative one to positive one. A high correlation means that if I tend to do well on the exam overall, I will tend to get this question correct, and if I tend to do worse on the exam overall, if I have a lower overall score, I will also tend to have gotten this particular question incorrect. That makes a high point-biserial or a high discrimination index, meaning that the test item itself, test question number seven for example, does a good job of discriminating between higher-performing and lower-performing examinees. Anywhere above 0.2 is a good correlation, so when we look at the point-biserial or the discrimination index, depending on which testing system you use and what kind of statistics it gives you, we want to see this discriminatory ability be greater than 0.2. If it's less than 0.2, that means there's less than a 20 percent difference between the top-scoring and low-scoring groups within the class who responded to the item correctly; in other words, that particular item did not really tell us, or tell the examinees, very much about whether they were more or less on the ball with this content.
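If your testing platform doesn't report these statistics, both can be computed directly from the scored data. Here is a minimal Python sketch, assuming a 0/1 score per item and a total score per examinee; the 27 percent grouping follows the convention described above, and the function names and example data are illustrative assumptions.

```python
from statistics import mean, pstdev

def point_biserial(item_scores, total_scores):
    """Correlation between a single 0/1 item and examinees' total exam scores.
    item_scores[i] and total_scores[i] belong to the same examinee."""
    correct = [t for s, t in zip(item_scores, total_scores) if s == 1]
    incorrect = [t for s, t in zip(item_scores, total_scores) if s == 0]
    if not correct or not incorrect:
        return 0.0  # everyone (or no one) got it right: no discrimination possible
    sd = pstdev(total_scores)
    if sd == 0:
        return 0.0  # identical total scores: correlation undefined
    p = len(correct) / len(item_scores)
    q = 1 - p
    return (mean(correct) - mean(incorrect)) / sd * (p * q) ** 0.5

def discrimination_index(item_scores, total_scores, fraction=0.27):
    """Difference in proportion correct between the top and bottom scoring groups
    (the lecture's upper and lower 27 percent of total scorers)."""
    n = len(total_scores)
    k = max(1, int(round(n * fraction)))
    order = sorted(range(n), key=lambda i: total_scores[i])
    bottom, top = order[:k], order[-k:]
    p_top = sum(item_scores[i] for i in top) / k
    p_bottom = sum(item_scores[i] for i in bottom) / k
    return p_top - p_bottom

item = [1, 1, 1, 0, 0, 1]          # hypothetical 0/1 scores on one item
totals = [92, 88, 75, 60, 55, 80]  # hypothetical total exam scores
print(round(point_biserial(item, totals), 2))
print(round(discrimination_index(item, totals), 2))
```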
If it's between 0.1 and 0.2 when we evaluate our exam items, it's okay, but it could be improved upon, and we could increase the discriminatory ability of that test item by looking at the wording or at the choices; we'll talk about how to dig into that a little deeper in a bit. If it is less than 0.1, we probably need to revise the item, because it's really not doing much of anything for us. And if it becomes negative, if you see a negative point-biserial, that means the exam item is doing exactly the opposite of what you want it to do. You want it to tell you whether a person is a higher or lower performer; if it's negative, it means the people who scored lower overall tended to get the question right, and the people who scored very high overall tended to get the question wrong, which essentially defies logic. The interpretation behind that is that the item is maybe miskeyed, or confusing, or just plain incorrect, because the people who knew all of this content really well got the question wrong; it's inconsistent. So we can use the point-biserial to help us see how well the individual test item is performing.

The point-biserial and the discrimination index are lost at the extremes of the p-value. In other words, if 100 percent of students got a question right, that might not be a bad thing, it may even be a good thing, but there's no discriminatory ability; nothing about the performance on that item says, here's the group doing better on this exam and here's the group doing worse. Conversely, if everyone, or almost everyone, gets it wrong, it also provides little discrimination. Even if both the high-ability and the low-ability candidates got it wrong, all that really tells us is that the item isn't doing anything to help us stratify the cohort; it might just be really difficult and everyone is missing it.

Now, we can put those two measures together, the p-value as a surrogate for difficulty and the discrimination index. For example, if we see a very low p-value, say only 20 percent of people got test question number seven correct, there are two interpretations at face value: one, it was a good question but super hard; or two, it wasn't a good question at all, and a lot of people missed it because it was confusing or miskeyed or something like that. If we put the difficulty level and the discrimination index together, the p-value plus the discrimination index, we get a richer interpretation. If the two added together come to greater than 0.8, we'll say that's ideal; between 0.7 and 0.8 is decent; less than 0.7 is not so good. Let's walk through this step by step. I said that on a four-response multiple-choice question the ideal difficulty would be about 65 percent, in other words about 65 percent of the cohort getting it correct; that, again, is halfway between 25 and 100 percent with a little bit added on. The acceptable point-biserial, we said, would be 0.2 or higher.
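Purely as an illustration, here is a small helper that encodes the rule of thumb just described: negative discrimination is a red flag, a p-value plus discrimination of 0.8 or more is ideal, 0.7 to 0.8 is decent, and below 0.7 deserves a look. The thresholds come from the lecture; the function itself and its example inputs are assumptions.

```python
def item_quality_flag(p_value, discrimination):
    """Apply the lecture's rough interpretation of p-value plus discrimination."""
    if discrimination < 0:
        return "red flag: negative discrimination -- check for a miskey or confusing wording"
    combined = p_value + discrimination
    if combined >= 0.8:
        return "ideal"
    if combined >= 0.7:
        return "decent"
    return "review this item"

print(item_quality_flag(0.65, 0.33))   # ideal (sum 0.98)
print(item_quality_flag(0.94, -0.25))  # red flag
print(item_quality_flag(0.55, 0.15))   # decent (sum 0.70)
print(item_quality_flag(0.44, 0.20))   # review this item (sum 0.64)
```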
So, to walk through it: if 0.65 of the cohort got it correct and we add the 0.2 for the point-biserial, we get 0.85, and we'd say, hey, maybe a lower number of people got it correct, but it was very discriminatory; it had a high discrimination index telling us who's doing better and who's doing worse. When we put the difficulty and the discrimination index together we got 0.85, and we'd say that's really ideal, that's a really good question. Again, if you have a question that everyone is getting right or everyone is getting wrong, that's not very good. If you have a question with a very low p-value, say 20 percent of people got it right, and the point-biserial is negative, well, that puts us at a sum of around 0.1 or less. So it's not necessarily a bad thing that a lot of people miss a given question, as long as the point-biserial is a high number, in other words, as long as it is doing a good job of discriminating between higher- and lower-ability candidates.

All right, let's look at some exam items one by one; these are some real statistics I pulled from a random exam that I've administered. For number 17, for example, the p-value was 0.83: 83 percent of people got it correct, so not a very difficult question. Of the upper 27 percent of performers, the people who tended to do well on the exam overall, 100 percent got the question correct. Of the lower 27 percent, the bottom group of overall performers, only 67 percent got it correct. That gives this exam item a pretty good point-biserial: if you're a high performer overall, you have a high likelihood of getting it correct, and if you're a low performer, you have a much lower probability of getting it correct. So this exam item has a really good discrimination index of 0.33, and the point-biserial came out to 0.66, the difference between the two again coming down to whether the calculation looks only at the upper and lower groups or at the exam cohort overall. If you look at question number 18 here, the second one that's in red, 100 percent of people got it correct. Now, that's not necessarily a bad thing, but of the top performers in the class, they all got it correct, and of the bottom performers in the class, they all got it correct. So that doesn't necessarily mean it was a bad test question, because it might be a really important concept that I can feel really good that all the students knew. The one thing this question does not do for me is provide any discrimination. It does not tell me, or any of the individual students, who's doing better and who's doing less well in this content area. So it's not necessarily bad to have some questions that look like this, but if the entire exam looks like this, you might be spinning your wheels a little bit: administering a whole exam and not really getting a whole lot of feedback in terms of student performance. It makes everyone feel really good about things, though. So the point-biserial is zero. That doesn't mean it's a bad test question; it just means there's no discrimination in this item to say who's doing better and who's doing less well. If we look at number 22 there, the second box in green, the p-value is 63 percent. So now fewer people got this one correct. Is that because it was a confusing question, or was it just a lot more difficult than the other ones we looked at?
Well, if we look at the top 27 percent of performers in the class overall, all of those students got this question correct. If we look at the bottom 27 percent, the people who scored in the bottom 27 percent overall on the exam, none of that group got the question correct. That gives us a very high discrimination index; in fact, the discrimination index that looks just at those two groups is 1.0, which means this question is perfectly suited to tell me whether any given student who takes it is a top performer or a bottom performer. And you can see it also has a very high point-biserial of 0.65 even looking across all the scores, so this question is doing a very good job of telling me who's doing better and who's doing less well. If we look even more specifically at the individual response choices, 10 people chose choice C, which was the keyed answer, but the bottom performers all chose something other than that: some chose choice D, some chose choice B, and none of them chose A. I'm pointing this out because it goes back to a point that's really important about writing your test questions: the distractors should be plausible. It should not be that when people look at a multiple-choice item they say, well, it's got to be C, because A, B, and D don't make any sense at all or are clearly wrong. That really doesn't make it much of a choice; when we talk about multiple choice, we're not giving them much of a choice if we make it that obvious. So you really like to see this in a quality test question: you'd like to see that some people chose distractors other than the keyed answer, because that means you've written a really good test question where the other distractors look plausible but are clearly incorrect. And how do I know those are actually incorrect? Again, by looking at the point-biserial and the discrimination index: none of the lower performers got this correct, but all of the top performers did. If we look down at number 31, 94 percent of people got this correct. So you might say, wow, this might be another one of those great topics where it's clear-cut, very straightforward, an important topic, and everyone got it correct. But if I look at the discrimination index here, you see it's negative. The discrimination index is negative 0.25, and the point-biserial is likewise a negative number, because of the upper 27 percent of performers in the class, only 75 percent got this item correct, while of the lower performers overall, 100 percent got it correct. When you see a flip-flop like that, a negative point-biserial where the lower performers tend to get it correct and the really high performers tend to get it incorrect, that's a great little red flag to check whether the item might be miskeyed. Maybe I've got this keyed with B as the correct choice, but C might really be the correct choice; it might be a miskey, or it might be something confusing in the question itself. The other caveat I'll give you as you think about these statistics is to also think about your sample size, because as you see here, 15 people chose choice B and only one chose choice C, and that probably means the one person who chose C instead of B also happened to be one of the top 27 percent of performers on the exam overall.
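A simple way to automate this kind of distractor review is to tabulate which options the top and bottom scorers chose and to flag the miskey pattern described above, where low scorers choose the keyed answer more often than high scorers. The sketch below is illustrative only; the option labels, grouping fraction, and sample data are assumptions, not the lecture's actual items.

```python
from collections import Counter

def distractor_report(choices, total_scores, key, fraction=0.27):
    """choices[i] is the option ('A'..'E') selected by examinee i; key is the scored answer.
    Shows how the top and bottom scoring groups distributed across the options and flags
    a possible miskey when the bottom group picks the key more often than the top group."""
    n = len(choices)
    k = max(1, int(round(n * fraction)))
    order = sorted(range(n), key=lambda i: total_scores[i])
    groups = {"bottom": order[:k], "top": order[-k:]}
    for name, idx in groups.items():
        print(name, dict(Counter(choices[i] for i in idx)))
    key_rate = {name: sum(choices[i] == key for i in idx) / k for name, idx in groups.items()}
    if key_rate["bottom"] > key_rate["top"]:
        print("possible miskey or confusing stem: low scorers chose the key more often than high scorers")

# Hypothetical data: 15 examinees, item keyed 'B', one top scorer chose 'C'.
choices = list("BBBBBBBBBBBBBBC")
totals = [72, 75, 68, 80, 74, 66, 79, 71, 77, 70, 65, 69, 73, 76, 95]
distractor_report(choices, totals, key="B")
```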
So when you're looking at these statistics for one cohort or one group of people, if you've got a class size under 50, you do have to take them with a bit of a grain of salt. Keep that piece in mind as well. Here are some more representative examples of how we would look at test items. As we look at these statistics, we're looking for some benchmarks about performance, and I've got four different examples here that lead us down different paths. If we look at number one, the difficulty, the p-value, is 0.94: 94 percent of the students got it correct. Naturally that's going to show you less discrimination in general; if everyone is getting the item correct, it's not going to tell you very much about who is a higher and who is a lower performer. So, as expected, the discrimination index is 0.04 and the point-biserial is 0.003. That's what we anticipate: it's not going to be very discriminatory, and it's not going to tell us a whole lot about who's doing better or worse. In example two, the item has a p-value of 0.64, so 64 percent of people got it correct, and it has a discrimination index of 0.2. That's much more positive; it tells us this item is making a separation for us, with the higher performers tending to get it correct and the lower performers tending to get it incorrect. When we add the difficulty, the p-value, and the discrimination index, we get about 0.84. Remember, we said above 0.8 is really where we'd like to see that, so that's a great-performing exam item. In example three, we've got an even higher difficulty, because only 55 percent of people got it correct, but the discrimination index went down a little from example two, to only 0.15. When we add those together, the p-value plus the discrimination index, we get 0.7, or 70 percent: still in a good range, but not performing quite as well. So we can't just take it at face value; we can't just say 50 percent of people got this question right or wrong and therefore it is or isn't a good question. You really have to look at the bigger picture. And then in example four, we see that only 44 percent of people got it right, so you might say either it was a really difficult question or it was confusing and not a good question. The discrimination index here is 0.43, the highest we've seen out of these four examples, so this one is really showing a differentiation between the higher-performing and lower-performing students. When we add the p-value and the discrimination index, we get 0.87. So this question is doing a great job of helping us separate out the ability levels of our different candidates, and that gives great feedback to the candidates as well. Now, we can dig deeper and deeper into this, and I'm just going to show a couple of points without going too far into it. As you do this exam review, the other piece is this: if you see a point-biserial that's low, and you say, well, this question is not performing well, it's not being very effective in discriminating the higher-performing from the lower-performing students, what do you do?
The way you can dig deeper into that is, first of all, just at face value, to look at the stem and apply those conventions we talked about. Is it straightforward? Does it lead the student to exactly what you want them to answer? Is it devoid of extraneous information that could be confusing or could introduce construct-irrelevant variance? The other piece is that we can look at the individual response choices. You might find, for example, that you keyed B as the correct choice, but a lot of the higher performers are choosing D; and when you take a look at that and question why, you might find, oh, there's actually a textbook that says D is the correct answer for this particular concept. If we look, for example, at number five in my example here, and again, this is a real example taken from an exam I administered, it has pretty good stats: the p-value is 0.86 and the discrimination index is around 0.5. You can delve deeper into the choices and see that A and E are not very discriminatory, because everyone chose them, whether they were high or low performers: both had 100 percent selection. Everyone in the class chose A and E, but A, C, and E were the three choices that had to be selected for this question. When you look at choice C, only 85 percent selected it, and that particular choice had a point-biserial of 0.53. In other words, out of the five choices people had to choose from, it was really only choice C that was making the difference: the higher-performing people knew it was a correct answer, and the lower-performing people tended not to think so. So choice C in itself, as a standalone response, had a high point-biserial, because it was really the one discriminatory piece differentiating the ability levels of the candidates as they answered this question. Some other things you can look at: you can look at response time, in the far right column there. The average response time for a typical four-response multiple-choice question is about one minute; since this was a multiple-correct-response item, it took about a minute thirty on average, which is not bad. If a test question is not performing very well, look at the average response time; if it's taking a long time, it might mean the stem is too wordy or there's something inherently confusing about it, with students spending a lot of time batting back and forth between two choices that both might have some element of correctness to them. You can also evaluate drift. If you look at this example, there's a performance history: if you look at the p-value for the current cohort, the current 21 test takers, and then look at the performance history in terms of the percentage that usually got it correct, you can see whether the test question is becoming more or less difficult. Sometimes questions become less difficult because people hear about them; they get too much exposure, and people talk about, oh yeah, there's a question they always ask about this particular scenario. Sometimes they get more difficult because it's something we're maybe not teaching as well. And sometimes, very importantly, the science changes: maybe what was the right thing to do at the time you wrote the test question is different now that we're five years down the road, there's something new out in the literature, and so students are getting it wrong.
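One way to make the drift check routine is to compare the current cohort's p-value against the item's historical values and flag a large shift in either direction. This is only a sketch; the 0.15 threshold and the sample numbers are assumptions, not values from the lecture.

```python
def drift_check(current_p, historical_p, threshold=0.15):
    """Compare the current cohort's proportion correct with the average of
    previous administrations and flag a large shift in either direction."""
    baseline = sum(historical_p) / len(historical_p)
    delta = current_p - baseline
    if delta > threshold:
        return f"easier than usual (+{delta:.2f}) -- possible overexposure"
    if delta < -threshold:
        return f"harder than usual ({delta:.2f}) -- check teaching coverage or whether the science has changed"
    return f"stable (change of {delta:+.2f})"

# Hypothetical item history: this question used to run around 80 percent correct.
print(drift_check(current_p=0.62, historical_p=[0.81, 0.79, 0.83]))
```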
So it's good to be able to look back at these longitudinal statistics as well, as you're trying to delve into the test questions during this review, to see what's performing well and what's not. There are ways to slice and dice and dig deeper into the questions themselves to see why, for example, a question might not be doing well. Now, let's say we do this review and find that an exam item is not doing very well. What do we do with it? We go through our exam and say, hey, these are the ones with low performance and low point-biserials; they're not being very discriminatory, and maybe they're actually wrong in some way or confusing in some way. Those are the things to look at: the stem, the item content itself, the validity of the content, and the curriculum. Did we not teach this? Did we miss this piece? Was it not taught in the correct way, or not given the right amount of play in the classroom? And then the second question is, what do we do with it? For those questions we feel are just not very valid, we say, well, a lot of people missed this, but it was the higher performers who were missing it, so it had a low point-biserial; it was not really performing well as a test question. What do we do with it? Well, you can drop the question. That's obviously the simple thing: just take it out. One way this can hurt students, especially on an exam with a lower number of questions overall, is that it adds to the weight of every other question, and most importantly in the student's eyes, it adds weight to the questions they got wrong. So instead of that, you might think about accepting all answers: you just give everyone credit for that exam item, and that does not change the weighting of all the other test questions the way dropping a question does. The other thing you can do is assign bonus status to it. Sometimes I'll find a test question with a very low p-value, maybe only 20 percent of students getting it correct, but the point-biserial is really high: it's a good, discriminatory question; it's valid; it's not confusing; it's not keyed incorrectly; it's just a really hard concept, and not many people got it right. If it's not something I think is really key or should have been known, maybe something a little more obscure, you can make it a bonus question. That still gives credit to the students who got it correct, that small handful of people who actually knew the concept, and it does not penalize the folks who did not get it correct. So these are some different things you can do. If it's a negative point-biserial, that's a good reason to drop the question or just give everyone credit for it, again, if you don't want to trigger a re-weighting or recalculation of the values of all the other items on the exam.
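To see how the three options differ in practice, here is an illustrative scoring sketch contrasting dropping an item, accepting all answers, and treating the item as a bonus. This is a simplified percentage-based grading model, an assumption for illustration, not any particular testing system's behavior.

```python
def percent_score(scores, drop=None, accept_all=None, bonus=None):
    """scores: one examinee's list of 0/1 item scores.
    drop: item indices removed from both numerator and denominator (re-weights the rest).
    accept_all: item indices counted as correct for everyone.
    bonus: item indices that add to the numerator when correct but never to the denominator."""
    drop, accept_all, bonus = set(drop or []), set(accept_all or []), set(bonus or [])
    earned = denominator = 0
    for i, s in enumerate(scores):
        if i in drop:
            continue
        if i in bonus:
            earned += s            # credit if correct, no penalty if not
            continue
        denominator += 1
        earned += 1 if i in accept_all else s
    return 100 * earned / denominator

student = [1, 0, 1, 1, 0]   # missed items 1 and 4; item 4 is the flagged item
print(percent_score(student))                 # 60.0 -- scored as written
print(percent_score(student, drop=[4]))       # 75.0 -- item removed, remaining items weigh more
print(percent_score(student, accept_all=[4])) # 80.0 -- everyone receives credit for item 4
print(percent_score(student, bonus=[4]))      # 75.0 -- no penalty; a correct answer would have added a point
```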
You might also find that there's an incorrect distractor: maybe one of the distractors was leading people to choose it and it was wrong. Also think about revision, and revise the item for next time, so you can move forward and not have the same issues from year to year. And then, and I hope I'm giving you a sense of the work it takes to cultivate good test items, think about how you release the scores and how you communicate that feedback to the students. You might ask: do I release the overall exam stats? Some of these systems will send the results to the students and say the class average was 75 with a range of 65 to 100. That can be good or bad. A student who got a 78 might feel terrible about it until they see that the class average was 75. But if it says the range was 65 to 100, and that particular student has the 65, they also now know, wow, I'm the lowest performer in the class. So give some thought to that: what's the best approach for your cohort, and what do you think will give them the feedback they need without causing undue stress or anxiety? Then, and this is a really important consideration, think about how you give people feedback on the exam itself. Do you give them the entire exam back? When I was a new educator, everything was pencil and paper; they took the exam, we graded it, and we handed it back for them to see, before we had anything on computers. Students obviously enjoy that, because they get to see exactly what they did and did not do, but there are a couple of problems with it. One is that it sometimes gets students too focused on one individual question, on why they got that one question right or wrong, as opposed to recognizing the bigger picture: I'm not doing great with peds, or I'm not doing great with calculations, or something like that. The other big problem with giving the entire question bank back to the students is that you get exposure of the items. Once again, people will chitchat about the items: oh my gosh, you got that question wrong? I got that question wrong too. Hey, remember that question about the peds and the sux and the three-year-old kid? Before you know it, that becomes natural conversation, and it drifts to other students and other cohorts. Then all the work you put into tweaking, revising, fixing, and evaluating these test items is undone: the point-biserial gets very low, because everyone starts to get the item right once it's had too much exposure. One of the ways you can get around that, especially with the testing systems most of us are using now, is that they often give you opportunities to provide different kinds of feedback. If you start correctly, with an exam blueprint and thought about what kinds of questions need to go on the exam, you will also have tagged your exam items to the different category levels they belong to, and one of the options you then have is to give the feedback to the individual.
In this case, student John Doe gets this feedback on their gas machine exam, and they can look at it and say, well, circle system and endotracheal tube, I got both of those questions right, I got 100 percent of those. Gas calculations, I'm doing pretty well with the math, I got all of those right. Wow, what did I do here with anesthesia machine systems? Oh, not so good. That tells the student it's not just about the fact that they didn't know the inspiratory valve does this or that; it's more general guidance you're giving them: here's the area where you need to spend some more time. I did terribly with troubleshooting; look, I got zero questions right. Now, granted, it's zero out of two, but still, I didn't get either of those questions right, so maybe that's the area where I need to spend more time, because it's not about that one valve; when they're in the operating room, some other problem is going to come up. I feel like it's really important to give them that big picture of where they're doing well and where they're not doing so well. So if you tag the questions as you're building them, and it's better to do this on the front end than to go back and work through your whole question bank later, it will, number one, give you great feedback that you can pass on to the students, and it's also a really useful thing for curriculum evaluation. Here's another version of that: this one is not by individual student, it's based on the exam overall. It says that on this exam there are 14 questions that fall into this particular curriculum category; out of the NCE categories, there are nine questions under basic principles and one question under basic sciences. And I can look at the big picture: out of the 270 people who took these questions, how many tended to get them correct and how many tended to get them incorrect. This KR-20, the Kuder-Richardson 20, is a reliability measure that runs from zero to one, with one indicating perfect reliability; it tells you how consistently the exam items are performing, and tracked from cohort to cohort or year to year, it's another good way to see whether an exam item is drifting: becoming easier because of exposure, becoming more difficult because the science is changing, or becoming more difficult because the content got squeezed out of the curriculum. So there are a lot of different ways and a lot of different levels to look at this, from the individual item, all the way down to the individual responses within the item, back up to the really big picture of using these statistics to see how well our curriculum is performing, judged by how well students perform on these items from one exam to the next. The reliability index is another related measure; the KR-20, that Kuder-Richardson statistic, basically tells you how consistently people tend to get things correct or incorrect from cohort to cohort.
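For completeness, here is a sketch of the KR-20 computation itself on a small made-up score matrix; in practice your testing platform reports this for you, and the data here are purely illustrative.

```python
def kr20(scores):
    """Kuder-Richardson Formula 20: internal-consistency reliability of a
    dichotomously scored exam. scores: rows = examinees, columns = 0/1 items.
    Assumes the total scores vary (variance is not zero)."""
    n_items = len(scores[0])
    totals = [sum(row) for row in scores]
    mean_total = sum(totals) / len(totals)
    var_total = sum((t - mean_total) ** 2 for t in totals) / len(totals)
    pq_sum = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in scores) / len(scores)  # item p-value
        pq_sum += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - pq_sum / var_total)

scores = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
]
print(round(kr20(scores), 2))  # about 0.67 for this tiny illustrative data set
```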
You can also look at things like a histogram of how the students performed overall. It's nice to see a normal curve there, because that tells me a lot of students are performing adequately, some are getting really high scores, and some are getting really low scores. So the exam is doing a good job of being able to tell students: you're performing up here, you're performing down here, you're performing right in the middle. If everyone is getting 100 percent, that's nice in some ways and makes everyone feel good, but it also doesn't give you a whole lot of feedback in terms of how your class is doing and who needs more or less help. So there are lots and lots of ways these things can get sliced and diced when you get the output from your testing system, but again, they really rely on you putting in a little extra work on the front end, as you're building the exams, to tag your questions with their content areas, because that can give you a lot of great feedback, and can give the students a lot of great feedback, as you're trying to give them the ability to evaluate their exam without overexposing your exam items. So as you think about what we covered on building the exam items themselves, recognize the effort it takes to create good items, anticipate that there will be errors, and anticipate that there will be drift inherent in your exam items. It's not enough to say, I've written a question and I'm going to assume it will do a perfect job of differentiating high- and low-performing students; and you can't assume it's going to be just as good this year as it was the year before and the year before that. We need to anticipate that it's going to take some effort to cultivate a quality exam pool. It takes review of the questions: look at these statistics. It can be a really quick review; it doesn't take a lot of time to glance over the p-values and the point-biserials or the discrimination indices and see which items are performing well and which are not. In the end, the whole purpose of giving the exam is to give the students feedback, and yourself feedback, on how well the students are mastering this material. If the exam items and the exam itself are not doing that, you're going through a lot of effort and creating a lot of bad, incorrect feedback for those students and for yourself. So you want to review the exam's performance to improve the effectiveness of the exam in doing what you want it to do, and of course to improve fairness to the students, so that a 90 percent really reflects 90 percent achievement on this given content. That's a very quick overview, but it really gets to the nuts and bolts of the important pieces of doing a quality assurance review of your exams. I hope that's useful and helpful to you, and I really appreciate that you've been participating in this lecture series. I hope it's been useful to you as an educator.
Video Summary
The final installment in the series on writing and administering effective examinations focuses on providing quality assurance for exams. The main objective is to apply psychometric principles to evaluate the validity of exam items and overall exam quality. The speaker, a professional educator, shares insights on the importance of evaluating exam performance and feedback to improve teaching strategies. The discussion covers measuring difficulty and discrimination indices of exam items, statistical measurements, reliability, and ways to analyze student performance data. The importance of tagging questions with content areas for better feedback and curriculum evaluation is highlighted. Methods to assess exam items, identify errors, and enhance exam quality are suggested, including dropping or adjusting poorly performing questions, giving bonus status, and providing constructive feedback to students. The process of reviewing and improving exam items is crucial for fair and accurate evaluation of student knowledge and performance.
Keywords
quality assurance
psychometric principles
exam validity
performance evaluation
difficulty indices
curriculum evaluation
exam quality improvement
student feedback