Saturday, March 12, 2005

(R) Tabulation 101 for Judges

In my view, the judging process can be sub-divided into two sections:
1. Scoring - done by Judges (Music, Dance, Theatre)
2. Tabulation – done by Award Organiser for all Judging categories

Scoring

Objectivity (30%)
§ Scoring methods differ from category to category (Music, Dance, Theatre) because of the nature of each category

§ Throughout the year, Judges attend performances and enter scores into Score Sheets. Each category has a different Score Sheet. These were developed after much debate on basic principles, parameters, criteria, exceptions, etc. The Score Sheet is also constantly improved and updated to reflect current trends or to correct oversights

§ Judges also adhere to guidelines provided by Award Organiser – Judging Processes and Judging Criteria – which are updated year to year

§ A minimum of 5 judges must attend each performance to ensure that all shows are fairly represented

Subjectivity (70%)
§ The Judges come from a variety of music/dance/theatre-related backgrounds

§ Each Judge has the subjective and creative right to assign a score that he/she deems fit. They also have the right to PMS, bad hair days, and other forms of emotional imbalance due to social/political/economic impact in their personal lives.

§ At the end of the Award Year, all Judges submit the scores of all the shows they have judged to the Award Organiser

Tabulation

Objectivity (90%)
§ The Award Organiser collects scores from all judges from each category and tabulates the scores with one standard methodology

§ The calculation - for each show, the Award Organiser adds up the scores given by all the judges and divides the total by the number of judges who judged the show. This average (the mean) is the final score of each show

§ The top 5 scorers for each Award Category (Best Dancer, Best Script, etc) will make it to the nomination list
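
The calculation described above (sum of scores divided by number of judges, then take the top scorers) can be sketched roughly as follows. The show names and scores are hypothetical, the function names are assumptions for illustration, and the shortlist is cut to 2 instead of 5 only to keep the example small.

```python
from statistics import mean


def final_scores(scores_by_show):
    """Average each show's scores: sum divided by number of judges."""
    return {show: mean(scores) for show, scores in scores_by_show.items()}


def top_n(final, n=5):
    """Return the n highest-scoring shows, best first."""
    return sorted(final, key=final.get, reverse=True)[:n]


# Hypothetical score sheets: show name -> scores from the judges who attended.
sheets = {
    "Show A": [28, 26, 24, 27, 25],
    "Show B": [10, 9, 12, 14, 15],
    "Show C": [22, 27, 25, 26, 24],
}

final = final_scores(sheets)
shortlist = top_n(final, n=2)
print(final)
print(shortlist)
```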

Subjectivity (10%)
§ The Award Organiser reveals the nomination list to all Judges over a half-day sitting (by category)

§ Award Organiser and Judges debate the grey areas in the nomination list. The nomination list is finalised once both parties come to an agreement


Commonly Asked Tabulation Questions by Judges

1. What happens when there are not enough Judges to judge a performance?

Let’s revisit the minimum number (5) of Judges criterion:

As earlier stated, there should be a minimum number of Judges judging a show for fair representation.

For the mathematical rationale, please refer to the Law of Large Numbers. This law states that the larger a random sample is, the more closely its average approximates the true population average and the smaller the error (and, thus, the narrower the confidence interval).

In relation to scoring, let’s say in a competition there are:

100 judges and scores
You can be quite sure that the final score is significant and represents the views of the population

10 judges and scores
The scores represent the views of a percentage of the population, so you can’t really be sure that these views represent the entire population. However, in the absence of large numbers, this would suffice

1 judge and score
Whether the scores are high, low or medium, you really cannot be sure at all that this score represents the views of the population. One or two persons should not dictate the views of the population

No matter how the judges score (high, low or medium), the number of judges scoring affects the credibility and the accuracy of the final score when tabulated.
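
One way to put numbers on this is the standard error of the mean, which shrinks with the square root of the number of judges. The sketch below is illustrative only: it assumes the judges score independently, and the spread of 5 points per judge is a made-up figure, not something measured from real Score Sheets.

```python
import math


def standard_error(score_sd, n_judges):
    """Standard error of the mean score, assuming independent judges."""
    return score_sd / math.sqrt(n_judges)


ASSUMED_SD = 5.0  # hypothetical spread of individual judges' scores

for n in (1, 5, 10, 100):
    print(n, "judges ->", round(standard_error(ASSUMED_SD, n), 2))
```

Under that assumed spread, the error is 5.0 points with one judge and 0.5 points with a hundred, so adding judges always helps but with diminishing returns.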


2. What happens when there are extreme scores?

Each Judge has the subjective and creative right to score as they please. Hence, the argument on how they score has no basis.

However, this question can be answered objectively. The Award Organiser currently tabulates the scores using the Mean only.

There are two common methods of finding the average of scores – Mean and Median.

Mean – the sum of the scores divided by the number of scores. This is used when scores are centrally distributed (scores that are more or less similar).

Median – the middle score when the scores are sorted in order. This method is used when scores are highly skewed positively/negatively (extreme scores).

Judges            1    2    3    4    5    Mean   Median
Extreme score    28    -    -    -    -    28     28
Extreme score    10    -    -    -    -    10     10
Extreme score    28   26   24   27   11    23.2   26.0
Extreme score    10    9   12   14   29    14.8   12.0
Similar scores   28   26   24   27   25    26.0   26.0
Similar scores   10    9   12   14   15    12.0   12.0

Let’s look at the simulation above. When there is only one Judge, there is no choice but to take the score that the Judge gave. The Mean and the Median are both simply that one score, so the result has little credibility (see the Law of Large Numbers in question 1).

Now look at the minimum of 5 Judges with one Judge scoring extremely low (11). Using the Mean, the final score is pulled down to 23.2, which is 2.8 points below the Median of 26, even though the other 4 Judges feel that the performance deserves more merit. Using the Median gives an average that reflects the view of the majority of the Judges.

Likewise with the one Judge who gives an extremely high score (29). Using the Mean, the final score is pulled up to 14.8, which is 2.8 points above the Median of 12, even though the other 4 Judges feel that the performance does NOT deserve the merit. Again, the Median gives an average that reflects the view of the majority of the Judges.

Look at the last two samples, where the scores are similar (all high or all low) and the Judges agree that a particular show is good or bad. The Median there is the same as in the corresponding Extreme score rows (26 at the high end, 12 at the low end), so it reflects the views of the majority in both the Extreme scores and the Similar scores models.

Recommendation: Use Median to obtain the average score, not Mean
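
For anyone who wants to check the simulation, here is a minimal sketch using Python's standard `statistics` module on the four five-Judge rows from the table. The row labels are my own shorthand, not part of the original table.

```python
from statistics import mean, median

# The four five-Judge rows from the simulation table above.
rows = {
    "one low outlier": [28, 26, 24, 27, 11],
    "one high outlier": [10, 9, 12, 14, 29],
    "similar, high": [28, 26, 24, 27, 25],
    "similar, low": [10, 9, 12, 14, 15],
}

# For each row, compute both kinds of average side by side.
summary = {label: (mean(s), median(s)) for label, s in rows.items()}

for label, (m, md) in summary.items():
    print(f"{label}: mean={m:.1f} median={md:.1f}")
```

The single outlier moves the Mean by 2.8 points in each direction, while the Median stays at 26 and 12.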

3. How come this undeserving performance/individual made it to the nomination list? I did not score him well!

The answer to question 2 answers this somewhat – tabulating the scores by using Mean pulls up or pulls down the true average score resulting in some surprises in the nomination list.

The following explanation brings up another problem:

Assuming that we have calculated all the scores for each performance, we move on to rank the scores. The Average is obtained by using Median.

Judges           1    2    3    4    5    6    7    8    9   10   Average
Performance 1   22   27   25   26   24    -    -    -    -    -   25
Performance 2   22   26   24   26   24   21   25    -    -    -   24
Performance 3   22   26   24   25   23   21   25   23   26   21   23.5
Performance 4   22   26   24   25   23   21   27   25    -    -   24.5

Nomination List   Scores   No. of Judges
Performance 1     25       5
Performance 4     24.5     8
Performance 2     24       7
Performance 3     23.5     10

The scores are ranked from the highest to the lowest.
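
The averages and the ranking above can be reproduced with a short sketch. The scores are the ones in the table; the dictionary layout and variable names are just assumptions for illustration.

```python
from statistics import median

# Scores per performance, one entry per Judge who scored it.
scores = {
    "Performance 1": [22, 27, 25, 26, 24],
    "Performance 2": [22, 26, 24, 26, 24, 21, 25],
    "Performance 3": [22, 26, 24, 25, 23, 21, 25, 23, 26, 21],
    "Performance 4": [22, 26, 24, 25, 23, 21, 27, 25],
}

# Median of each performance, then rank from highest to lowest.
medians = {name: median(s) for name, s in scores.items()}
ranking = sorted(medians, key=medians.get, reverse=True)

for name in ranking:
    print(name, medians[name], "from", len(scores[name]), "Judges")
```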


Now, let’s look at all the scores from an Algebra perspective (or "Pecahan", i.e. fractions, rather?), with each score placed over its number of Judges (25/5, 24.5/8, 24/7 and 23.5/10):

Performance         1        4        2        3

Scores             25     24.5       24     23.5
No. of Judges       5        8        7       10
                Apple   Orange   Durian   Banana

How can you tell if the actual value of 25/5 is larger or smaller than 24/7, or whether any one of these scores is larger or smaller than another? It would be like comparing an apple with an orange with a durian with a banana.

First, you need to normalise the base number (compare apples with apples). In this case, the base number is 280, the lowest common multiple of 5, 8, 7 and 10.

Performance                    1         4         2         3
Scores (face value)         25/5    24.5/8      24/7   23.5/10
                           Apple    Orange    Durian    Banana

Scores (actual value)   1400/280   858/280   960/280   658/280
                           Apple     Apple     Apple     Apple
                        =      5      3.06      3.43      2.35

Actual Nomination List

Performance                1      2      4      3
Scores (actual value)      5      3.43   3.06   2.35
Scores (face value)        25     24     24.5   23.5


Based on this calculation, it is clear that the scores at face value (how the judges score) do not reflect the actual value (which considers both how the judges score AND the number of judges scoring).

See how the rank of the performances shifts (Performance 2 is now higher than Performance 4) after the scores are properly tabulated. The shifts can vary from one rank up/down to several ranks up/down. Some good shows are eliminated from the nominee list, and some bad shows make it in, simply because the ranking is done by face value.
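
The normalisation above can be checked with exact fractions. This is only a sketch of the arithmetic described in this post; the `lcm` helper and the variable names are assumptions for illustration.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def lcm(a, b):
    """Lowest common multiple of two positive integers."""
    return a * b // gcd(a, b)


# Performance -> (median score at face value, number of Judges).
face_value = {
    "Performance 1": (Fraction(25), 5),
    "Performance 2": (Fraction(24), 7),
    "Performance 3": (Fraction("23.5"), 10),
    "Performance 4": (Fraction("24.5"), 8),
}

# Common base number for all the fractions.
base = reduce(lcm, (n for _, n in face_value.values()))

# Actual value = score divided by number of Judges.
actual = {p: s / n for p, (s, n) in face_value.items()}

# Numerators over the common base (Performance 4 is exactly 857.5,
# which the table above rounds to 858).
numerators = {p: v * base for p, v in actual.items()}

ranking = sorted(actual, key=actual.get, reverse=True)

print("base:", base)
for p in ranking:
    print(p, f"{float(numerators[p])}/{base} = {float(actual[p]):.2f}")
```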


4. How about the issue of live judging and video judging?

Some Judges feel that the ‘live’ experience can influence the impact of the performance and hence the scores, while others feel that watching the performance on video enables Judges to replay moments they missed while watching it live. This issue also came about because some Judges are not able to attend the performances yet still want to judge.

The first way to resolve this issue is to come to an agreement that Judges should exercise professionalism in judging regardless of whether a performance is judged live or on video.

The above is a subjective exercise.

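Objectively, this can be resolved by computing the mean and standard deviation of each set of scores, those obtained by live judging (x-scores) and those obtained by video judging (y-scores), and then converting the x and y scores into standard scores (z-scores). A z-score is a score minus its group's mean, divided by that group's standard deviation.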

z-scores allow us to compare scores obtained from totally different methods; they allow for comparisons of “apples and oranges”.
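
A minimal sketch of the z-score idea follows. The live and video scores are hypothetical, and using the population standard deviation (`pstdev`) rather than the sample one is an assumption; either could be argued for.

```python
from statistics import mean, pstdev


def z_scores(scores):
    """Standardise scores: subtract the group mean, divide by the group SD.

    Assumes the scores are not all identical (the SD must be non-zero).
    """
    mu = mean(scores)
    sigma = pstdev(scores)
    return [(s - mu) / sigma for s in scores]


# Hypothetical data: the same five shows, scored live and scored on video.
live = [22, 24, 26, 28, 30]   # x-scores
video = [12, 14, 16, 18, 20]  # y-scores, lower across the board

z_live = z_scores(live)
z_video = z_scores(video)

for zx, zy in zip(z_live, z_video):
    print(round(zx, 2), round(zy, 2))
```

In this made-up example the video scores are all 10 points lower, yet each show gets the same z-score either way, because z-scores only measure where a score sits relative to its own group.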

(References:
A big thank you to
· Dr Teoh Hsien-Jin (bet you didn’t know that he’s Meng Jin’s brother!)
· Dr Scott (US)
· and the anonymous fresh graduate of a genius Electrical Engineer

… for vetting through and affirming the simulation/model of scores that I’ve created in my bid to offer one way of tabulating scores. There are of course many other ways to tabulate. Please feel free to offer your knowledge and share your model.

· Mass Media Research: An Introduction
· Investigating Communication: An Introduction to Research Methods)




1 comment:

Buaya69 said...

wow! *buaya takes notes to curi this kau-kau scoring system* ;)