If you’re like me or anybody else who travels frequently (or perhaps just lives in a particularly large city), you’ve probably become familiar with yelp, foursquare, or any of a number of different sites and apps designed to help you find somewhere to eat or other goods and services in the area around you. You’ve probably also become quite familiar with the standard sorting options they all provide: Distance, Rating, Cost, and Popularity.
You’ve probably also noticed certain limitations to these options. The closest places might be particularly low on the list in other areas. Maybe you’re looking for a balance of cheap and highly rated. As you read this, you are probably already thinking to just average the numbers, and that would be a pretty good idea.
There is a bit of an issue with this though, namely scale. An average (also known more specifically as an Arithmetic Mean) is the simple sum of a set of numbers divided by the count of that set. This is the mean we’re most familiar with because we’re usually trying to figure out what a typical value would be for a specific measurement, like an average grade, average employee salary, or average product listing. All of these sets of numbers have a similar scale: grades are 0-4 points, salaries are in the thousands of dollars, and product ratings are 1-5 stars. This makes numbers within the set meaningful in comparison to each other, and so the average would have meaning. A difference of 1 or 2 is huge for grades and ratings, but would just get lost in the rounding for a salary.
Let’s look again at the measure we’re trying to sort: Distance is going to be a number range from 0.1 to possibly over 100, depending on the particular app. Rating will probably be a 1-5 score or a percentage, cost could be either 1-4 dollar signs or an actual average price (tens of dollars), and popularity could go anywhere from 0 to thousands. If you just did the arithmetic mean, the distance and popularity would dominate the result and skew the rankings. The correct mean to use for this sort of comparison is the Geometric Mean. Geometric Mean is the second of the Pythagorean Means (after Arithmetic and before Harmonic), and is calculated by multiplying the numbers together and taking the count’s root of the result. So the result would look something like this:
The nice thing about this mean is that changes in the smaller measures have just as dramatic an effect on the result as changing the big ones. So now when you’re comparing all those measures, a particularly high combination of measures will result in a particularly high ranking. You can also weight the measure by raising them to a specific power and changing the root from the count to the sum of the weights. Since the calculation is straightforward, it can be implemented easily in SQL, MDX, actual code, or even excel functions. It also has an interesting relationship to the arithmetic mean: The closer all of the numbers are to each other, the closer the two means will be to each other.
A while back I was working on a project that needed a self contained demo to show to clients that would run smoothly on a laptop without a network connection. Since this meant the amount of data couldn’t be too large, we had to limit the providers to just the biggest ones. The problem was defining the biggest ones. The main indicators of size were the total cost of procedures on the claims (usually hundreds of thousands of dollars), the number of procedures claimed (possibly in the thousands), and the number of patients (possibly in the hundreds). Since all of these numbers were equally indicative of size, I had the SQL query to load the demo sort based on the cube root of these 3 columns multiplied together and cut off after the top ones were loaded. The geometric mean was a simple solution to a normally unsolvable issue in SQL.
There are certain issues with the geometric mean. The biggest one is that it only works with positive numbers. This can be worked around simply by shifting the values into the non-negative range (a form of Normalization) by adding a constant to the given measure before the multiplication.
Another issue is that the numbers compare in all the same direction. Going back to restaurant measures, while you would probably want the highest rating and popularity, you would want the lowest distance and cost. The most basic would be to separate the measures into two different means and sort by them separately. The problem here is that it requires one of the means to take priority over the other, the exact problem the means were being used to solve in the first place. The second solution would be to subtract the measure from a constant, basically reversing its value. This is probably the best possible solution, the issue with it being that sometimes there is no obvious upper limit to subtract from (if possible, the maximum of that measure would work best). The last solution would be to use the reciprocal of the numbers to be minimized, which would work fine as long as there are no zeros (or numbers very close to it).
This leads to another issue with the geometric mean, namely that it is asymptotic around 0. If a measure hits 0, no matter how high the other measures in the calculation are, they will never show anything other than 0. Even if numbers happen to be particularly small (close to 0) it could cause trouble with getting a good ordering. A possible workaround would be to add 1 to any measure that gets below 1. So combining this with the above ideas, Normalize all values so that they are at least 1.
This does change the outcomes of the numbers, however. Fortunately, the last issue offsets that problem: Geometric Mean is a weird number. You can sort on it, and you can put it on a graph, but by itself it doesn’t mean anything to a person. You can’t even put a meaningful unit label on the number. To most end users of an analytics system, it looks like an arbitrary number. So if you normalize and shift numbers a bit to make things sort in the right direction and avoid numbers less than 1, it won’t make much difference if it’s not the exact value it needs to be, as long as you do such transformations consistently.
So in short, the next time you need to choose between different measures to sort on, try the geometric mean. You might not have to choose after all.