You’ve bought an assessment. In our experience, one of the first questions client organizations ask after launch is a reasonable one:
Is the assessment working?
Evaluating the effectiveness of your assessment and its ROI in hiring seems like a straightforward proposition: simply compare candidates’ assessment scores to their subsequent job performance.
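If you already have matched data in hand, the comparison itself really is that simple. Here is a minimal sketch in Python, assuming a hypothetical file matched_sample.csv with one row per hire containing an assessment score and a later performance metric (the file and column names are illustrative, not a real schema):

```python
import pandas as pd
from scipy import stats

# Hypothetical matched sample: one row per hire.
df = pd.read_csv("matched_sample.csv")

# The validity coefficient: how strongly assessment scores
# relate to later job performance.
r, p_value = stats.pearsonr(df["assessment_score"], df["performance_metric"])
print(f"Validity coefficient r = {r:.2f} (p = {p_value:.3f})")
```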
However, one of the most significant challenges here is obtaining the job performance data in the first place. It’s often so difficult that many companies don’t even try. But you should, because holding your vendor empirically accountable for the predictive quality of their tools is how you’re going to get the best prediction of job performance.
Within IO Psychology this challenge is broadly referred to as “the criterion problem.” More practically, it’s an unfortunate reality that for many companies, there really isn’t great performance data available.
This isn’t to say a company doesn’t know whether an employee was a good hire. It’s just that at many companies, this performance information is only stored in the brains of supervisors and coworkers.
However, understanding how an employee is performing and maintaining an objective data file that can be used in a validation study are two very different things.
When it comes time to do a validation, we always ask what metrics may be available. And often a company sees certain data, calls it a metric, and then decides that this is what they want to predict with the assessment.
And sometimes this works, but a number of challenges and issues can arise that limit how viable these data points are for evaluating an assessment.
Below are some of the most common that come up in our work.
An organization’s understanding of a specific employee’s performance is often limited to a simple meets-or-exceeds supervisor rating—ratings that serve mainly to flag the extremes: who should be promoted, who could be fired, and little else.
However, they aren’t going to be particularly useful as a criterion in a validation study. It’s common to be handed a metric where over 80% of the sample earns a 3 on a 1-5 performance appraisal. In some cases, we’ve seen ratings where over 95% of the sample have the same score.
If there is no variability in performance ratings, they cannot be used to evaluate an assessment, because there is no variance in the data for the assessment to predict.
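To see why this matters, here is a small simulation (illustrative numbers only, not client data) of what happens to an assessment’s observed validity when nearly everyone is collapsed into the same rating:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# Assessment scores and the continuous "true performance" they partly predict.
assessment = rng.normal(size=n)
true_performance = 0.5 * assessment + rng.normal(size=n)

# A supervisor rating that collapses ~95% of people into the middle
# category, flagging only the extremes.
rating = np.full(n, 3.0)
rating[true_performance > np.quantile(true_performance, 0.97)] = 4  # "exceeds"
rating[true_performance < np.quantile(true_performance, 0.02)] = 2  # "below"

# The assessment works, but the collapsed rating hides most of that signal.
print("r with true performance:", round(np.corrcoef(assessment, true_performance)[0, 1], 2))
print("r with collapsed rating:", round(np.corrcoef(assessment, rating)[0, 1], 2))
```

The same assessment that correlates strongly with true performance shows a much weaker correlation with the near-constant rating—not because the tool stopped working, but because the criterion carries almost no information.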
Call centers are often a treasure trove of data—average handle time, calls per hour, etc.
Many of these metrics are helpful. However, we’ve often seen that average handle time can negatively correlate with a quality metric.
That is, faster isn’t always better.
If you have several performance metrics, it’s essential to look at how they relate to each other, as shown in the sketch below. It’s difficult to build a single predictor of performance when scoring well on one indicator negatively relates to another.
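A quick first step is to inspect the correlation matrix of your metrics before committing to one. The file and column names here are hypothetical:

```python
import pandas as pd

# Hypothetical call center data.
metrics = pd.read_csv("call_center_metrics.csv")
cols = ["avg_handle_time", "calls_per_hour", "quality_score"]

# If avg_handle_time correlates negatively with quality_score, "faster" and
# "better" pull in opposite directions, and a single composite criterion
# can't be built without deciding how to trade them off.
print(metrics[cols].corr().round(2))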
For warehouse workers or delivery drivers, an organization might place a high value on safety. However, incidents are often so rare that safety makes a poor criterion.
In some extreme cases, only a single-digit percentage of employees will have a workplace accident over a long period of time, which makes the outcome almost impossible to predict.
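The simulation below (with made-up parameters) illustrates the base-rate problem: even when an assessment genuinely predicts accident risk, a roughly 3% base rate leaves only a small, noisy point-biserial correlation to detect:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 500  # a respectable validation sample

# Accident risk genuinely depends on the assessment score, but the event
# itself is rare (~3% base rate). All parameters are illustrative.
assessment = rng.normal(size=n)
p_accident = 1.0 / (1.0 + np.exp(3.5 + 0.5 * assessment))
accident = rng.binomial(1, p_accident)

r, p = stats.pointbiserialr(accident, assessment)
print(f"Base rate = {accident.mean():.1%}, observed r = {r:.2f} (p = {p:.3f})")
```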
Many companies only have store-level metrics but are unable to tie them back to any individual employee. These metrics may be useful to evaluate a manager’s performance, but for individual employees, metrics at the store-level are largely out of their control—and not useful for evaluating a selection tool.
Does what the assessment measures conceptually relate to the performance metric you are examining? We see this mismatch most often with turnover.
A work sample simulation-based assessment isn’t going to conceptually relate to turnover, and therefore turnover shouldn’t be used as an outcome here.
Cognitive ability can help employees in complex jobs that call for the use of math and numbers to solve problems, but this predictor is typically unrelated to something like customer service.
Sample size is one of the most common challenges we encounter. Calibrating to a metric with only 50 data points does not provide a stable estimate and may not lead to strong prediction in the future.
There’s no line in the sand here—but more is better.
We typically wouldn’t do an analysis with less than 100 matched cases of performance and assessment data, but in our experience, 250 is where we start getting more stable estimates. Hiring is going to be probabilistic, and larger studies mean results are more likely to generalize to future hires.
You’re never guaranteed a good hire because of a good process—all you can do is heavily stack the deck in your favor.
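To make the 50-versus-250 point concrete, here is a quick simulation, assuming (purely for illustration) a true validity of 0.30, showing how much observed correlations bounce around at each sample size:

```python
import numpy as np

rng = np.random.default_rng(1)
true_r = 0.30  # assumed true validity, for illustration only

for n in (50, 250):
    estimates = []
    for _ in range(2000):
        # Simulate a matched sample of assessment scores (x) and
        # performance (y) with the assumed true correlation.
        x = rng.normal(size=n)
        y = true_r * x + np.sqrt(1 - true_r**2) * rng.normal(size=n)
        estimates.append(np.corrcoef(x, y)[0, 1])
    lo, hi = np.percentile(estimates, [2.5, 97.5])
    print(f"n = {n}: 95% of observed validities fall between {lo:.2f} and {hi:.2f}")
```

At n = 50 the observed validity can land almost anywhere from near zero to well above the true value; at n = 250 the estimates cluster much more tightly around it.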
Some metrics may be considered critical for the business but, in reality, represent a very narrow scope of performance.
A call center metric of average handle time is a potential example. The business may want to handle as many customer calls as possible within a short window of time, but in reality, there is much more to being an effective call center agent.
But if you only have one metric and are asked to show impact, you may make decisions about your assessment that don’t capture the full range of what constitutes a good hire.
With all of these challenges, it’s perhaps not surprising that many companies don’t even try to gather or maintain criterion metrics that can be used in a validation study. However, that’s not to say they aren’t important.
Here are a few tips for gathering strong metrics.
If you want the best-performing employees, you need to track the performance of your assessment. Learning how to design a good criterion validation and determining whether a client has the data necessary to perform one is critical in building an effective assessment strategy.
Many practitioners think of an assessment like a book, where the version they buy is the final version. In reality, an assessment is more like a house that needs constant attention and upkeep to ensure it functions optimally.
And if you want the best ROI from your assessments, you need to hold your assessment vendors accountable.
Ready to learn more? Request a demo today.