Benchmark Drives Behavior, the saga continues…

If you see what is right and fail to act on it, you lack courage.

Confucius

In my previous article,¹ I covered some of the challenges I encountered during my career that caused me to question benchmarks and the process of creating them. I touched on how certain benchmarks can lead to problematic outcomes across a myriad of spheres, such as safety, financial reporting, and employee retention. In that article, I said that the follow-on article would provide additional, more detailed examples of where things went wrong, along with some analysis. I agreed to provide four examples of cases where benchmarks had gone awry. (I set that as my benchmark.) As I wrote this second installment, I realized that I could not meet that benchmark and do the selected examples a modicum of justice. Accordingly, this article contains two examples and addresses some of the feedback I received after the first publication. Do not fear: as we progress through this series, more examples will be forthcoming and, in the future, I will strive to set better benchmarks for myself.

I hope that diving into these examples will help facilitate the development of more robust benchmarks through a more thoughtful process of designing and creating them. I also hope it leads to considering (where possible) a longer-term view when establishing benchmarks, which may help mitigate some of the negative outcomes addressed in the examples. As part of this second article, I will discuss both the benchmarks and their implementation and monitoring. Additionally, because of some of the feedback I received on the first article in the series, I would like to clarify a few things: I will (re-)introduce the fraud triangle and its linkage to benchmarks from a fraud risk perspective, and address the distinction between benchmarks and Key Performance Indicators (KPIs). Sometimes this may be a distinction without a difference.

Benchmarks vs. KPIs

One of the comments I received related to benchmarks vis-à-vis Key Performance Indicators or, as we know them, KPIs. Benchmarking is traditionally defined as in the first article in the series, and I have repeated that definition here.

A Benchmark is a standard or point of reference against which things may be compared or assessed.²

KPIs are also measures designed to gauge performance, but they may have been designed in isolation and not referenced against other measures. That said, poorly developed KPIs can effectively have the same impact(s) as poorly developed benchmarks. But to be honest, what sounds better: “benchmark drives behavior” or “KPIs drive behavior”?

I also chose to use the term benchmarks because that is what investment professionals use to gauge their performance, and this is an industry obsessed with benchmarks. For example, an investment manager might target a return 200 bps (2 percentage points) above that of the S&P 500, a common benchmark in the investment world. In short, you may be able to substitute KPIs for benchmarks in certain situations or use the terms interchangeably, but that will not be the case in all instances.
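To make the investment example concrete, here is a minimal sketch of the arithmetic behind a "benchmark plus spread" target. The function names and the sample return figures are hypothetical and chosen purely for illustration; the only facts relied on are that one basis point is one hundredth of a percentage point and that the spread is added to the benchmark return.

```python
BPS_PER_PERCENTAGE_POINT = 100  # 1 percentage point = 100 basis points


def target_return(benchmark_return_pct: float, spread_bps: float) -> float:
    """Return the target (in percent): benchmark return plus a spread in bps."""
    return benchmark_return_pct + spread_bps / BPS_PER_PERCENTAGE_POINT


def beat_target(actual_pct: float, benchmark_return_pct: float, spread_bps: float) -> bool:
    """True if the actual return meets or exceeds the benchmark-plus-spread target."""
    return actual_pct >= target_return(benchmark_return_pct, spread_bps)


# Hypothetical figures: the index returns 10.0% and the goal is 200 bps above it.
example_target = target_return(10.0, 200)      # 12.0 (percent)
missed = beat_target(11.5, 10.0, 200)          # False: 11.5% is below 12.0%
achieved = beat_target(12.5, 10.0, 200)        # True: 12.5% is above 12.0%
```

Note that the sketch only measures whether the number was hit. It says nothing about how the return was achieved, which is exactly the gap the rest of this article is concerned with.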

In this second installment of my series, I analyze two real cases where the benchmarks were potentially poorly developed and/or poorly monitored. The examples cover a cross-section of areas where benchmarks may have driven aberrant behavior. Remember that there is often a significant financial component embedded in the behavior that many benchmarks are designed to drive. Before we dive into the specific examples where things went off the rails and try to understand why this might have occurred, I want to reiterate that the objective is to identify weaknesses and blind spots, so that you can better judge where the existing benchmarks you work with can be enhanced. The solution(s) to some of the issues I have identified are not simple. We must recognize that there is a need, or business imperative, to measure, benchmark, or create KPIs for what are often complex systems. In all fairness, it is generally (barring certain exceptions) not one thing that results in a benchmark causing improper outcomes, but several issues converging to create the unintended outcome.

To support this view, I use the analogy of a car accident. I once had this described to me as follows:

You do not have a car accident; instead, what you have is a series of events that lead to a car accident. For example, you get up early at 4 a.m. to fly from Seattle to LA for business meetings. You arrive in LA and fight your way through traffic after dealing with the rental car company. You spend the full day in long meetings, some of which are contentious but all of which are tiring. You then have a business dinner, which starts at 8 p.m. During the dinner, which is dragging on a bit longer than you had hoped, you have that second glass of wine. It is finally time to leave and, for the first time in a while, it starts to drizzle in LA, making the road slippery. You are tired and your reflexes are slower, and then you are forced to make a quick stop due to a traffic situation. You are a little too slow and there you have it: a fender bender. While failing to stop in time caused the accident, all the aforementioned issues affected the state of the driver and their ability to react, and were ultimately contributing factors to the accident.

It is often the same with benchmarks gone bad. There is not one solitary issue that drove the creation of the benchmark or one issue that resulted in the benchmark becoming something it was not meant to be or being manipulated. Rather it is generally a multitude of factors that created the problem.

Revisiting the Fraud Triangle

Before we dive into the specific examples and some suggestions for managing the issues they raise, let’s quickly (re-)visit the fraud triangle and its relevance to benchmarks from a risk perspective. Initially, I considered that benchmarks would influence only one aspect of the fraud triangle: pressure, or an unshareable need. However, upon reflection and after subsequent discussions with current and former colleagues, I am now inclined to believe that faulty benchmarks play into all three sides of the fraud triangle. First, a faulty benchmark increases pressure (the unshareable need) on those being measured. Second, it quickly became apparent that faulty or poorly designed benchmarks also present opportunities for gaming or manipulating them. Finally, I have seen employees get so upset with some benchmarks that gaming or manipulating them became, in their minds, a badge of honor, which is rationalization at work. In short, faulty benchmarks create risk and, when gauged against the fraud triangle, they exacerbate potential fraud and other risks on all three sides. Okay, enough said; let’s now look at some real-life examples and some suggestions on how to deal with the risks.

The Fraud Triangle revisited

Brummell G/ Susan Crews March 25, 2015, News Group

Where it went off the rails, some real-life examples:

Example 1: They went to Sears Automotive and got “Roebucked”

In 1992, California consumer protection officials charged Sears, Roebuck and Co. with cheating automobile repair customers.³ I have come across this case on numerous fronts. It is addressed in the textbook Fraud Examination⁴ as part of a forensic accounting curriculum, where it serves as an example of an unshareable need as defined in the fraud triangle. The Sears case was also used as an ethics example in a well-written piece on BrainMass.⁵ I will use the same example with a slightly different slant, focusing on the benchmark component that may have been the root cause of the issues.

To summarize, in 1991 Sears unveiled a new “productivity incentive” plan whose objective was to increase profits in its auto centers. This, in my mind, is a valid and appropriate business goal. As part of their lifecycle, organizations need to consistently evaluate and update programs and systems to stay competitive. Before 1992, Sears Automotive was known for quality and value; the auto mechanics were paid hourly wages and were expected to meet certain production quotas. Again, such quotas are part of normal operations. Then, in 1991, decisions appear to have been made to change the compensation structure to include a commission component. It would appear that someone considered that this would better align the interests of the organization and the mechanics, resulting in a win-win for both. On the face of it this is a noble objective, but looking back one must ask: was the customer considered?

Under the “new program”, the mechanics were to be paid a base salary augmented by an additional fixed dollar amount if they also met predetermined hourly production quotas. Additionally, the “Auto Service Advisors”, the individuals taking orders from customers and functioning as the interface between the mechanics and the customers, were moved from a pure salary-based compensation model to a program designed to increase sales. This meant that commissions and product-specific sales quotas were introduced for the advisors as well. In short, the advisors would be compensated for upselling additional services to customers. This is a great program… if the customer needs the service. The program was rolled out in 1991 and, apparently, it did not take long for issues to arise.

In June 1992, the California Department of Consumer Affairs accused Sears of violating the state’s Auto Repair Act and sought to revoke the licenses of all Sears auto centers in California. The company acknowledged problems but denied fraud and/or wrongdoing.

For its part, the California Department of Consumer Affairs alleged that Sears enforced a quota system that required employees to “sell a certain number of repairs and/or services during every 8-hour work shift,” resulting in overcharges that averaged $233 a car. The charges grew out of an 18-month undercover investigation.⁶

The regulatory response was purportedly the result of a significant increase in the number of consumer complaints and was supported by an undercover operation related to brake repairs. It appears that the new compensation scheme, designed to improve sales for Sears Automotive and compensation for the employees, resulted in employees conducting work that was neither warranted nor required and selling products that were of no value to the customer.

When this all finally came out, as we often see in such cases, other states piled on and made similar assertions against Sears Automotive. Many publications and government organizations stated that Sears Automotive centers had been systematically misleading customers. The California Department of Consumer Affairs went so far as to say that the misleading actions could be tied directly back to the Sears Auto Centers’ compensation system, that is, the new compensation program. For our purposes, that compensation system is also known as the benchmark.

The Sears CEO at the time denied the charges and claimed no fraud had occurred. The CEO admitted to isolated errors, accepted personal responsibility for creating an environment where “mistakes” had occurred, and outlined the actions the company planned to take to resolve the issue, including $46 million in customer coupons.⁷ The benchmarks were redesigned only after an event that caused significant reputational damage to the organization. Had the benchmark been better thought out and designed, effectively rolled out, and properly controlled and monitored, Sears could have avoided costs, legal fees, and frustration, as well as the loss of goodwill, reputation, and subsequent revenue.

So, what can we glean from the above? It would appear that when the benchmarks were set, not all stakeholders were adequately considered, including the most important stakeholder for a retailer: the customer. Sears Automotive seems to have focused on the top line (more sales) and how this could be achieved, without considering what other behaviors the new benchmarks might drive. I am going to go out on a limb here, but it does not feel like this new benchmark/compensation scheme was piloted properly, or piloted at all. If a good pilot had been conducted, these issues might have been identified prior to the full rollout.

The risks that the new program created were not properly identified, nor were any controls created to mitigate those risks. If the overcharges were truly $233 per car, as alleged by the State of California, it raises the question: how many dangerous issues did the mechanics fail to discover before the introduction of this new system? The $233 number is significant in that there would have been tracking mechanisms in place to see how the program was working. Yet no one appears to have asked: How is this possible? Were we that bad previously? It appears to have been a case of “things are going great, do not question it.” Regrettably, we saw a similar situation in the independent directors’ report on Wells Fargo and its reference to “eight is great,” which resulted from the pressure placed on staff to encourage banking customers to hold a minimum of eight accounts.⁸ This has caused significant problems for the bank.

Example 2: The EPA and the measurement of particulate matter, you can breathe easy now

The next example is more recent and quite timely, considering the current intense focus on Environmental, Social, and Governance (ESG) initiatives. The issue was raised in The Economist in an article titled “We Were Expecting You,”⁹ which delves into research by Eric Zou. It details how a poorly designed regulation (for our purposes, regulation = benchmark, i.e., the regulation sets the minimum standards) allowed for the gaming of rules that were supposed to help curtail harmful emissions and pollution. Helpfully, the article also describes how advancements in measuring technologies have reduced the impact of the gaming that may have occurred around these measurements in the past.

According to the research conducted by Eric Zou¹⁰ and covered in the Economist article, the EPA would historically publish, in advance, a list of dates at six-day intervals on which it required state and local agencies to measure certain harmful particulate matter. In other words, it pre-announced when and where these monitoring tests would be conducted. In a nutshell: “Hey everybody, on these dates we are going to be testing you for compliance.”

The article describes the forewarnings as analogous to the police announcing surprise raids in advance, or the World Anti-Doping Agency (WADA) announcing which athletes will be tested for doping violations and when. I would add, from a finance perspective, that it is similar to letting auditees know you are coming to conduct a surprise cash count.

In short, whilst the benchmark that set the allowable levels of harmful particulate matter may have been a good one, the forewarning built into the overall process allowed the benchmark to be gamed. Zou’s research found that pollution was generally lower on days when particulate matter was measured and higher on days when potential polluters knew they would not be measured. Being out of compliance (i.e., particulate matter was too high) could result in fines for local governments and a requirement for certain factories to install expensive clean technology solutions. Those being measured learned quickly that meeting the requirements on the pre-published dates would ensure benchmark achievement and enable them to continue operating in a potentially less environmentally friendly manner on all other dates, when measurements were not mandated. The analogy that stands out for me in the article is the pre-announcement of police raids. If we are raiding a house to look for narcotics, announcing the raid in advance will likely reduce our ability to find the contraband we seek.
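The comparison described above, monitored days versus all other days, can be sketched in a few lines. This is only an illustration under stated assumptions: the readings are made up, the function names are hypothetical, and it presumes that an independent daily reading exists for every day, including days on which no official measurement was scheduled. The only detail taken from the article is the six-day measurement interval.

```python
from statistics import mean

MONITORING_INTERVAL_DAYS = 6  # from the article: measurement dates at six-day intervals


def split_by_schedule(daily_readings):
    """Split daily readings into (scheduled, unscheduled) lists.

    Assumption for illustration: day 0 is a scheduled measurement day,
    and every sixth day after it is scheduled as well.
    """
    scheduled = [r for day, r in enumerate(daily_readings)
                 if day % MONITORING_INTERVAL_DAYS == 0]
    unscheduled = [r for day, r in enumerate(daily_readings)
                   if day % MONITORING_INTERVAL_DAYS != 0]
    return scheduled, unscheduled


def monitoring_gap(daily_readings):
    """Average reading on unscheduled days minus average on scheduled days.

    A positive gap means pollution was higher when nobody was officially measuring.
    """
    scheduled, unscheduled = split_by_schedule(daily_readings)
    return mean(unscheduled) - mean(scheduled)


# Made-up readings for twelve consecutive days (arbitrary units).
readings = [10.0, 12.0, 12.0, 12.0, 12.0, 12.0,
            10.0, 12.0, 12.0, 12.0, 12.0, 12.0]
gap = monitoring_gap(readings)  # 2.0 with these made-up numbers
```

A gap that stays persistently positive is the kind of signal a benchmark owner would want a control to surface: the benchmark is being met on the days that count, and only on those days.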

So, what went wrong? For some reason, it appears that a pre-announcement provision was written into the regulation, or was simply provided by the EPA. To reiterate, once people knew they were not being observed on certain dates, behavior appears to have changed, allowing for increased harmful particulate matter on those dates with no adverse consequences. Without knowing how the measurement process and timeframe were negotiated or agreed to, it is not clear where and how this led to the documented anomalies. What is clear is that the data developed by Zou’s research and elaborated on in The Economist supports the conclusion that pollution increased when measurement did not take place. Accordingly, a benchmark was rolled out that, while optically demonstrating action, did not fully reflect the reality of particulate matter emissions. In short, a good benchmark, potentially poorly administered and controlled.

As a result of technological improvements and continuous monitoring in more and more locations, this gaming of the benchmark may be coming to an end. However, put simply, why would we tell people the exact dates on which we are going to check whether they comply with the prescribed rules? The article in The Economist points out that, in this instance, behavior appears to have been modified to meet the benchmark, which resulted in higher perceived emissions reductions. Based on Zou’s research, it also appears that emissions/pollution increased when no testing occurred. From my point of view, the fault here lies with how the benchmark was implemented and monitored. In this instance, the targeted ppm level for emissions was admirable, but I do not believe the regulators intended for the regions to achieve the targets only every sixth day. In fairness, I am not familiar with what it takes to have rules such as this enacted and approved. It may very well be that, from a political perspective, there was some horse-trading that resulted in the system and its flaws, which persisted until emissions could be monitored continuously rather than through spot checks, as is now more commonplace.

Interim Conclusion

Based on the previous article and the issues covered in this follow-on article, what insights can we draw about benchmarks, their objectives, their design, and their monitoring?

Firstly, I always like to come back to the question, “What is the objective?” Then I suggest asking why we are creating this benchmark and whether it is consistent with our corporate values and Code of Ethics. This should be followed by a re-evaluation: “Does the benchmark help us to achieve the objective?” Ideally, and where possible, the benchmark should then be piloted.

Reaching back into the accounting literature, I think we need to ask WCGW (“what could go wrong?”) and apply this more broadly to develop better benchmarks. The benchmark should be stress-tested and subjected to a brainstorming session to identify the downside and upside risks. This brainstorming should take place at multiple levels in an organization. The downside risks identified need to have controls developed around them to ensure that the people who are subject to the benchmarks are meeting the objectives as intended. There should be consequences for benchmark manipulation, fudging or otherwise. The organization needs to know its blind spots.

Another concept similar to WCGW is the pre-mortem, as outlined by M. Mauboussin in Think Twice.¹¹ Here, before deciding whether to implement a benchmark, a detailed session is held to identify all the downside risks so that they can be dealt with prior to executing on the decision, which in our case is implementing a new benchmark.

Without getting too far ahead of ourselves, another exercise to conduct might be gauging the impact that a benchmark has on the various components of the fraud triangle. Is the benchmark neutral to the fraud triangle or does it create higher risks? Higher risks might be the right answer but if that is the case those higher risks must be mitigated through better monitoring, controls, and measuring of results to ensure that people stay in their swim lanes.

These are some good ideas and should be considered when setting benchmarks. How else do you avoid getting hit if you do not know your blind spots?

Finally, benchmarks should not be developed in isolation or in a vacuum by one group. In the past, this was left to newly minted MBAs (often external consultants), HR, or personnel, with limited input from other areas. Remember, these groups are not required to eat their own cooking and, once the benchmark is set, they move on with a “my work here is done” mantra. In today’s world, we must utilize the skills and insights of multidisciplinary teams. Systems and requirements have become ever more complex and, while this presents opportunities for continued advancement and progress, it also creates risk. To summarize, those to be gauged by benchmarks need to be involved in the process, as do HR, compliance, finance, and any other potentially impacted group. Since there is a need for controls and monitoring surrounding benchmarks, it is probably useful to have the controller’s function, compliance, and internal audit provide input on proposed benchmarks. Furthermore, with the recent focus on ESG, it might pay to have the leaders of that initiative involved in setting benchmarks that might impact the areas they are concerned with. This could head off allegations of greenwashing in the future.

While on the one hand all this may sound unwieldy, and it might take longer to develop suitable benchmarks, on the other hand it is better to do this upfront than to deal with investigations, restatements, employee disgruntlement, issues of equity, and other potential legal issues that might arise because of faulty benchmarks. Faulty benchmarks will ultimately have to be redone anyway, as in the Sears example above. In short, better to get it right the first time around.

The next step is to address what might be a good process for setting solid benchmarks. In the next article in the series, I will attempt to address some of the factors, such as risks, controls, and behavioral objectives, that one may wish to consider when assessing existing benchmarks or developing new ones. Until next time…

Guido van Drunen

1. Benchmark Drives Behavior
2. Benchmark, Encyclopedia.com
3. John Yang, The Washington Post, June 12, 1992
4. Steve Albrecht, Chad O. Albrecht, Conan Albrecht, Mark F. Zimbelman, Fraud Examination, 6e, Cengage
5. BrainMass, Business Philosophy and Ethics, 228985
6. A Case of Consumer Confidence, LA Times, June 1992
7. Sears to Repair Image With $46 Million in Coupons: Retailing: It may be the largest such consumer fraud settlement ever. California auto centers will be on probation for 3 years., LA Times, September 3, 1992
8. Independent Directors of the Board of Wells Fargo & Company, Sales Practices Investigation Report, April 10, 2017
9. Poorly devised regulation lets firms pollute with abandon, The Economist, September 4, 2021
10. Eric Zou, University of Oregon, analysis of 1,200 air monitoring sites
11. Michael J. Mauboussin, Think Twice: Harnessing the Power of Counterintuition, Harvard Business Press, 2009
