Caught by the numbers

Data analytics winnows out possible fraudster

By John Giardino, CFE, CPA


The bank controller used internal statistics to point the finger at an over-performing loan officer. Learn how you can use powerful data analytics tools to narrow your list of suspects for fraud examinations.

The case in this article, a composite of several similar cases involving data analytics and statistical analysis applied to fraud examinations, is designed to be a tutorial for CFEs. — ed.

Jeff Baker, controller for a large regional bank, entered the room hoping to get an admission of guilt from his suspect. He came prepared. Baker had spent considerable time carefully preparing his questions and planning his interview tactics, and he had the documentation to back him up. Above all, he was proud that he had identified a sizeable straw-purchase and kickback fraud scheme, which probably involved the bank employee in the interview room. Baker credited the identification of this scheme to effective analytical procedures that included some basic statistical methods.

Weeks before, when Baker reported a significant spike in defaults on mortgage loans through the second quarter of 2013, several of the bank’s board members expressed concern over a recently adopted growth strategy. In late 2012, the bank had eased underwriting requirements in an effort to increase market share of residential mortgage lending; a major component of these new underwriting practices was an across-the-board reduction in debt-to-income (DTI) requirements for borrowers.

Board members were worried about impact to the balance sheet. They wanted assurances that toxic, defaulted assets wouldn’t erode shareholders’ equity. Baker, however, wasn’t convinced that changes in underwriting guidelines were the root cause of the uptick in loan nonperformance. He was aware that the stalling economic climate within the bank’s operating footprint had exacerbated the moral hazard for mortgage fraud, so he had been following several high-profile prosecutions of straw-purchase schemes at other financial institutions throughout the region.
Despite his employer’s sterling reputation based on lending history, business practices and community involvement, Baker worried that the organization’s reluctance to break from the traditional risk management model created a blind spot to internal threats, weak controls and susceptibility to fraud schemes.



Figure 1: The CORREL function

After buying some time from the board members by voicing his suspicions, Baker tested his theory that the easing of DTI requirements was unrelated to the spike in nonperforming loans. He calculated the correlation coefficient of nonperforming loans to DTI requirements. Baker obtained aggregated default rate data of loans made at various DTI requirements from internal management reports and performed his calculations in Microsoft Excel using the CORREL function. (CORREL is based on the mathematical formula in Figure 1, above.)

In this case, he compared two variables: DTI requirements and default rate. If the assumption is that default rate depends on DTI, then DTI is the independent variable and default rate is the dependent variable. A correlation coefficient ranges between -1 and 1; values close to -1 or 1 indicate a strong negative or positive correlation, respectively. In a negative correlation, the default rate would decrease as DTI requirements increase. In this particular case, the correlation coefficient was very close to 0, which indicates only a weak relationship between required DTI and default rate. Baker was correct in his first assumption: the new growth initiatives weren't a significant driver of loan nonperformance.
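Excel's CORREL implements the standard Pearson correlation coefficient. As a minimal illustration of the calculation Baker performed, here's a pure-Python sketch; the DTI limits and default rates below are hypothetical stand-ins, not the bank's actual figures.

```python
def correl(xs, ys):
    """Pearson correlation coefficient, equivalent to Excel's CORREL."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance, unscaled)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the sums of squared deviations
    sd_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    sd_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (sd_x * sd_y)

# Hypothetical data: maximum allowed DTI vs. observed default rate
dti_limits   = [0.36, 0.38, 0.40, 0.43, 0.45]
default_rate = [0.021, 0.019, 0.022, 0.020, 0.021]

r = correl(dti_limits, default_rate)
print(round(r, 3))  # close to 0: little linear relationship
```

A value near zero, as Baker found, suggests loosening DTI limits didn't move defaults.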


In statistical terms, a probability distribution is a graph, table or formula that gives the probability of each value of a random variable, such as household income, IQ or a set of test scores. The normal probability distribution is perhaps the most widely known and is commonly called a "bell curve" based on its appearance; the mean of the distribution sits at the top of the bell curve because it represents the expected value. The bell curve is symmetric, and a key related concept is variation from the mean, measured by the standard deviation. In a normal distribution, nearly all of the possible values — about 95 percent, in fact — fall within two standard deviations of the mean. The greater the standard deviation, the wider the range of values that occur naturally.
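The two-standard-deviation rule is easy to verify empirically. This Python sketch draws a large simulated sample from a normal distribution (the mean and standard deviation here are arbitrary, loan-sized illustrations) and counts how many values land within two standard deviations of the mean:

```python
import random

random.seed(42)
mean, sd = 213_157, 68_195  # illustrative loan-amount parameters

# Simulate 100,000 normally distributed loan amounts
sample = [random.gauss(mean, sd) for _ in range(100_000)]

# Count how many fall within two standard deviations of the mean
within_2sd = sum(mean - 2 * sd <= x <= mean + 2 * sd for x in sample)
share = within_2sd / len(sample)
print(f"{share:.1%} of values fall within two standard deviations")
```

The share comes out near 95 percent, matching the rule of thumb.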

Baker determined the probability distribution of original loan values of the bank’s outstanding mortgage loans based on the mean and standard deviation values obtained from aggregated internal data. The distribution he identified is represented in Figure 2 below.


Figure 2: Probability distribution - original loan amounts

This chart indicates that original mortgage loan amounts are most likely to fall close to the mean value of $213,157. Because about 95 percent of all original loans issued by the bank fall within two standard deviations of the mean in either direction, almost all of the mortgage loans made by the bank have an original loan balance between $76,767 and $349,547. If nonperformance of loans occurred randomly and wasn't tied to any particular underwriting characteristic, Baker would expect to observe a similar distribution for loans currently in default. However, a statistical analysis of data on the nonperforming loans that the bank originated reveals significantly different characteristics, as presented in Figure 3 below.
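The two-standard-deviation bounds quoted above follow mechanically from the portfolio's mean; the standard deviation of roughly $68,195 isn't stated in the article, but it's implied by the quoted range. A quick Python check:

```python
mean = 213_157
sd = 68_195  # implied: (349_547 - 76_767) / 4

# About 95% of a normal distribution lies within two standard deviations
low, high = mean - 2 * sd, mean + 2 * sd
print(low, high)  # 76767 349547
```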


Figure 3: Probability distribution - original loan amounts (nonperforming) 

The population presented in this distribution is those loans within the bank's portfolio that are in default. The mean original loan amount on these nonperforming loans is $95,132 — significantly lower than the mean original loan amount of the bank's entire mortgage portfolio. The observed standard deviation of this population is $26,538; based on a normal probability distribution, Baker concluded that virtually all nonperforming loans were originated at amounts between $42,056 and $148,208. The disparity is striking: loans originated by the bank that are in default exhibit a much lower mean original loan amount and far less variability than the mortgage portfolio as a whole.

Baker was aware that many possible fraud and non-fraud scenarios could explain this disparity. Borrowers with original mortgage loan amounts between $42,056 and $148,208 may present a greater credit risk based on volatile employment situations or adverse credit histories. Mortgage loans in this range typically require a small down payment, which increases the borrowers' incentive to "walk away" when situations become dire. Baker noted that the dispersion of the nonperforming loans is also very narrow: the coefficient of variation, or the ratio of standard deviation to mean, is less than 1/3 ($26,538/$95,132 = .28). In other words, most of his employer's defaulted loans fall within a very narrow band of original loan amounts above and below the statistical mean.
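The coefficient-of-variation arithmetic, using the figures from the nonperforming population:

```python
# Mean and standard deviation of the nonperforming-loan population,
# as stated in the article.
mean_npl = 95_132
sd_npl = 26_538

# Coefficient of variation: standard deviation relative to the mean
cv = sd_npl / mean_npl
print(round(cv, 2))  # 0.28 — well under 1/3, a tightly clustered population
```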

Baker focused on one particular characteristic based on his knowledge of his organization’s internal control structure: The mean value of the original loan amount on nonperforming loans is slightly below $100,000, and the bank requires secondary approval on those mortgages with original loan amounts above that threshold. This secondary approval serves as a check against unauthorized (and potentially fraudulent) loan origination. Baker’s analysis of the probability distribution of nonperforming mortgage loan data indicated the secondary approval control might have been circumvented in a mortgage-fraud scenario.
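The red-flag test implied by Baker's finding — a cluster of loans originated just under the secondary-approval threshold — can be sketched as a simple filter. The loan amounts and the 10 percent band width here are hypothetical assumptions, not the bank's data:

```python
THRESHOLD = 100_000  # secondary approval required at or above this amount
BAND = 0.10          # flag loans within 10% below the threshold

def just_below_threshold(amount, threshold=THRESHOLD, band=BAND):
    """True if a loan amount sits just under the approval threshold."""
    return threshold * (1 - band) <= amount < threshold

# Hypothetical original loan amounts from a defaulted-loan extract
loans = [45_000, 92_500, 98_750, 99_900, 104_000, 150_000]
flagged = [a for a in loans if just_below_threshold(a)]
print(flagged)  # [92500, 98750, 99900]
```

In practice an examiner would run this against the full loan tape and then look for common loan officers, appraisers or closing agents among the flagged loans.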
