The central problem with the sales and use tax audits of large firms is that there are so many transactions that detailed examination of every single transaction is not feasible.
The auditors attempt to perform their work more efficiently by testing a small sample of sales or purchase invoices in detail and projecting the sample results over the entire audit period. Frequently the state auditors do not use modern statistical methods for selecting the sample and projecting the results. Although thousands of articles and books have discussed the application of statistics to numerous other areas, few have specifically addressed the problem of sales and use tax audits.
The intent of this paper is to identify significant legal and statistical issues in the audits of large firms that are pervasive across the states. The body of the paper is organized as follows. The first section describes the current sampling practices of state sales and use tax auditors. The second section summarizes how courts have determined the admissibility of statistical evidence, and includes examples from several specific states. The third section discusses sampling and estimation issues from a statistician’s perspective. The final section summarizes the opportunities for research by scholars, practitioners, and policy makers from many disciplines.
The nineteen states reporting routine use of statistical sampling were Arizona, California, Colorado, Connecticut, Illinois, Iowa, Kansas, Maryland, Minnesota, Mississippi, Missouri, New York, Ohio, Pennsylvania, South Carolina, Tennessee, Texas, Wisconsin, and Wyoming. The use of the term “statistical sampling” in the Wisconsin survey is not consistent with the statistician’s definition of that term. As discussed in the third section of this paper, statisticians require that valid statistical sampling procedures include an estimate of the precision or confidence interval in the sample estimate. In the Wisconsin survey, five states (Maryland, Mississippi, Ohio, Texas, and Wyoming) indicated they did not compute the confidence interval or another measure of precision. Thus, these five states are using methods that a statistician would define as nonstatistical sampling rather than statistical sampling.
California, Illinois, and New York have the most experience with statistical sampling. Each of these states reported performing more than 100 statistical sampling audits per year, and has procedures for determining sample size and for producing confidence interval estimates.
More details are available in the Survey on Sampling in Field Audits by George Larscheid. Publication of an updated version of this survey every few years would be very beneficial to state governments, taxpayers, and consulting firms. [Update: The Federation of Tax Administrators is conducting a survey of tax audit sampling practices by the states. The survey results should be publicly available by the end of 2001.]
Most state tax agencies have a designated sampling specialist in their headquarters staff who is expected to provide training and consultation in particular cases. These sampling specialists often lack the time to update software and training programs because of hiring freezes. As the headquarters sampling specialists are requested to assist the field staff and appeals officers with large and complex cases, little time is left to improve training materials. Some states are operating with sales and use tax audit training manuals that have not been revised for more than a decade.
Toxic drug tort litigation has resulted in widely-cited rulings on the use of scientific evidence in federal and state courts. In Daubert v. Merrell Dow, 509 US 579 (1993), the US Supreme Court held that the trial judge must weigh many factors in deciding whether to admit expert scientific testimony. These factors include whether the theory or technique has been subject to peer review and publication and whether it has attracted widespread acceptance within a relevant scientific community. The Texas Supreme Court reached a similar decision for the admissibility of expert testimony in state courts in DuPont v. Robinson, 923 SW 2d 549 (1995, rehearing overruled, 1996).
As the case law applying statistical evidence in tort claims becomes more developed, it will be extended to other areas. Some state courts may hold that the standards of evidence in common law tort claims are different from the standards of evidence for state tax auditors operating under their state’s statutory authority. The legislatures can amend their statutes specifying the methodology for determining state tax deficiencies or can grant rule-making authority to the state tax agency.
Professor Sprowls was hired by Sears in the case of Sears, Roebuck and Co. versus the City of Inglewood, tried in Los Angeles Superior Court in 1955. The City of Inglewood imposed a half-percent sales tax on sales made by stores to residents living within the city limits. Sears’ internal auditors discovered the Sears store in Inglewood had incorrectly estimated the amount of tax-exempt sales made to out-of-city residents. A sample of a few days was performed and an estimate was made that Sears had overpaid the sales tax in the amount of $27,000 for the eleven calendar quarters beginning January 1, 1949. After the City of Inglewood refused Sears’ refund claim for $27,000, Sears sued the city.
Professor Sprowls conducted a statistical sample in support of Sears’ refund claim. He randomly selected 33 days out of 826 working days and had the day’s sales slips examined for in-city or out-of-city addresses. On the basis of this sample he estimated the mean amount of refund for the entire 826 days was $28,250 with a 95 percent confidence interval between $24,000 and $32,400. Professor Sprowls testified before the judge, but the judge held Sears must prove its refund claim on each individual transaction rather than on the sample information. Subsequently, a complete audit of all transactions over the entire 826 days was performed and the actual refund amount was determined to be $26,750.22.
In his article, Professor Sprowls pointed out that the actual number determined by a complete audit was well within his confidence interval ($24,000 to $32,400), and was quite close to the original $27,000 claim submitted by Sears. The city, taxpayer, and court system could have avoided the substantial time and expense of the litigation by accepting the sample estimate. Professor Sprowls urged attorneys, judges, and statisticians to work together to bring about the acceptance of sample data as evidence in courts of law.
The Sprowls article illustrates how little progress has been made over the past 40 years. Courts and tax collectors still have trouble understanding the concept of confidence interval estimation. States are willing to use sampling for determining deficiencies, but not for estimating refund claims. Of course, the amounts at stake are much larger today. Sample stratification techniques are available to apply to the audits of large complex multi-location taxpayers.
When information in the possession of the commissioner indicates that the amount required to be collected or paid under this chapter is greater than the amount remitted by the vendor or paid by the consumer, the commissioner may audit a sample of the vendor's sales or the consumer's purchases for a representative period, to ascertain the per cent of exempt or taxable transactions or the effective tax rate and may issue an assessment based on the audit. The commissioner shall make a good faith effort to reach agreement with the vendor or consumer in selecting a representative sample period.
Observe that the Ohio statute specifically authorizes the use of a “representative sample period”, which could be interpreted as one or two months of transactions selected from an examination period covering several years. Furthermore, the Ohio statute makes no mention of statistical sampling.
Lubrizol contended that the Ohio Tax Commission should have reviewed all transactions in the audited period for both tax underpayments and tax overpayments. If overpayments had been included in the sample, they would have offset some of the underpayments and reduced the total estimated tax, penalty, and interest. However, the state court held against the taxpayer, finding that the statute required the auditor to search only for underpayments.
If the taxpayer demonstrates that any sampling method used by the comptroller was not in accordance with generally recognized sampling techniques, the audit will be dismissed as to that portion of the audit established by projection based upon the sampling method, and a new audit may be performed [emphasis added].
The term “generally recognized sampling techniques” has not been defined in a professional standard issued by any organization of expert statisticians or auditors. Generally accepted accounting principles (GAAP) and generally accepted auditing standards (GAAS) are promulgated by various accounting standard-setting organizations. However, the leading statistics organization, the American Statistical Association, has not issued professional standards on concepts such as “generally recognized sampling techniques.” The Texas statute does not specify how a taxpayer can demonstrate the sampling method was inappropriate.
Anecdotal comments from various people indicate that the Texas Comptroller’s Office is interested in improving its statistical procedures and training without prolonged litigation. The Comptroller nearly always settles cases involving sampling methodology before a trial occurs in state district court.
In nonstatistical sampling, the auditors estimate sampling risk by relying on professional judgment. The severe limitation of nonstatistical sampling is that it does not allow the auditor to make a quantitative estimate of sampling risk. An example of nonstatistical sampling is block sampling, in which the auditors select a few days or weeks from the population that the auditor or taxpayer deems to be representative of the entire population. By not taking sample transactions over the entire audit period, block samples run the risk of producing sample information that is relevant only to the period for which the sample was taken. If the tax deficiency rate in the sample differs significantly from the rate in the population, the block sampling method will produce results that are not valid.
Statistical sampling methods provide a quantitative estimate of the sampling risk. Statistical sampling requires that the person selecting the sample relies on a random sample selection process rather than his or her judgment about the extent to which the sample represents the population. The statistical sample might not be a good representation of the population in some instances, but this sampling risk can be quantified using statistical formulas derived from the theory of probability.
The Sprowls example discussed above is a good example of statistical sampling. Sprowls randomly selected 33 days out of a population of 826 working days rather than relying on a person’s judgment about representative days. (A modern computerized sales ledger system could have enabled Sprowls to randomly sample from a population of thousands of individual transactions rather than the 826 days.) He made a point estimate that the mean refund for the population was $28,250, and produced a 95 percent confidence interval estimate of between $24,000 and $32,400. Thus, he was 95 percent confident that the true amount of the refund in the population of all sales was between $24,000 and $32,400.
Fundamental to statistical sampling is the concept that the probability of selecting a particular item from the population is known before the sample is drawn. In simple random sampling, each transaction in the population has the same chance as any other transaction of being selected in the sample. If the number of transactions in the population under audit is large, stratified random sampling can be used, which divides the population into subgroups according to specified attributes. For example, a population of one million transactions with a maximum transaction amount of $100,000 may be subdivided into five mutually-exclusive strata, such as $0 to $19,999, $20,000 to $39,999, and so forth. The complete sample consists of the samples randomly drawn from each stratum, taken together. The probability of selection could differ in each stratum. For example, the probability of selection could range from 0.01 percent in the first stratum to 10 percent in the fourth stratum. A complete census, which is a 100 percent sample, could be conducted for all transactions in a particular stratum where individual errors could be a significant part of the estimate for the entire population, such as all transactions over $90,000 in value.
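The stratified selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the equal-width $20,000 bands, and the selection rates are hypothetical and are not taken from any state's audit procedure.

```python
import random

def stratum_index(amount, width=20_000, n_strata=5):
    """Map a dollar amount to a stratum 0..n_strata-1 using equal-width bands
    (hypothetical bands: $0-$19,999, $20,000-$39,999, and so forth)."""
    return min(int(amount // width), n_strata - 1)

def stratified_sample(amounts, rates, seed=0):
    """Draw a simple random sample, without replacement, inside each stratum.

    amounts: transaction dollar amounts for the whole population
    rates:   selection probability for each stratum (1.0 means a full census)
    Returns a dict mapping stratum number to the sampled population indices.
    """
    rng = random.Random(seed)
    members = {h: [] for h in range(len(rates))}
    for i, amount in enumerate(amounts):
        members[stratum_index(amount, n_strata=len(rates))].append(i)
    sample = {}
    for h, idx in members.items():
        # Take at least one unit from every non-empty stratum.
        n_h = min(len(idx), max(1, round(rates[h] * len(idx)))) if idx else 0
        sample[h] = rng.sample(idx, n_h)
    return sample

# Hypothetical population of 10,000 transactions up to $100,000.
gen = random.Random(1)
amounts = [gen.uniform(0, 100_000) for _ in range(10_000)]
rates = [0.01, 0.02, 0.05, 0.10, 1.0]   # the top stratum is a census
sample = stratified_sample(amounts, rates)
print({h: len(idx) for h, idx in sample.items()})
```

Because the selection rate is fixed per stratum before any unit is drawn, the probability of selecting each transaction is known in advance, which is the property the paragraph above identifies as fundamental.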
Statement on Auditing Standards Number 39 (SAS 39), issued by the Auditing Standards Board of the American Institute of Certified Public Accountants, permits financial auditors to use either statistical or nonstatistical sampling. However, SAS 39 specifically requires the auditor to consider materiality in determining sampling risk. SAS 39 cannot be simply translated from financial audits to tax audits without considering the change in materiality and other goals.
To reduce audit costs, financial statement auditors are relying largely on compliance tests of internal control rather than on substantive testing of transactions. In sales and use tax audits, by contrast, the auditor must be skilled both in substantively testing transactions and in using statistical sampling procedures competently. Anecdotal evidence from sources in academe and practice indicates that financial statement auditors in the 1990s have less experience and training with statistical sampling than auditors in earlier periods.
Population and purpose
As stated previously, the purpose of most sales and use tax audits based on samples is to estimate the total amount of underpaid taxes, while ignoring tax overpayments. The focus on estimating the total amount of underpaid taxes has been supported by some state courts, such as in the Lubrizol case discussed previously. From a statistical point of view, the appropriate objective is to estimate the difference between the tax liability reported by the taxpayer and the state auditor’s estimate of the liability for the population. Positive estimates indicate tax underpayment; negative estimates indicate overpayment; and estimates that are not significantly different from zero indicate no change. For example, the Internal Revenue Service uses audits to determine if an income tax return is materially correct and considers both underreported and overreported income in the audit.
The treatment of any tax overpayments observed in the sample is important due to the sample leverage in sales and use tax audits. Suppose, for example, that 100 transactions are sampled from a population of 100,000 transactions and that the sample contains 98 transactions for which taxes were properly paid, one transaction for which a $50 tax underpayment occurred, and one transaction for which a $30 tax overpayment was made. If the overpayment is treated as “no underpayment,” then the average underpayment in the sample is $50/100 = $0.50 per transaction. Multiplying the sample average transaction underpayment by the population size produces the estimated population tax underpayment: (100,000)($0.50) = $50,000. If the credit for the overpayment is applied, then the average underpayment in the sample is ($50 - $30)/100 = $0.20 per transaction and the estimated population tax underpayment is $20,000. The $30,000 difference in the assessment is considerable because any tax underpayment or overpayment has considerable leverage on the estimate due to the high percentage of sampled transactions for which there was neither a tax underpayment nor overpayment.
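The arithmetic in this example can be reproduced with a short Python sketch. The function name and the sign convention (positive for underpayment, negative for overpayment) are assumptions made for illustration.

```python
def projected_underpayment(sample_errors, population_size, credit_overpayments):
    """Project the sample's average error per transaction over the population.

    sample_errors: signed dollar errors, positive = underpaid, negative = overpaid
    credit_overpayments: if False, overpayments are treated as "no underpayment"
    """
    if credit_overpayments:
        total = sum(sample_errors)
    else:
        total = sum(e for e in sample_errors if e > 0)
    return population_size * total / len(sample_errors)

# 98 correct transactions, one $50 underpayment, one $30 overpayment.
sample = [0.0] * 98 + [50.0, -30.0]
without_credit = projected_underpayment(sample, 100_000, credit_overpayments=False)
with_credit = projected_underpayment(sample, 100_000, credit_overpayments=True)
print(without_credit, with_credit)  # 50000.0 20000.0
```

Each sampled transaction stands in for 1,000 population transactions here, so a single $30 overpayment moves the projected assessment by $30,000.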
Most often, strata are determined by using the dollar amount of the transactions as the stratification variable. For example, if transaction amounts have a maximum amount of $1,000,000, then strata are constructed over the interval from $0 to $1,000,000 (e.g., $0 to $49,999; $50,000 to $99,999; $100,000 to $199,999; and so forth). Designing a stratified sampling plan requires answers to three questions:
The answers to these questions require considerable competence and experience in statistical sampling. For example, there are several ways to determine the strata sample sizes. They can be chosen to be proportional to the numbers of transactions in the strata, or they can be chosen to minimize the variation of the estimate of tax underpayment in each stratum.
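The two allocation approaches can be contrasted with a small Python sketch. The variance-minimizing rule shown is Neyman allocation, a standard textbook technique brought in here for illustration; the paper does not name a specific formula, and the stratum sizes and standard deviations below are hypothetical.

```python
def proportional_allocation(stratum_sizes, n):
    """Allocate n sample units in proportion to the number of transactions."""
    total = sum(stratum_sizes)
    return [n * size / total for size in stratum_sizes]

def neyman_allocation(stratum_sizes, stratum_sds, n):
    """Allocate n sample units in proportion to (stratum size x stratum
    standard deviation), which minimizes the variance of the stratified
    estimate for a fixed total sample size."""
    weights = [size * sd for size, sd in zip(stratum_sizes, stratum_sds)]
    total = sum(weights)
    return [n * w / total for w in weights]

# Hypothetical strata: many small transactions, few large and variable ones.
sizes = [90_000, 9_000, 1_000]
sds = [1.0, 10.0, 100.0]      # assumed standard deviations of the tax error
print(proportional_allocation(sizes, 300))
print([round(x, 1) for x in neyman_allocation(sizes, sds, 300)])
```

Both functions return fractional sizes, which would be rounded in practice. The stratum standard deviations are rarely known before the audit and are usually estimated from a pilot sample or a prior audit.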
In addition, variables other than the dollar size of the transaction can be used as stratification variables. For example, vendor groupings or receiving locations could be used in use tax audits. In sales tax audits, stratification variables such as the destination of goods and services, time period of transactions, or state of origin could be used.
Sales and use tax audit managers prefer to use stratified random sampling due to its statistical advantages over simple random sampling. Unfortunately, stratified sampling methods often stretch the resources and statistical competency levels of state auditors with the consequence that mistakes are made in its implementation and the interpretation of results.
Given cost and estimator precision requirements, required sample sizes can be estimated for most sampling designs. In stratified random sampling, for example, it is possible to determine the overall sample size and to allocate the sample size to the strata based on audit cost considerations (e.g., audit cost per transaction and total audit budget).
Most sampling designs used in sales and use tax audits use predetermined sample sizes. That is, the sample size is determined before the sample is drawn. Recently, there has been some interest in applying adaptive sampling plans for tax audits. In adaptive sampling designs, the sampling process is continued until a specified number of sampled units possessing some attribute is observed. For example, it may be known before the sample is taken that the incidence of transactions with tax underpayments is small in a certain population. As a result, a predetermined sample size of 40 transactions may result in finding no transactions for which tax is underpaid. In adaptive sampling, the process of sampling continues from the population until, say, two transactions are found that contain tax underpayments. A consequence of adaptive sampling is that the sample size is a random variable (i.e., cannot be determined a priori). Further, adaptive sampling plans require a considerable amount of statistical expertise to be used competently.
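The stopping rule described above can be simulated with a minimal Python sketch. The function name and the population of 1,000 transactions with a 2 percent underpayment rate are hypothetical.

```python
import random

def sample_until_k_errors(population, k, seed=0):
    """Draw units at random, without replacement, until k units with a tax
    underpayment (error > 0) have been observed. Returns the drawn indices."""
    rng = random.Random(seed)
    order = list(range(len(population)))
    rng.shuffle(order)
    drawn, found = [], 0
    for i in order:
        drawn.append(i)
        if population[i] > 0:
            found += 1
            if found == k:
                break
    return drawn

# Hypothetical population: 20 underpayments of $50 among 1,000 transactions.
population = [50.0] * 20 + [0.0] * 980
sizes = [len(sample_until_k_errors(population, k=2, seed=s)) for s in range(20)]
print(min(sizes), max(sizes))
```

Running the procedure with different seeds gives different sample sizes, which illustrates why the sample size is a random variable under this design. Estimators for such samples must account for the stopping rule; the simple projection used with fixed-size samples does not carry over unchanged.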
Missing transactions create a troublesome concern in collecting sample data. How should a missing transaction be treated? The most frequent approach is to replace the missing observation with the observation from an additionally sampled unit. Alternatively, statistical techniques for estimating missing observations can be used. A sample that contains several missing transactions is certain to raise a red flag in the eyes of the auditor.
Estimation
Point and interval estimates result from probability samples. A point estimate is a single number that is chosen to best estimate an unknown population parameter. In sales and use tax audits, the targeted population parameter is typically the total amount of underpaid taxes. For simple random sampling, for example, the sample mean can be used to estimate the population parameter provided that the total number of transactions in the population is known.
A point estimate of a population parameter does not provide information about the reliability of the estimator. To do that, it is necessary to provide a confidence interval estimate of the population parameter. We might say, for example, that our point estimate of the total amount of tax underpayment by a corporation in a three-year period is $5.0 million, and that we are 95% confident that the total amount of tax underpayment is between $4.5 and $5.5 million. The width of the confidence interval provides a measure of the reliability of the point estimator. As the width of the confidence interval increases, the reliability we place in the point estimator value decreases.
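For a simple random sample, the point estimate of the total and an approximate confidence interval can be sketched in Python as follows. This is a textbook calculation under assumptions: it uses the normal approximation with a multiplier of 1.96 for 95 percent confidence and a finite population correction, and the function name and data are hypothetical.

```python
import math
import statistics

def total_estimate_with_ci(sample_errors, population_size, z=1.96):
    """Estimate the population total from a simple random sample and attach
    an approximate confidence interval (normal approximation assumed).

    Returns (point_estimate, lower_limit, upper_limit).
    """
    n = len(sample_errors)
    mean = statistics.fmean(sample_errors)
    sd = statistics.stdev(sample_errors)          # sample standard deviation
    fpc = math.sqrt(1 - n / population_size)      # finite population correction
    half_width = z * population_size * sd / math.sqrt(n) * fpc
    point = population_size * mean
    return point, point - half_width, point + half_width

# Hypothetical sample: 98 correct items, a $50 underpayment, a $30 overpayment.
errors = [0.0] * 98 + [50.0, -30.0]
point, lower, upper = total_estimate_with_ci(errors, 100_000)
print(round(point), round(lower), round(upper))
```

With only two nonzero errors in the sample, the interval is wide relative to the point estimate and extends below zero, which is the kind of information a point estimate alone conceals.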
In the popular press, the confidence interval is generally reported in terms of the margin of error of the estimator (e.g., “It is predicted that a political candidate will receive 60% of the vote with a margin of error of 4%”). The margin of error is approximately the amount to be added and subtracted from the point estimate to produce a 95% confidence interval estimate (e.g., for the example above, the 95% confidence interval is from 56% to 64% of the vote). The margin of error or the confidence interval half-width is determined based on the sample size, the degree of confidence required, and on the variation among the sampled observations. A statistician views a point estimate without a confidence interval estimate as deficient, for there is no indication provided about the reliability of the estimate. Yet, at the present time, only 14 states compute confidence interval estimates in sales and use tax audits based on statistical samples.
Estimating the confidence interval requires an assumption about the probability distribution function of errors in the population. The typical assumption underlying the most commonly used statistical formulas is that the errors are distributed according to the normal distribution. The normal distribution is the “bell curve” illustrated in statistics textbooks.
Presuming a normal distribution for sales and use tax error estimation is not reasonable for most sales and use tax audits. Normality assumes the errors are distributed continuously in small increments around the mean, such as one cent, ranging from minus infinity to positive infinity. However, in use tax samples, the errors are typically bunched in discrete groups. For example, in a use tax purchase invoice population, 95 percent of the population and sample have zero errors; 3 percent incorrectly paid no tax (deficiency); 2 percent overpaid tax (refund); and virtually none partially paid tax. In this use tax example, the distribution is best modeled by a discrete three-state model (no error, deficiency, and refund), instead of a continuous distribution assumed by the normal distribution.
The confidence interval can be used in another way to assess tax underpayment or overpayment. Instead of using the point estimate, the lower 95% confidence interval limit could be used for assessment. In the above illustration, the lower 95% confidence limit is $1.8 million. Since the taxpayer has already paid $2 million, which exceeds this amount, no deficiency assessment is made. New York has used the lower 95% confidence limit in its statistical sampling audits as a method of reducing some disputes with taxpayers.
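The lower-limit assessment rule can be written as a one-line Python function. This sketch assumes that the lower confidence limit refers to the estimated total tax liability and that the assessment is the excess, if any, of that limit over the tax already paid; the function name is hypothetical and the rule is not presented as any state's actual procedure.

```python
def deficiency_from_lower_limit(lower_limit, tax_paid):
    """Assess only the amount by which the lower confidence limit on total
    liability exceeds the tax already paid (never a negative assessment)."""
    return max(0.0, lower_limit - tax_paid)

# Figures from the illustration: $1.8 million lower limit, $2 million paid.
print(deficiency_from_lower_limit(1_800_000.0, 2_000_000.0))  # 0.0
```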
The use of confidence interval estimates for either hypothesis testing or deficiency estimation is controversial. Texas and some other states argue strongly against using the confidence interval estimate as a basis of assessment. They argue that it benefits the taxpayer at the expense of the state and adds cost to the audit by increasing sample sizes to ensure that the confidence interval estimator is valid and reliable.
Many opportunities exist for applied and interdisciplinary research in sales and use tax audit methodology. Ideally, these issues should be studied and discussed by scholars and practitioners from many disciplines. Legal scholars could survey existing statutes, regulations, and court cases. Statisticians could identify scientifically sound procedures. Public policy analysts could assess the impact of methodology changes on revenue collections and relations between taxpayers and government agencies. Educators could design and implement sampling training programs for staff auditors in government, corporations, and consulting firms. Professional associations could fund research and facilitate the exchange of ideas.