[Repost] Benford's Law: Detecting Anomalous and Fraudulent Data

Posted 2021-1-8 22:37 | Category: Paper Reading and Summaries | System category: Paper Exchange | Source: repost

Benford's Law: Potential Applications for Insider Threat Detection


Detecting anomalous network activity is a powerful way to discover insider threat activities, but establishing baseline traffic and processing traffic data can be time-consuming. This post explores how a mathematical law, already used in forensic accounting, may help detect insider activity without the effort of traditional anomaly detection.

Benford's law of anomalous numbers states that generally, in naturally occurring collections of numbers, the leading digit is likely to be small. The resulting downward-sloping curve can be used as a baseline for determining whether a dataset is genuine or fabricated.

Accountants often compare the leading digits of financial transaction data, such as ledger entries, to a Benford curve to spot anomalies that may indicate fraud. The same technique can be used to detect irregular network activity and other data that may indicate malicious insider activity.


Benford's law is grounded in base-10 logarithms. A number x begins with digit d when the fractional part of log10(x) lies in the interval from log10(d) to log10(d+1), which has length log10(d+1) - log10(d) = log10(1+1/d). That length is the probability that d is the leading digit. Plugging in the digits 1 through 9 shows that each subsequent digit is less likely to be the leading digit.
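
As a quick check on the formula above, the following minimal Python sketch tabulates log10(1+1/d) for the digits 1 through 9. The function name `benford_probability` is illustrative and does not come from the original post.

```python
import math

def benford_probability(d: int) -> float:
    """Expected share of numbers whose leading digit is d (1-9) under Benford's law."""
    if not 1 <= d <= 9:
        raise ValueError("leading digit must be between 1 and 9")
    return math.log10(1 + 1 / d)

# Print the expected share for each leading digit.
for d in range(1, 10):
    print(f"{d}: {benford_probability(d):.3f}")
```

The nine values sum to 1 because the intervals tile the whole span from log10(1) = 0 to log10(10) = 1.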

Diagram of a number line showing the digits 1 through 9. The intervals between consecutive digits grow smaller as the numbers grow larger. Along the bottom of the number line are the base-10 logarithm values of each number, which mark the probability intervals. The base-10 logarithm of 1 is 0; of 2 is 0.30; of 3 is 0.48; of 4 is 0.60; of 5 is 0.70; of 6 is 0.78; of 7 is 0.85; of 8 is 0.90; of 9 is 0.95; and of 10 is 1.

Figure 1: Logarithmic Intervals of Leading Digits, Based on log10(x)

The size of the number doesn't matter. Whether you're dealing with five-digit or two-digit numbers, the probability of a given leading digit can be predicted for data fitting Benford assumptions by looking at the first two decimals of the base-10 log of the number.

Consider 1,002: log10(1,002) ≈ 3.000867. The first two decimals, .00, fall within the .00-.30 interval, which holds the base-10 log values of numbers with a leading digit of 1. The width of this interval reflects the fact that about 30 percent of naturally occurring numbers that fit the Benford assumptions have a leading digit of 1. Similarly, consider 52: log10(52) ≈ 1.716003. The first two decimals, .71, fall within the .70-.78 interval, which holds the base-10 log values of numbers with a leading digit of 5.
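
The lookup described here can be sketched in Python as follows. The function name `leading_digit_from_log` is hypothetical, and the sketch assumes positive inputs. Values that sit exactly on an interval boundary (such as 50 or 200) can be misclassified by floating-point rounding, so a production check would more likely read the digit from the number's decimal representation.

```python
import math

def leading_digit_from_log(x: float) -> int:
    """Leading digit of x > 0, recovered from the fractional part of log10(x)."""
    if x <= 0:
        raise ValueError("x must be positive")
    fraction = math.log10(x) % 1.0  # about 0.000867 for 1,002 and 0.716 for 52
    # Digit d owns the interval [log10(d), log10(d + 1)); scan from 9 down to 1.
    for d in range(9, 0, -1):
        if fraction >= math.log10(d):
            return d
    return 1

print(leading_digit_from_log(1002))
print(leading_digit_from_log(52))
```

Because only the fractional part of the logarithm is used, the magnitude of the number drops out, which is why two-digit and five-digit numbers land on the same intervals.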

The table shows the base-10 logarithmic values of example 2-, 3-, 4-, and 5-digit numbers starting with the digits 1 through 9. The first two decimals of each logarithm value fall within the probability intervals shown in Figure 1. For example, the first two decimals of the base-10 log are .71 for 52, .70 for 502, and .69 for both 5,002 and 50,002. The last two are truncations of values near 0.699, which sit just above log10(5) ≈ 0.69897, the lower bound that Figure 1 rounds to 0.70.

Table 1: Example of Base-10 Logs for Leading Digits 1-9

The conclusion from all this math: numbers in a dataset that fits all the Benford assumptions should follow this distribution of leading digits, with 1 being the most common and 9 being the least.

The graph shows the expected Benford distribution of leading digits. The values slope down from 30% for leading digit 1 to under 5% for leading digit 9.

Figure 2: Probability Distribution of Leading Digits Under Benford's Law

For a conclusion on a Benford curve to be valid, the data must (1) be numeric, (2) be randomly generated, (3) be large, and (4) represent magnitudes of events. Many types of data fit these assumptions, including population counts, accounting data, and network traffic. Data comprising numbers used as identifiers, such as phone numbers and social security numbers, violates the assumption that the data is generated randomly.

The graph shows the distribution of the leading digits of a population count, overlaid onto a Benford logarithm curve. The curves roughly match.

Figure 3: Leading Digit Distribution of Population Data

Application in Accounting

Benford's law is widely used in accounting to examine data for anomalies that may indicate fraud. Accountancy data generally follows the four assumptions required for a valid conclusion on a Benford curve: general ledgers, income statements, and inventory listings can all be compared to the curve to determine genuineness.

This analysis may be admissible evidence of fraud in federal and state courts. The forensic accounting community generally accepts the methodology, which is referenced in the Fraud Examiners Manual. Forensic accountants, fraud examiners, accountants, and auditors use Benford's law to detect anomalies that require investigation. The combination of the method's widely accepted usage, academic reputation, and wide availability of experts makes the admissibility of Benford analyses likely.

Shifting the Framework to Technical Insider Threat

Network traffic typically follows the four assumptions required for a conclusion on the Benford curve to be valid. Benford analysis's long-standing use in accounting and its suitability for information security's naturally generated data make the process viable for technical insider threat detection. Benford analysis is especially useful in detecting both highly likely and unlikely data points, so it serves as a dual measure of both normalcy and aberration.

Current cybersecurity systems rely heavily on identifying anomalous behaviors. Looking only for known signatures does not address the breadth of the threat landscape; unknown signatures are equally important. Anomaly detection is generally hard to establish because creating a baseline traffic profile and processing the large amount of traffic data are time-consuming processes.

Benford's law can help avoid the effort of baseline-derived anomaly detection. If the network traffic conforms to the assumptions of Benford's law, any traffic data deviating from the Benford curve can be considered an anomaly. The Benford curve supplies the expected distribution directly, so much of the legwork of deriving a baseline by hand is avoided.
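
One common way to quantify "deviating from the Benford curve" is a Pearson chi-square goodness-of-fit test on the leading digits. The sketch below is an assumption about how such a check might look, not the method used in the original post. The function names are illustrative, the inputs are assumed to be finite numbers such as byte counts or packet lengths, and the sample data is synthetic.

```python
import math
from collections import Counter

BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Chi-square critical value for 8 degrees of freedom at the 5% significance level.
CRITICAL_5_PERCENT = 15.507

def leading_digit(value: float) -> int:
    """First significant digit of a nonzero finite number."""
    return int(f"{abs(value):.16e}"[0])  # scientific notation, e.g. '5.2...e+01'

def chi_square_vs_benford(values) -> float:
    """Pearson chi-square statistic of observed leading digits against Benford's law."""
    observed = Counter(leading_digit(v) for v in values if v != 0)
    n = sum(observed.values())
    if n == 0:
        raise ValueError("need at least one nonzero value")
    return sum(
        (observed.get(d, 0) - n * p) ** 2 / (n * p)
        for d, p in BENFORD.items()
    )

# Synthetic data: values spread evenly on a log scale conform to Benford's law,
# while the integers 100..999 have equally frequent leading digits and do not.
conforming = [10 ** (k / 1000) for k in range(1000)]
uniform_digits = list(range(100, 1000))
print(chi_square_vs_benford(conforming) > CRITICAL_5_PERCENT)
print(chi_square_vs_benford(uniform_digits) > CRITICAL_5_PERCENT)
```

The chi-square statistic grows with sample size, so very large traffic captures can exceed the critical value even for small, harmless deviations. The threshold is best treated as a starting point to tune.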

A small-scale example application of this technique can be demonstrated with spreadsheet macros.

Insider Threat Applications

To demonstrate the potential applications of Benford's law to insider threat detection, let's explore some scenarios inspired by those we capture in the CERT Insider Threat Incident Corpus.

Fraudulent Invoices

An employee creates fictitious invoice charge data to hide their illicit activity by randomly typing numbers on the horizontal number keys. Another employee notices irregularities in the Benford analysis of the invoice data, and the employee who created the fictitious data is caught.

In this situation, the digits 4, 5, 6, and 7 occur as the leading digits more frequently because of the employee's hand placement on the number keys. Even fabricated data that seems random can be separated from genuine data.

The chart shows the distribution of leading digits of manually generated invoice charges, overlaid on the expected Benford curve. The manually entered leading digits 4, 5, 6, and 7 exceed the Benford curve.

Figure 5: Data Generated by Typing on the Horizontal Number Keys
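
The comparison behind this figure can be sketched as a check for digits whose observed share exceeds the Benford share by some margin. Everything in the sketch is illustrative: the function names, the 0.05 margin, and the invoice amounts are made up, and twelve values are far too few for a real analysis, which assumes a large dataset.

```python
import math
from collections import Counter

def leading_digit_shares(amounts):
    """Observed share of each leading digit 1-9 among positive amounts."""
    digits = [int(f"{a:.16e}"[0]) for a in amounts if a > 0]
    if not digits:
        raise ValueError("need at least one positive amount")
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}

def overrepresented_digits(amounts, margin=0.05):
    """Digits whose observed share exceeds the Benford share by more than `margin`."""
    shares = leading_digit_shares(amounts)
    return [
        d for d in range(1, 10)
        if shares[d] - math.log10(1 + 1 / d) > margin
    ]

# Made-up invoice amounts, skewed toward the middle number keys for illustration.
typed_amounts = [412.50, 538.00, 655.10, 701.99, 467.25, 590.40,
                 623.00, 745.75, 129.99, 880.00, 512.30, 678.45]
print(overrepresented_digits(typed_amounts))  # [4, 5, 6, 7]
```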

Data Exfiltration

A disgruntled co-founder of a tech company argues with his partner and decides to leave the company, but not before downloading large trade-secret files. The co-founder has authorized access to the trade secrets and regularly views and works with the files. He deals with numerous uploads and downloads on a daily basis, so he doesn't think he'll get caught.

Measures of network traffic generally follow a Benford curve. Though the co-founder typically deals with the trade secrets and has high network usage, his unexpected increase in normal network activity shifts the distribution of leading digits in the company's network traffic, signaling an abnormality. An analytic to detect changes in the statistical distribution of network activity triggers an alert of suspicious activity. In this case, the co-founder does not get away with it.
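
An analytic of this kind could be sketched as a per-window score: compute the mean absolute deviation (MAD) between each window's leading-digit shares and the Benford shares, and alert when it crosses a threshold. The function names, the 0.03 threshold, and the transfer sizes below are assumptions chosen for illustration, not values from the original post.

```python
import math
from collections import Counter

BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]

def benford_mad(values) -> float:
    """Mean absolute deviation between observed and Benford leading-digit shares."""
    digits = [int(f"{v:.16e}"[0]) for v in values if v > 0]
    if not digits:
        raise ValueError("need at least one positive value")
    counts = Counter(digits)
    return sum(
        abs(counts.get(d, 0) / len(digits) - BENFORD[d - 1])
        for d in range(1, 10)
    ) / 9

def flag_windows(windows, threshold=0.03):
    """Indices of windows whose leading-digit distribution strays past `threshold`."""
    return [i for i, w in enumerate(windows) if benford_mad(w) > threshold]

# Synthetic transfer sizes in bytes: a Benford-conforming window, then the same
# window with 300 extra large downloads of similar size added to it.
normal_window = [10 ** (k / 200) for k in range(1000)]
exfil_window = normal_window + [8.2e8] * 300
print(flag_windows([normal_window, exfil_window]))  # [1]
```

The score depends only on leading digits, so it reacts to a change in the shape of the distribution rather than to raw volume. A burst of transfers spread evenly across magnitudes would not necessarily raise it.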

IT Sabotage

An employee finds out he is going to be laid off and decides to launch a denial-of-service (DoS) attack on the company's network. The company's IT department has recently established baseline interval times and packet lengths. They are quickly able to identify the anomaly caused by the employee and stop the attack.

Benford's law is especially useful in detecting DoS attacks because flooding a network with data breaks the naturalness of network traffic.
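
As a rough illustration of how flooding breaks that naturalness, the sketch below derives inter-arrival times from packet timestamps and reports how dominant the most common leading digit is. Under Benford's law the largest share is about 0.30, for digit 1. The function names and the flood pattern are hypothetical, and timestamps are assumed to be integer microseconds to avoid floating-point noise in the gaps.

```python
from collections import Counter

def inter_arrival_times(timestamps_us):
    """Gaps between consecutive packet timestamps (integer microseconds)."""
    ordered = sorted(timestamps_us)
    return [b - a for a, b in zip(ordered, ordered[1:])]

def dominant_digit_share(gaps):
    """Most common leading digit among positive gaps, and its share of those gaps."""
    digits = [int(str(g)[0]) for g in gaps if g > 0]
    if not digits:
        raise ValueError("need at least one positive gap")
    digit, count = Counter(digits).most_common(1)[0]
    return digit, count / len(digits)

# Hypothetical flood: one packet every 2,000 microseconds.
flood = [i * 2000 for i in range(500)]
print(dominant_digit_share(inter_arrival_times(flood)))  # (2, 1.0)
```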

Final Thoughts

It is important to use the resources that we already have access to. Many accounting departments have longstanding experience with Benford analyses, so applying the Benford framework to an information security context should be simpler than creating new techniques for monitoring threshold activity. This control does not rely on labeled historical data. Instead, it leverages the data's natural conformity to the assumptions of Benford's law and tests that conformity against the Benford expectation.

Not all organizational data fits the Benford assumptions. For example, organizations that consistently facilitate transactions with high leading digits may find that the Benford method is of limited use. In the future, we could compare the return on investment and efficacy of Benford analysis for anomaly detection against more conventional statistical methods used for insider threat, such as Bayes' theorem.


