TDM 10100: Project 9 — 2022
Benford’s Law
Motivation: Benford’s law has many applications, including its well-known use in fraud detection. It also helps detect anomalies in naturally occurring datasets.
Scope: 'R' and functions
Dataset(s)
The following questions will use the following dataset(s):
-
/anvil/projects/tdm/data/election/escaped2020sample.txt
Helpful Hint
A txt file and a csv file both store information in plain text. In csv files the fields are separated by commas. In txt files the fields can be separated by commas, semicolons, tabs, or other characters such as the pipe (|).
To read this pipe-delimited txt file with read.csv, we simply add sep="|" (see code below).
myDF <- read.csv("/anvil/projects/tdm/data/election/escaped/escaped2020sample.txt", sep="|")
Questions
Benford’s law (also known as the first-digit law) states that, in many naturally occurring collections of numbers, the leading digit is most likely to be small.
It is a probability distribution that gives the likelihood of each digit from 1 to 9 being the first digit of a number in a dataset.
Another way to understand Benford’s law is to know that it helps us assess the relative frequency distribution for the leading digits of numbers in a dataset. It states that leading digits with smaller values occur more frequently.
Insider Knowledge
A probability distribution helps define the probability of an event happening. It can describe simple events like a coin toss, or complex events such as the outcome of a drug treatment. Common examples include:
-
Basic probability distributions which can be shown on a probability distribution table.
-
Binomial distributions, which have “Successes” and “Failures.”
-
Normal distributions, sometimes called a Bell Curve.
Remember that the sum of all the probabilities in a distribution is always 100%, or 1 as a decimal.
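As a small illustration of that last point, the sketch below uses a binomial distribution (the number of heads in 10 fair coin tosses, an example chosen here and not part of the project data) and checks that its probabilities add up to 1.

```r
# Probabilities of seeing 0, 1, ..., 10 heads in 10 fair coin tosses
probs <- dbinom(0:10, size = 10, prob = 0.5)

# The probabilities of all possible outcomes add up to 1
total <- sum(probs)
print(total)
```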
Helpful Hint
Benford’s law is stated in terms of the significand S(x) of a number, which is the number rewritten in a standard format.
To find the significand:
-
Find the first non-zero digit
-
Move the decimal point to the right of that digit
-
Ignore the sign
For example, 9087 and -0.9087 both have the significand S(x) = 9.087.
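The three steps above can be sketched in R as follows. The function name significand is hypothetical (it is not a built-in R function), and the sketch assumes the input is a non-zero number, since zero has no first non-zero digit.

```r
# Hypothetical helper: significand S(x) of a non-zero number
significand <- function(x) {
  x <- abs(x)                 # ignore the sign
  # floor(log10(x)) locates the first non-zero digit; dividing by that
  # power of 10 moves the decimal point to just after that digit
  x / 10^floor(log10(x))
}

s1 <- significand(9087)      # approximately 9.087
s2 <- significand(-0.9087)   # approximately 9.087
print(c(s1, s2))
```

Because of floating-point rounding, the results are best compared with all.equal rather than with ==.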
Benford’s law can also be extended to the second, third, and later digits, and to the probability of certain combinations of digits.
The law typically does not apply to datasets that are restricted to a narrow range between a minimum and a maximum, or to datasets of assigned numbers (e.g., social security numbers or phone numbers) rather than naturally occurring numbers.
Larger datasets, and data that range over multiple orders of magnitude from low to high, tend to follow Benford’s law well.
Benford’s law is given by the equation below.
$P(d) = \dfrac{\ln((d+1)/d)}{\ln(10)}$
$d$ is the leading digit of a number (and $d \in \{1, \cdots, 9\}$)
For example, the probability of the first digit being a 1 is
$P(1) = \dfrac{\ln((1+1)/1)}{\ln(10)} = \dfrac{\ln(2)}{\ln(10)} \approx 0.301$
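As a quick check that the formula really defines a probability distribution, the nine probabilities sum to 1, because the product inside the logarithm telescopes:
$\sum_{d=1}^{9} P(d) = \dfrac{1}{\ln(10)} \sum_{d=1}^{9} \ln\left(\dfrac{d+1}{d}\right) = \dfrac{1}{\ln(10)} \ln\left(\dfrac{2}{1} \cdot \dfrac{3}{2} \cdots \dfrac{10}{9}\right) = \dfrac{\ln(10)}{\ln(10)} = 1$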
ONE
-
Create a function called benfords_law that takes the argument digit, and calculates the probability of digit being the starting figure of a random number based on Benford’s law.
-
Create a vector named digits with numbers 1-9.
-
Now use the benfords_law function to create a plot (could be a bar plot, line plot, dot plot, etc., anything is OK) that shows the likelihood of digits occurring.
-
Code used to solve this problem.
-
Output from running the code.
TWO
-
Read in the elections data (we have used this previously) into a dataset named myDF.
-
Create a vector called firstdigit with the first digit from the TRANSACTION_AMT column, and then plot it (again, could be a bar plot, line plot, dot plot, etc., anything is OK).
-
Does it look like it follows Benford’s law? Why or why not?
Helpful Hint
Use this to help plot:
firstdigit <- as.numeric(firstdigit)
hist(firstdigit)
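One possible way to pull out and tabulate first digits is sketched below on a small made-up vector (the values in amounts are invented for illustration and are not taken from the election data). It assumes the numbers are positive, so that the first character of each number written as text is its leading digit.

```r
# Toy amounts, made up for illustration (not taken from the election data)
amounts <- c(25, 100, 1500, 38, 250, 12, 1000, 75, 19, 200)

# Take the first character of each number written as text
first_digit <- as.numeric(substr(as.character(amounts), 1, 1))

# Count how often each digit 1-9 occurs, keeping digits that never appear
counts <- table(factor(first_digit, levels = 1:9))

# Convert the counts to relative frequencies
observed <- counts / sum(counts)

# Bar plot with exactly one bar per digit
barplot(observed, xlab = "First digit", ylab = "Relative frequency")
```

A bar plot of a table draws one bar per digit, which can make the comparison with Benford’s law easier to read than a histogram whose bins are chosen automatically.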
-
Code used to solve this problem.
-
Output from running the code.
THREE
Create a function that will look at both the EMPLOYER and the OCCUPATION columns and return a new data frame with an added column named Employed that is FALSE if EMPLOYER is "NOT EMPLOYED", and is FALSE if OCCUPATION is "NOT EMPLOYED", and is TRUE otherwise.
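The general pattern of adding a logical column inside a function can be sketched on a tiny made-up data frame, as below. The function name add_employed and the toy rows are hypothetical; only the column names EMPLOYER, OCCUPATION, and Employed come from the question. This is one possible approach, not the only acceptable one.

```r
# Hypothetical helper name; the column names follow the question text
add_employed <- function(df) {
  # Employed is FALSE when either column says "NOT EMPLOYED", TRUE otherwise
  df$Employed <- !(df$EMPLOYER == "NOT EMPLOYED" |
                   df$OCCUPATION == "NOT EMPLOYED")
  df   # return the data frame with the new column
}

# Toy data frame, made up for illustration
toy <- data.frame(
  EMPLOYER   = c("ACME", "NOT EMPLOYED", "SELF"),
  OCCUPATION = c("ENGINEER", "NOT EMPLOYED", "NOT EMPLOYED")
)

result <- add_employed(toy)
print(result)
```

Rows with a missing (NA) value in either column may come out as NA rather than TRUE or FALSE under this sketch, so real data may need an explicit decision about how to treat them.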
-
Code used to solve this problem.
-
Output from running the code.
FOUR
How many arguments does the above function have? What does each line do? Use #comment to explain your function.
Using a graph, can you show the percentage of individuals employed vs not employed?
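As a hedged sketch of the plotting step, the code below turns a logical vector into percentages and draws them. The vector employed is made up for illustration and stands in for an Employed column.

```r
# Toy logical vector standing in for an Employed column (made-up values)
employed <- c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE)

# Share of each value as a percentage; table() lists FALSE before TRUE
pct <- prop.table(table(employed)) * 100

# One bar per group, labelled in the same FALSE, TRUE order
barplot(pct, names.arg = c("Not employed", "Employed"), ylab = "Percent")
```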
-
Code used to solve this problem.
-
Output from running the code.
FIVE
Write your own custom function! Make sure your function has at least two arguments, and get creative. Your function could output a plot, or search for and find information within the data.frame. Use what you have learned in Projects 8 and 9 to help guide you.
-
Code used to solve this problem.
-
Output from running the code.
Please make sure to double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it, to make sure that what you think you submitted is what you actually submitted.
In addition, please review our submission guidelines before submitting your project.