TDM 20100: Project 5 — 2022
Motivation: awk
is a programming language designed for text processing. It can be a quick and efficient way to quickly parse through and process textual data. While Python and R definitely have their place in the data science world, it can be extremely satisfying to perform an operation extremely quickly using something like awk
.
Context: This is the first of three projects where we introduce awk
. awk
is a powerful tool that can be used to perform a variety of the tasks that we’ve previously used other UNIX utilities for. After this project, we will continue to utilize all of the utilities, and bash scripts, to perform tasks in a repeatable manner.
Scope: awk, UNIX utilities
Dataset(s)
The following questions will use the following dataset(s):
-
/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt
Questions
Question 1
While the UNIX tools we’ve used up to this point are very useful, awk
enables many new capabilities, and can even replace major functionality of other tools.
In a previous question, we asked you to write a command that printed the number of columns in the dataset. Perform the same operation using awk
.
Similarly, we’ve used head
to print the header line. Use awk
to do the same.
Similarly, we’ve used wc
to count the number of lines in the dataset. Use awk
to do the same.
-
Code used to solve this problem.
-
Output from running the code.
Question 2
In a previous question, we used sort
in combination with uniq
to find the stores with the most number of sales.
Use awk
to find the 10 stores with the most number of sales. In a previous solution, our output was minimal — we had a count and a store number. This time, take some time to format the output nicely, and use the store number to find the count (not store name).
Sorting an array by values in |
You can even use the store number to count the number of sales and save the most recent store name for the store number as you go to print the store names with the output. |
You can pipe output to the
|
-
Code used to solve this problem.
-
Output from running the code.
Question 3
Calculate the total sales (in USD). Do this using only awk
.
gsub(/\$/, "", $2) |
The |
-
Code used to solve this problem.
-
Output from running the code.
Question 4
Calculate the total sales (in USD) by county. Do this using only awk
. Format your output so it looks like the following.
FRANKLIN: $386729.06 HARRISON: $401811.83 Franklin: $2102880.14 Harrison: $2109578.24
Notice anything odd about the result? Look carefully at the dataset and suggest an alternative method that would clean up the issue.
You can see the issue in our tiny sample of output. |
-
Code used to solve this problem.
-
Output from running the code.
Question 5
awk
is extremely powerful, and this liquor dataset is pretty interesting! We haven’t covered everything awk
(and we won’t).
Look at the dataset and ask yourself an interesting question about the data. Use awk
to solve your problem (or, at least, get you closer to answering the question). Explore various stackoverflow questions about awk
and awk
guides online. Try to incorporate an awk
function you haven’t used, or a awk
trick you haven’t seen. While this last part is not required, it is highly encouraged and can be a fun way to learn something new.
You do not need to limit yourself to just use |
-
A markdown cell containing the question you are trying to answer about the dataset.
-
Code used to solve this problem.
-
Output from running the code.
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted. In addition, please review our submission guidelines before submitting your project. |