STAT 29000: Project 1 — Fall 2021

Markitdown, your first project back in The Data Mine

Motivation: It’s been a long summer! Last year, you got some exposure to both R and Python. This semester, we will venture away from R and Python, and focus on UNIX utilities like sort, awk, grep, and sed. While Python and R are extremely powerful tools that can solve many problems — they aren’t always the best tool for the job. UNIX utilities can be an incredibly efficient way to solve problems that would be much less efficient using R or Python. In addition, there will be a variety of projects where we explore SQL using sqlite3 and MySQL/MariaDB.

We will start slowly, however, by learning about Jupyter Lab. This year, instead of using RStudio Server, we will be using Jupyter Lab. In this project we will become familiar with the new environment, review some, and prepare for the rest of the semester.

Context: This is the first project of the semester! We will start with some review, and set the "scene" to learn about some powerful UNIX utilities, and SQL the rest of the semester.

Scope: Jupyter Lab, R, Python, scholar, brown, markdown

Learning Objectives
  • Read about and understand computational resources available to you.

  • Learn how to run R code in Jupyter Lab on Scholar and Brown.

  • Review R and Python.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

The following questions will use the following dataset(s):

/depot/datamine/data/

Questions

Question 1

In previous semesters, we’ve used a program called RStudio Server to run R code on Scholar and solve the projects. This year, we will be using Jupyter Lab almost exclusively. Let’s being by launching your own private instance of Jupyter Lab using a small portion of the compute cluster.

Navigate and login to ondemand.anvil.rcac.purdue.edu using 2-factor authentication (ACCESS login on Duo Mobile). You will be met with a screen, with lots of options. Don’t worry, however, the next steps are very straightforward.

In the not-to-distant future, we will be using both Scholar (gateway.scholar.rcac.purdue.edu) and Brown (ondemand.brown.rcac.purdue.edu) to launch Jupyter Lab instances. For now, however, we will be using Brown.

Towards the middle of the top menu, there will be an item labeled My Interactive Sessions, click on My Interactive Sessions. On the left-hand side of the screen you will be presented with a new menu. You will see that there are a few different sections: Datamine, Desktops, and GUIs. Under the Datamine section, you should see a button that says Jupyter Lab, click on Jupyter Lab.

If everything was successful, you should see a screen similar to the following.

OnDemand Jupyter Lab
Figure 1. OnDemand Jupyter Lab

Make sure that your selection matches the selection in Figure 1. Once satisfied, click on Launch. Behind the scenes, OnDemand uses SLURM to launch a job to run Jupyter Lab. This job has access to 1 CPU core and 3072 Mb of memory. It is OK to not understand what that means yet, we will learn more about this in STAT 39000. For the curious, however, if you were to open a terminal session in Scholar and/or Brown and run the following, you would see your job queued up.

squeue -u username # replace 'username' with your username

After a few seconds, your screen will update and a new button will appear labeled Connect to Jupyter. Click on Connect to Jupyter to launch your Jupyter Lab instance. Upon a successful launch, you will be presented with a screen with a variety of kernel options. It will look similar to the following.

Kernel options
Figure 2. Kernel options

There are 2 primary options that you will need to know about.

f2021-s2022

The course kernel where Python code is run without any extra work, and you have the ability to run R code or SQL queries in the same environment.

To learn more about how to run R code or SQL queries using this kernel, see our template page.

f2021-s2022-r

An alternative, native R kernel that you can use for projects with just R code. When using this environment, you will not need to prepend %%R to the top of each code cell.

For now, let’s focus on the f2021-s2022 kernel. Click on f2021-s2022, and a fresh notebook will be created for you.

Test it out! Run the following code in a new cell. This code runs the hostname command and will reveal which node your Jupyter Lab instance is running on. What is the name of the node you are running on?

import socket
print(socket.gethostname())

To run the code in a code cell, you can either press Ctrl+Enter on your keyboard or click the small "Play" button in the notebook menu.

Items to submit
  • Code used to solve this problem in a "code" cell.

  • Output from running the code (the name of the node you are running on).

Question 2

This year, the first step to starting any project should be to download and/or copy our project template (which can also be found on Scholar and Brown at /depot/datamine/apps/templates/project_template.ipynb).

Open the project template and save it into your home directory, in a new notebook named firstname-lastname-project01.ipynb.

There are 2 main types of cells in a notebook: code cells (which contain code which you can run), and markdown cells (which contain markdown text which you can render into nicely formatted text). How many cells of each type are there in this template by default?

Fill out the project template, replacing the default text with your own information, and transferring all work you’ve done up until this point into your new notebook. If a category is not applicable to you (for example, if you did not work on this project with someone else), put N/A.

Items to submit
  • How many of each types of cells are there in the default template?

Question 3

Last year, while using RStudio, you probably gained a certain amount of experience using RMarkdown — a flavor of Markdown that allows you to embed and run code in Markdown. Jupyter Lab, while very different in many ways, still uses Markdown to add formatted text to a given notebook. It is well worth the small time investment to learn how to use Markdown, and create a neat and reproducible document.

Create a Markdown cell in your notebook. Create both an ordered and unordered list. Create an unordered list with 3 of your favorite academic interests (some examples could include: machine learning, operating systems, forensic accounting, etc.). Create another ordered list that ranks your academic interests in order of most-interested to least-interested. To practice markdown, embolden at least 1 item in you list, italicize at least 1 item in your list, and make at least 1 item in your list formatted like code.

You can quickly get started with Markdown using this cheat sheet: www.markdownguide.org/cheat-sheet/

Don’t forget to "run" your markdown cells by clicking the small "Play" button in the notebook menu. Running a markdown cell will render the text in the cell with all of the formatting you specified. Your unordered lists will be bulleted and your ordered lists will be numbered.

If you are having trouble changing a cell due to the drop down menu behaving oddly, try changing browsers to Chrome or Safari. If you are a big Firefox fan, and don’t want to do that, feel free to use the %%markdown magic to create a markdown cell without really creating a markdown cell. Any cell that starts with %%markdown in the first line will generate markdown when run.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 4

Browse www.linkedin.com and read some profiles. Pay special attention to accounts with an "About" section. Write your own personal "About" section using Markdown in a new Markdown cell. Include the following (at a minimum):

  • A header for this section (your choice of size) that says "About".

    A Markdown header is a line of text at the top of a Markdown cell that begins with one or more #.

  • The text of your personal "About" section that you would feel comfortable uploading to LinkedIn.

  • In the about section, for the sake of learning markdown, include at least 1 link using Markdown’s link syntax.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 5

Read the templates page and learn how to run snippets of code in Jupyter Lab other than Python. Run at least 1 example of Python, R, SQL, and bash. For SQL and bash, you can use the following snippets of code to make sure things are working properly.

-- Use the following sqlite database: /depot/datamine/data/movies_and_tv/imdb.db
SELECT * FROM titles LIMIT 5;
ls -la /depot/datamine/data/movies_and_tv/

For your R and Python code, use this as an opportunity to review your skills. For each language, choose at least 1 dataset from /depot/datamine/data, and analyze it. Both solutions should include at least 1 custom function, and at least 1 graphic output. Make sure your code is complete, and well-commented. Include a markdown cell with your short analysis, for each language.

You could answer any question you have about your dataset you want. This is an open question, just make sure you put in a good amount of effort. Low/no-effort solutions will not receive full credit.

Once done, submit your projects just like last year. See the submissions page for more details.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • 1-2 sentence analysis for each of your R and Python code examples.

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.