TDM 10100: Project 4 — Fall 2022
Many data science tools, including R, have powerful ways to index data.
Insider Knowledge
R typically has operations that are vectorized and there is little to no need to write loops.
R typically also uses indexing instead of using an if statement.
- Sequential statements (one after another), i.e.,
  - print line 45
  - print line 15
if/else statements
determine which code runs next based on a logical condition.
if statement example:
x <- 7
if (x > 0) {
  print("Positive number")
}
else statement example:
x <- -10
if (x >= 0) {
  print("Non-negative number")
} else {
  print("Negative number")
}
In R, we can classify many numbers all at once:
x <- c(-10,3,1,-6,19,-3,12,-1)
mysigns <- rep("Non-negative number", times = length(x))
mysigns[x < 0] <- "Negative number"
mysigns
Context: As we continue to become more familiar with R, this project will help reinforce the many ways of indexing data in R.
Scope: r, data.frames, indexing.
Make sure to read about and use the template found here, and the important information about project submissions here.
Using the f2022-s2023-r kernel
Let's first see all of the files that are in the craigslist folder:
list.files("/anvil/projects/tdm/data/craigslist")
After looking at several of the files, we will go ahead and read in the data frame on the vehicles:
myDF <- read.csv("/anvil/projects/tdm/data/craigslist/vehicles.csv", stringsAsFactors = TRUE)
Helpful Hints
Remember:
- To see the file size of the CSV (i.e., how large it is):
file.info("/anvil/projects/tdm/data/craigslist/vehicles.csv")$size
- You can also use file.info to see other information about the file.
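As a minimal sketch of what file.info returns, the example below writes a small throwaway CSV to a temporary location (the file and its contents are hypothetical, used only so the code runs anywhere) and inspects a few of the fields.

```r
# Write a small throwaway CSV so the example does not depend on any real path.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, price = c(1500, 9000, 20000)), tmp, row.names = FALSE)

info <- file.info(tmp)   # one-row data.frame describing the file
info$size                # size in bytes
info$isdir               # FALSE for a regular file
info$mtime               # time the file was last modified

# Convert bytes to megabytes for easier reading.
size_mb <- info$size / 1024^2

unlink(tmp)              # remove the temporary file
```

Dividing by 1024^2 is only a convenience for reading large sizes; the raw value is always in bytes.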
ONE
It is so important that, each time we look at data, we start by becoming familiar with the data.
In past projects we have looked at the head/tail along with the structure and the dimensions of the data. We want to continue this practice.
This dataset has 25 columns, and we are unable to see all of them without adjusting the display width. We can do this with:
options(repr.matrix.max.cols=25, repr.matrix.max.rows=200)
We also remember (from the previous project) that we can set the output in R to look more natural this way:
options(jupyter.rich_display = FALSE)
Helpful Hint
You can look at the first 6 rows (head), the last 6 rows (tail), the structure (str), and/or the dimensions (dim) of the dataset.
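The four functions named in this hint can be tried on a small data frame first. The sketch below uses a hypothetical toyDF (its columns and values are made up for illustration) rather than the project data.

```r
# Hypothetical toy data standing in for a real dataset.
toyDF <- data.frame(
  year  = c(2008, 2011, 2015, 2019, 2020, 2003, 2012, 2017),
  price = c(3500, 8000, 12000, 21000, 25000, 1200, 9500, 15000)
)

head(toyDF)         # first 6 rows
tail(toyDF, n = 3)  # last 3 rows (the default n is 6)
str(toyDF)          # column names, types, and a preview of the values
dim(toyDF)          # number of rows, then number of columns
```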
- How many unique regions are there in total? Name 5 of the different regions that are included in this dataset.
- How many cars were manufactured in 2011 or later?
- In what year was the oldest model manufactured? In what year was the most recent model manufactured? In which year were the most cars manufactured?
Helpful Hint
To sort and order a single vector you can use this code:
head(myDF$year[order(myDF$year)])
You can also use the sort function, as demonstrated in earlier projects.
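To illustrate how order and sort differ, here is a small sketch on a hypothetical vector of years that includes a missing value. The vector is invented for illustration and is not taken from the dataset.

```r
years <- c(2015, NA, 1999, 2020, 2011, 1999)

idx <- order(years)             # positions that would sort the vector; NA goes last
years[idx]                      # sorted values, with the NA kept at the end
sort(years)                     # sort() drops NA values by default
sort(years, decreasing = TRUE)  # largest first

range(years, na.rm = TRUE)      # smallest and largest value, ignoring NA
```

Because order returns positions rather than values, the same idx can also be used to reorder the rows of a data frame that contains the vector.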
- Code used to solve this problem.
- Output from running the code.
- Answers to the 3 questions above.
TWO
- Create a new column in your data.frame that is labeled newflag which indicates if the vehicle for sale has been labeled as like new. In other words, the column newflag should be TRUE if the vehicle on that row is like new, and FALSE otherwise.
- Create a new column called pricecategory that is
  - cheap for vehicles less than or equal to $1,500
  - average for vehicles strictly more than $1,500 but less than or equal to $10,000
  - expensive for vehicles strictly more than $10,000
- How many cars are there in each of these three pricecategories?
Helpful Hint
Remember to consider any 0 values and/or NA values.
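The effect of NA and 0 values on indexing is easiest to see on a tiny example. The sketch below uses a hypothetical toyDF, and the column names condition and price are assumptions made for illustration. It shows one possible way to build a flag column and a category column by logical indexing, not the required approach.

```r
# Hypothetical toy data; the column names `condition` and `price` are assumptions.
toyDF <- data.frame(
  condition = c("like new", "good", NA, "like new", "fair"),
  price     = c(0, 1500, NA, 12000, 8000)
)

# `==` gives NA where the value is missing; %in% gives FALSE instead.
toyDF$condition == "like new"
toyDF$newflag <- toyDF$condition %in% "like new"

# Start with NA everywhere, then fill in each category by indexing.
# which() drops NA comparisons, so rows with a missing price stay NA.
toyDF$pricecategory <- NA_character_
toyDF$pricecategory[which(toyDF$price <= 1500)] <- "cheap"
toyDF$pricecategory[which(toyDF$price > 1500 & toyDF$price <= 10000)] <- "average"
toyDF$pricecategory[which(toyDF$price > 10000)] <- "expensive"

table(toyDF$pricecategory, useNA = "ifany")
```

Note that in this sketch a price of 0 lands in the cheap group. Whether that is the right treatment is a judgment call left to the analyst.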
- Code used to solve this problem.
- Output from running the code.
- The answer to the questions above.
THREE
Vectorization
Most of R’s functions are vectorized, which means that the function will be applied to all elements of a vector, without needing to loop through the elements one at a time. The most common way to access individual elements is by using the [] symbol for indexing.
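A short sketch of both ideas, using a hypothetical numeric vector x: a vectorized call compared with the equivalent loop, followed by three common forms of [] indexing.

```r
x <- c(1, 4, 9, 16)

# Vectorized: the function is applied to every element in one call.
roots <- sqrt(x)

# Equivalent explicit loop, shown only for comparison.
roots_loop <- numeric(length(x))
for (i in seq_along(x)) {
  roots_loop[i] <- sqrt(x[i])
}
identical(roots, roots_loop)   # both approaches give the same result

x[2]         # a single element, by position
x[c(1, 3)]   # several elements, by position
x[x > 5]     # elements selected by a logical condition
```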
- Using the table() function, and the column myDF$newflag, identify how many vehicles are like new and how many vehicles are not like new.
- Now using the cut function and appropriate breaks, create a new column called newpricecategory. Verify that this column is identical to the previously created pricecategory column, created in question TWO.
- Make another column called odometerage, which has values new or middle age or old, according to whether the odometer is (respectively): less than or equal to 50000; strictly greater than 50000 and less than or equal to 100000; or strictly greater than 100000. How many cars are in each of these categories?
Helpful Hint
cut(myvector, breaks = c(0, 10, 50, 200), labels = c("a", "b", "c"))
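The sketch below shows how cut assigns values to intervals, using a hypothetical vector v. Two details are worth noting: there must be one more break than there are labels, and by default each interval is open on the left and closed on the right, so a value equal to the lowest break is not included.

```r
v <- c(5, 10, 11, 50, 51, 200, NA)

# Four breaks define three intervals: (0,10], (10,50], (50,Inf]
groups <- cut(v, breaks = c(0, 10, 50, Inf), labels = c("a", "b", "c"))
groups
table(groups, useNA = "ifany")

# A value equal to the lowest break falls outside (0,10] and becomes NA ...
cut(0, breaks = c(0, 10, 50, Inf), labels = c("a", "b", "c"))
# ... unless include.lowest = TRUE is set (or the first break is below it).
cut(0, breaks = c(0, 10, 50, Inf), labels = c("a", "b", "c"), include.lowest = TRUE)
```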
- Code used to solve this problem.
- Output from running the code.
- The answer to the questions above.
FOUR
Preparing for Mapping
- Extract all of the data for indianapolis into a data.frame called myIndy
- Identify the most popular region from myDF, and extract all of the data from that region into a data.frame called popularRegion.
- Create a third data.frame with the data from a region of your choice
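Extracting the rows for one group is a direct use of logical indexing. The sketch below works on a hypothetical toyDF (its region names and prices are invented) and shows one way to pull out a named group and the most frequent group.

```r
# Hypothetical toy data; the region names are placeholders.
toyDF <- data.frame(
  region = c("north", "south", "north", "east", "north", "south"),
  price  = c(1000, 2000, 3000, 4000, 5000, 6000)
)

# Keep every column, but only the rows where the condition is TRUE.
northDF <- toyDF[toyDF$region == "north", ]

# table() counts each value; which.max() finds the largest count.
counts <- table(toyDF$region)
top    <- names(which.max(counts))
topDF  <- toyDF[toyDF$region == top, ]
```

If the column used in the condition can contain NA, wrapping the condition in which() avoids rows of NA appearing in the result.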
- Code used to solve this problem.
- Output from running the code.
- The answer to the questions above.
FIVE
Mapping
Using the R package leaflet, make 3 maps of the USA, namely, one map for the data in each of the data.frames from question FOUR.
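As a rough sketch of what a leaflet map involves, the example below plots three hypothetical points. The data frame pts and its column names lat and long are assumptions for illustration, so check the actual column names in your own data. The code builds the map only if the leaflet package is installed.

```r
# Hypothetical points; the column names `lat` and `long` are assumptions.
pts <- data.frame(
  lat  = c(39.77, 40.42, 41.88),
  long = c(-86.16, -86.91, -87.63)
)

m <- NULL
if (requireNamespace("leaflet", quietly = TRUE)) {
  m <- leaflet::leaflet(data = pts)   # start a map widget tied to the data
  m <- leaflet::addTiles(m)           # add the default background map tiles
  m <- leaflet::addCircleMarkers(m, lng = ~long, lat = ~lat, radius = 3)
}
# In a notebook, evaluating `m` on its own line displays the map.
```

Rows with missing coordinates cannot be placed on a map, so it may help to drop them before plotting.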
- Code used to solve this problem.
- Output from running the code.
- The answers to the 3 questions above.
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure that what you think you submitted was what you actually submitted. In addition, please review our submission guidelines before submitting your project.