Base R is a Thing

Getting Prepped

First Things First! Set your Working Directory

Your working directory is simply where your script will look for anything it needs like external data sets. There are a few ways to go about doing this which we will cover. However for now, just do the following:

  1. Open up a new script by going to File > New File > R Script.
  2. Save it in a preferably empty folder as whatever you want.
  3. Go to the menu bar and select Session > Set Working Directory > To Source File Location.

Download the script

Copying and pasting syntax from a browser can cause problems. To avoid this issue, please download a script with all of the needed code presented in this walkthrough.

Foundational Structures

Akin to how atoms were viewed as the building blocks of life, these are the basic structures of R.

  • vector - The very basic structure of R where one type of information can be stored. These are given by a single row or column of things that have to be the same data type.

    nothing_vector <- c()
    
    nothing_vector
    ## NULL
    number_vector <- c(1,2,3,4,5)
    
    number_vector
    ## [1] 1 2 3 4 5
    character_vector <- 
      c(
        "Oban", 
        "Macallan", 
        "Pittyvaich", 
        "Balblair", 
        "Glenmorangie"
        )
    
    character_vector
    ## [1] "Oban"         "Macallan"     "Pittyvaich"   "Balblair"     "Glenmorangie"
  • object - These are the different ways you can structure your data to do something with it. All objects are made up of one or more vectors.

Data Types

There are six types of data that R can read on its own without the help of any other piece of software. These are known as atomic vectors where atomic means the vector only holds data of a single type

  1. character: “a”, “swc”
  2. numeric: 2, 15.5
  3. integer: 2L (the L tells R to store this as an integer)
  4. logical: TRUE, FALSE
  5. complex: 1+4i (complex numbers with real and imaginary parts)1
  6. raw: 01010010

Classes

A class is the blueprint that helps to create an object. These are

  • matrix - A two-dimensional object made up of rows and columns. These are essentially a bunch of vectors stacked on top or beside each other that have to be the same data type.

    some_matrix <- matrix(c(1,1,2,3,5,8), nrow = 2, ncol = 3)
    
    some_matrix 
    ##      [,1] [,2] [,3]
    ## [1,]    1    2    5
    ## [2,]    1    3    8
  • array - A three-dimensional object made up of a bunch of matrices that have to be the same data type.

    some_array <- array(1, dim = c(2, 3, 3))
    
    some_array
    ## , , 1
    ## 
    ##      [,1] [,2] [,3]
    ## [1,]    1    1    1
    ## [2,]    1    1    1
    ## 
    ## , , 2
    ## 
    ##      [,1] [,2] [,3]
    ## [1,]    1    1    1
    ## [2,]    1    1    1
    ## 
    ## , , 3
    ## 
    ##      [,1] [,2] [,3]
    ## [1,]    1    1    1
    ## [2,]    1    1    1
  • dataframe - Similar to a matrix, BUT columns can vary by data type.

    some_df <- data.frame(col1 = 1:3, 
                     col2 = c("take", "a", "nap"), 
                     col3 = c(TRUE, FALSE, TRUE), 
                     col4 = c(1.5, 3.2, pi))
    
    some_df
    ##   col1 col2  col3     col4
    ## 1    1 take  TRUE 1.500000
    ## 2    2    a FALSE 3.200000
    ## 3    3  nap  TRUE 3.141593
  • list - A set of objects.

    some_list <-list(1,2,3,4,5)
    
    some_list
    ## [[1]]
    ## [1] 1
    ## 
    ## [[2]]
    ## [1] 2
    ## 
    ## [[3]]
    ## [1] 3
    ## 
    ## [[4]]
    ## [1] 4
    ## 
    ## [[5]]
    ## [1] 5

Example: 100% Fake Data

height <- c(145,167,176,123,150) 
weight <- c(51,63,64,40,55)

Both height and weight are called variables because they have been assigned a bunch of stuff which in this case are integers. Well we have these, might as well do something with them.

height <- c(145, 167, 176, 123, 150)
weight <- c(51, 63, 64, 40, 55)

plot(height,weight)

Well that wasn’t exciting at all was it? It’ll get interesting soon.

R provides many functions to examine features of vectors and other objects, for example

  1. class() - what kind of object is it (high-level)?
  2. typeof() - what is the object’s data type (low-level)?
  3. length() - how long is it? What about two dimensional objects?
  4. attributes() - does it have any metadata? (underlying set of data that describes and gives information about other data)
class(weight) # Yup there are numbers so numeric makes sense.
## [1] "numeric"
typeof(weight) # You see the term 'double' here but you may also see 'single' or 'integer'. All this refers to is the precision of the number and you can assume they are all integers.
## [1] "double"
length(weight) # There are 5 objects in weight so length(weight) = 5 makes sense.
## [1] 5
attributes(weight) # Null since there isn't any other information except for what you see.
## NULL

Data Frames - A Format You Absolutely Want

We want all of our data to be in this form. A data frame is a table or a two-dimensional array-like structure in which each column contains values of one variable and each row contains one set of values from each column. Characteristics of a data frame are as follows and in no particular order:

  1. The column names should be non-empty.
  2. The row names should be unique.
  3. The data stored in a data frame can be of numeric, factor or character type.
  4. Each column should contain same number of data items.

NOTE: You can always check if a variable is a data frame by is.data.frame(whatever)

Importing and Viewing Data

Read in the file USpop2010to2017.csv that has data on US population estimates from 2010 to 2017. You can do this in multiple ways but for now and since you’re getting used to menu. To bring in the file, go to the Environment tab in the upper right window and then Import Dataset > From Text (readr) > Browse to find your file and load it in. Before clicking on the Import button, name the variable population under Import options. You should get an output similar to the one shown below.

But know that we will be moving away from the idea of using a mouse/trackpad to do things because it wastes time and is generally disruptive. Instead we’ll use the file’s relative path, that is the locations of all other files in comparison to your script to bring in data. If the USpop2010to2017.csv data set is in the same folder as your script, then run the following command

population <- read.csv("USpop2010to2017.csv", 
                       header = TRUE, 
                       stringsAsFactors = TRUE) 

Now take a look at the file by running

population
##         Census_ID Collection_year Total_Pop  Male_Pop Female_Pop Estimate
## 1 POPESTIMATE2010            2010 309338421 152082993  157255428        N
## 2 POPESTIMATE2011            2011 311644280 153242210  158402070        Y
## 3 POPESTIMATE2012            2012 313993272 154452348  159540924        Y
## 4 POPESTIMATE2013            2013 316234505 155596820  160637685        Y
## 5 POPESTIMATE2014            2014 318622525 156807419  161815106        Y
## 6 POPESTIMATE2015            2015 321039839 158048153  162991686        Y
## 7 POPESTIMATE2016            2016 323405935 159243817  164162118        Y
## 8 POPESTIMATE2017            2017 325719178 160408119  165311059        Y

Well that is long and annoying. We can use the head() command to make it only display six lines.

head(population)
##         Census_ID Collection_year Total_Pop  Male_Pop Female_Pop Estimate
## 1 POPESTIMATE2010            2010 309338421 152082993  157255428        N
## 2 POPESTIMATE2011            2011 311644280 153242210  158402070        Y
## 3 POPESTIMATE2012            2012 313993272 154452348  159540924        Y
## 4 POPESTIMATE2013            2013 316234505 155596820  160637685        Y
## 5 POPESTIMATE2014            2014 318622525 156807419  161815106        Y
## 6 POPESTIMATE2015            2015 321039839 158048153  162991686        Y

or any number actually

head(population, 4)
##         Census_ID Collection_year Total_Pop  Male_Pop Female_Pop Estimate
## 1 POPESTIMATE2010            2010 309338421 152082993  157255428        N
## 2 POPESTIMATE2011            2011 311644280 153242210  158402070        Y
## 3 POPESTIMATE2012            2012 313993272 154452348  159540924        Y
## 4 POPESTIMATE2013            2013 316234505 155596820  160637685        Y

You can actually look at the entire dataset in a spreadsheet like window by using

View(population)

If you want to know more about how a data set is made up you can use the command str() or glimpse(). This essentially asks R for information about the internal structure of any set of data.

str(population)
## 'data.frame':    8 obs. of  6 variables:
##  $ Census_ID      : chr  "POPESTIMATE2010" "POPESTIMATE2011" "POPESTIMATE2012" "POPESTIMATE2013" ...
##  $ Collection_year: int  2010 2011 2012 2013 2014 2015 2016 2017
##  $ Total_Pop      : int  309338421 311644280 313993272 316234505 318622525 321039839 323405935 325719178
##  $ Male_Pop       : int  152082993 153242210 154452348 155596820 156807419 158048153 159243817 160408119
##  $ Female_Pop     : int  157255428 158402070 159540924 160637685 161815106 162991686 164162118 165311059
##  $ Estimate       : chr  "N" "Y" "Y" "Y" ...
glimpse(population)
## Rows: 8
## Columns: 6
## $ Census_ID       <chr> "POPESTIMATE2010", "POPESTIMATE2011", "POPESTIMATE201…
## $ Collection_year <int> 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017
## $ Total_Pop       <int> 309338421, 311644280, 313993272, 316234505, 318622525…
## $ Male_Pop        <int> 152082993, 153242210, 154452348, 155596820, 156807419…
## $ Female_Pop      <int> 157255428, 158402070, 159540924, 160637685, 161815106…
## $ Estimate        <chr> "N", "Y", "Y", "Y", "Y", "Y", "Y", "Y"

This may seem like a mess but it actually tells you a lot. First thing you should notice is that all of the columns are laid out into lines to be “easily” readable but the structure of the data set does not change! Here is some other important things in no particular order:

  1. We have an 8 x 5 matrix (aka a data set with 8 columns and 5 rows)
  2. It is a data frame with variables given by $. More about this below.
  3. The first column is a character vector (made up only of strings of text).
  4. The remaining columns are integer vectors (made up only of numbers in the form of integers).
  5. You may be wondering why we skipped over the factors and levels. That’s a bit complicated for this session but we will get to it because they play an important role in visualizing the data.

Looking at Variables in a Data Frame

To call a variable in the dataframe, we use the $ notation.

population$Census_ID
## [1] "POPESTIMATE2010" "POPESTIMATE2011" "POPESTIMATE2012" "POPESTIMATE2013"
## [5] "POPESTIMATE2014" "POPESTIMATE2015" "POPESTIMATE2016" "POPESTIMATE2017"

Now try to use the same method for the other two columns.

Solutions
population$Collection_year
## [1] 2010 2011 2012 2013 2014 2015 2016 2017
population$Total_Pop
## [1] 309338421 311644280 313993272 316234505 318622525 321039839 323405935
## [8] 325719178
population$Male_Pop
## [1] 152082993 153242210 154452348 155596820 156807419 158048153 159243817
## [8] 160408119
population$Female_Pop
## [1] 157255428 158402070 159540924 160637685 161815106 162991686 164162118
## [8] 165311059

Assigning Variables

Now let’s call these something because writing those out over and over is annoying

census <- population$Census_ID

census
## [1] "POPESTIMATE2010" "POPESTIMATE2011" "POPESTIMATE2012" "POPESTIMATE2013"
## [5] "POPESTIMATE2014" "POPESTIMATE2015" "POPESTIMATE2016" "POPESTIMATE2017"

or to do this quicker, use a semicolon ;.

census <- population$Census_ID; census
## [1] "POPESTIMATE2010" "POPESTIMATE2011" "POPESTIMATE2012" "POPESTIMATE2013"
## [5] "POPESTIMATE2014" "POPESTIMATE2015" "POPESTIMATE2016" "POPESTIMATE2017"

Now try to use the same method for the other columns by using the variables year, total, male and female, respectively.

Solutions
year <- population$Collection_year; year
## [1] 2010 2011 2012 2013 2014 2015 2016 2017
total <- population$Total_Pop; total
## [1] 309338421 311644280 313993272 316234505 318622525 321039839 323405935
## [8] 325719178
male <- population$Male_Pop; male
## [1] 152082993 153242210 154452348 155596820 156807419 158048153 159243817
## [8] 160408119
female <- population$Female_Pop; female
## [1] 157255428 158402070 159540924 160637685 161815106 162991686 164162118
## [8] 165311059

Basic Visualizations

Let’s plot these

plot(year, total)
# Notice that the x-axis variable goes first!

OK well that’s better but it looks funny - the y-axis looks like the output on a mid 90s calculator that couldn’t handle exponents. So we’re going to divide and round. Since everything is in hundreds of millions in our data set, let’s divide each column by one hundred million but keep the original figures.

# We don't divide anything in census because it just contains the original variables
census <- population$Census_ID 

# We don't want to divide anything here because this column has data on years
year <- population$Collection_year 

# These are the actual data so that's where we'll reduce. 
tinytotal <- population$Total_Pop/100000000 
tinymale <- population$Male_Pop/100000000
tinyfemale <- population$Female_Pop/100000000

Now that these are proportionally smaller, let’s hope the visuals are more understandable

plot(year, tinytotal) 

Looks much better but let’s change the axis labels to be more descriptive.

plot(year, tinytotal, 
     xlab = "Years", 
     ylab = "Number of People (in hundreds of millions)") 

That’s good but the size of the labels should be bigger.

plot(year, tinytotal, 
     xlab = "Years", 
     ylab = "Number of People (in hundreds of millions)", 
     cex.lab = 1.5) 

Well its bland so how about a little color?

plot(year, tinytotal, 
     xlab = "Years", 
     ylab = "Number of People(in hundreds of millions)", 
     cex.lab = 1.5, 
     col.lab="#ee4035", 
     col="#7bc043")

Sure but those dots are difficult to see. Let’s make them different.

plot(year, tinytotal, 
     xlab = "Years", 
     ylab = "Number of People(in hundreds of millions)", 
     cex.lab = 1.5, 
     col.lab="#ee4035",
     col="#7bc043", 
     pch = 19)

No bigger!

plot(year, tinytotal, 
     xlab = "Years", 
     ylab = "Number of People (in hundreds of millions)", 
     cex.lab = 1.5, 
     col.lab="#ee4035",
     col="#7bc043", 
     pch = 19, 
     cex = 1.5)

I said bigger dammit!

plot(year, tinytotal, 
     xlab = "Years", 
     ylab = "Number of People (in hundreds of millions)", 
     cex.lab = 1.5, 
     col.lab="#ee4035",
     col="#7bc043", 
     pch = 19, 
     cex = 2)

And without that annoying box

plot(year, tinytotal, 
     xlab = "Years", 
     ylab = "Number of People (in hundreds of millions)", 
     cex.lab = 1.5, 
     col.lab="#ee4035",
     col="#7bc043", 
     pch = 19, 
     cex = 2,
     bty = "n")

Maybe a colorful title?

plot(year, tinytotal, 
     xlab = "Years", 
     ylab = "Number of People (in hundreds of millions)", 
     cex.lab = 1.5, 
     col.lab="#ee4035",
     col="#7bc043", 
     pch = 19, 
     cex = 2,
     bty = "n",
     main = "United States Population Estimates (2010-2017)", 
     cex.main = 1.5,
     col.main = "#0392cf")

And a line

plot(year, tinytotal, 
     xlab = "Years", 
     ylab = "Number of People (in hundreds of millions)", 
     cex.lab = 1.5, 
     col.lab="#ee4035",
     col="#7bc043", 
     pch = 19, 
     cex = 2,
     bty = "n",
     main = "United States Population Estimates (2010-2017)", 
     cex.main = 1.5,
     col.main = "#0392cf",
     type = "o")

No that’s too skinny!

plot(year, tinytotal, 
     xlab = "Years", 
     ylab = "Number of People (in hundreds of millions)", 
     cex.lab = 1.5, 
     col.lab="#ee4035",
     col="#7bc043", 
     pch = 19, 
     cex = 2,
     bty = "n",
     main = "United States Population Estimates (2010-2017)", 
     cex.main = 1.5,
     col.main = "#0392cf",
     type = "o",
     lwd = 3.5)

That’ll do for now but that is a ton of work and a lot of not so obvious commands to memorize. So what’s the big deal?

A Bit of Context

Well what you just went through is the OG R which was released publicly in 1995. While nowadays it is commonly known as Base R, a majority of its functionality, structure and syntax were derived from another piece of software called S-Plus which was a commercial product from 1988 primarily used only by statisticians. Regardless of the name, this would continue to be the case until the 2007 when Hadley Wickham officially released an R add-on, or package called ggplot2. Little did anybody know that this would alter the future of data visualization. Starting next week, we’ll move on from Base R to ggplot and its companion packages known as the tidyverse. If you want to get a preview, click on the image below

Tidyverse Hex


Now that’s not to say there isn’t value in knowing Base R. There are many commands that are quick and easy and don’t need add-ons. In fact, some of the syntax has survived the transition to the tidy format we’ll get to next week so for now, back to Base R!

Basic Visualizations

Histograms

For this part, we will use data on the number of users on facebook. Lets read the data into R like so

facebook <- read.csv("ActiveFacebookUsers.csv", 
                       header = TRUE, 
                       stringsAsFactors = TRUE) 

Take a look at the first six rows and get some information about the columns

head(facebook)
##   Quarter Number.of.users.in.millions
## 1  Q3 '08                         100
## 2  Q1 '09                         197
## 3  Q2 '09                         242
## 4  Q3 '09                         305
## 5  Q4 '09                         360
## 6  Q1 '10                         431
str(facebook) 
## 'data.frame':    39 obs. of  2 variables:
##  $ Quarter                    : chr  "Q3 '08" "Q1 '09" "Q2 '09" "Q3 '09" ...
##  $ Number.of.users.in.millions: int  100 197 242 305 360 431 482 550 608 680 ...
# Alternatively, you can use glimpse()
facebook$Quarter
##  [1] "Q3 '08" "Q1 '09" "Q2 '09" "Q3 '09" "Q4 '09" "Q1 '10" "Q2 '10" "Q3 '10"
##  [9] "Q4 '10" "Q1 '11" "Q2 '11" "Q3 '11" "Q4 '11" "Q1 '12" "Q2 '12" "Q3 '12"
## [17] "Q4 '12" "Q1 '13" "Q2 '13" "Q3 '13" "Q4 '13" "Q1 '14" "Q2 '14" "Q3 '14"
## [25] "Q4 '14" "Q1 '15" "Q2 '15" "Q3 '15" "Q4 '15" "Q1 '16" "Q2 '16" "Q3 '16"
## [33] "Q4 '16" "Q1 '17" "Q2 '17" "Q3 '17" "Q4 '17" "Q1 '18" "Q2 '18"
facebook$Number.of.users.in.millions
##  [1]  100  197  242  305  360  431  482  550  608  680  739  800  845  901  955
## [16] 1007 1056 1110 1155 1189 1228 1276 1317 1350 1393 1441 1490 1545 1591 1654
## [31] 1712 1788 1860 1936 2006 2072 2129 2196 2234

Let’s rename the columns

users <- facebook$Number.of.users.in.millions
quarter <- facebook$Quarter

We’re going to use the hist() function to plot a basic histogram of the number of users and quarters

hist(users) 

# Hey its a normal distribution!

We can specify the number of cells for the histogram using the breaks option

hist(users,
     breaks = 40) 

This essentially breaks up the default way that the histogram is constructed. You can also do it within the range of the data set. First to see the minimum and maximum, we use min() and max(), respectively.

min(users) 
## [1] 100
max(users) 
## [1] 2234

So if I wanted to see how this histogram looks say broken into bars of 300, we can do

(max(users) - min(users))/300
## [1] 7.113333

or better yet if we round up, we’ll get the actual number of bars we need. To always round to the next integer, we can use the ceiling() function

ceiling((max(users) - min(users))/300)
## [1] 8

Now you can count in 300s and get the output you want

hist(users, 
     breaks = c(100,400,700,1000,1300,1600,1900,2100,2400))

but that’s a lot of work and its a bit odd because our data aren’t greater than the the maximum. We can still get the eight bars we want of even width if we let R do it for us using the seq, or sequence option with length.out, or the number of desired bars

hist(users, 
     breaks = seq(min(users), 
                  max(users),  
                  length.out=8))

Let’s make it look decent

hist(users, breaks = seq(min(users), 
                         max(users),   
                         length.out=8), 
     xlab = "Active Users (in millions)", 
     main = "Facebook utility worldwide as of 2nd quarter 2018",
     col = c("#3b5998","#6d84b4", "#afbdd4", "#d8dfea", "#ffffff"), 
     border = "#d8dfea", 
     cex.lab = 1.1) 

Notice that you can define colors in a vector and the plot will simply repeat that layout.

Line Graph2

Let’s look at another population, but this time we’ll use Twitter data

twitter <- read.csv("ActiveTwitterUsers.csv", 
                       header = TRUE, 
                       stringsAsFactors = TRUE) 
head(twitter)
##   Quarter Number.of.monthly.active.users.in.millions
## 1   20101                                         30
## 2   20102                                         40
## 3   20103                                         49
## 4   20104                                         54
## 5   20111                                         68
## 6   20112                                         85
str(twitter)
## 'data.frame':    37 obs. of  2 variables:
##  $ Quarter                                   : int  20101 20102 20103 20104 20111 20112 20113 20114 20121 20122 ...
##  $ Number.of.monthly.active.users.in.millions: num  30 40 49 54 68 85 101 117 138 151 ...
twitter$Quarter
##  [1] 20101 20102 20103 20104 20111 20112 20113 20114 20121 20122 20123 20124
## [13] 20131 20132 20133 20134 20141 20142 20143 20144 20151 20152 20153 20154
## [25] 20161 20162 20163 20164 20171 20172 20173 20174 20181 20182 20183 20184
## [37] 20191
twitter$Number.of.monthly.active.users.in.millions
##  [1]  30.0  40.0  49.0  54.0  68.0  85.0 101.0 117.0 138.0 151.0 167.0 185.0
## [13] 204.0 218.0 231.7 241.0 255.0 271.0 284.0 288.0 302.0 304.0 307.0 305.0
## [25] 310.0 313.0 317.0 318.0 327.0 326.0 330.0 330.0 336.0 335.0 326.0 321.0
## [37] 330.0

Let’s rename the columns

tweeters <- twitter$Number.of.monthly.active.users.in.millions
season <- twitter$Quarter

Let’s first create a basic plot

plot(season, tweeters)

There are several types of plot within the plot function which we can access by using the type option

plot(season, tweeters,
     type = "l")

Now try options “o”, “p”, and “b” on your own. Check the solutions below to see if your plot rendered correctly

Solutions
plot(season, tweeters,
     type = "o")

plot(season, tweeters,
     type = "p")

plot(season, tweeters,
     type = "b")

We can also change the line type using the lty option

plot(season, tweeters, 
     type = "l", 
     lty = "dashed")

plot(season, tweeters, 
     type = "l", 
     lty = "dotted")

Well it looks like the solid line is likely the best so we’ll stick with it.

plot(season, tweeters,
     type = "l")

Black and white plots can be great but that’s rarely true for data with ore information than two dimensions.

plot(season, tweeters,
     type = "l",
     col = "#5cb85c")

You can also use a standard palette in R by using numbers

plot(season, tweeters,
     type = "l",
     col = 3)

That line is just too thin. We can make it thicker by using the lwd option

plot(season, tweeters,
     col = "#5cb85c", 
     lwd = 3)

Finally, we can sort out the axes and plot titles

plot(season, tweeters, 
     type = "l", 
     col = "#5cb85c", 
     lwd = 3,
     xlab = "Quarter",
     ylab = "Tweeters (in millions)",
     main = "Active Users on Twitter")

I did promise regression didn’t I? OK well what if we wanted to know the answer to the question: is the Twitter population increasing in size? Well we can add a basic linear regression to the plot to estimate if that’s true. First let’s calculate the regression line…

fit1 <- lm (tweeters ~ season, 
            data = twitter) 

summary(fit1)
## 
## Call:
## lm(formula = tweeters ~ season, data = twitter)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -72.465 -33.476  -3.424  32.275  54.490 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -7.218e+04  4.468e+03  -16.15   <2e-16 ***
## season       3.595e+00  2.218e-01   16.21   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36 on 35 degrees of freedom
## Multiple R-squared:  0.8824, Adjusted R-squared:  0.879 
## F-statistic: 262.6 on 1 and 35 DF,  p-value: < 2.2e-16
plot(fit1)

Scatterplots

Let’s load a dataset of Flower characteristics in 3 species of Iris.

data(iris)
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

There is a lot of data here! Let’s explore using the pairs() function

pairs(iris)

This doesn’t tell us much about the species differences. We can tell R to plot using a different color for the three species of iris

pairs(iris, 
      col = iris$Species)

Sepal.Length and Petal.Length look interesting! Let’s start by looking at that

plot(iris$Sepal.Length, 
     iris$Petal.Length, 
     col = iris$Species)

These points are difficult to see! Let’s pick some different ones using the pch option

plot(iris$Sepal.Length, 
     iris$Petal.Length, 
     col = iris$Species, 
     pch = 15)

plot(iris$Sepal.Length, 
     iris$Petal.Length, 
     col = iris$Species, 
     pch = "A")

pch 21:25 also specify an edge color (col) and a background color (bg)

plot(iris$Sepal.Length, 
     iris$Petal.Length, 
     col = iris$Species, 
     pch = 21, 
     bg = "blue")

OK settling on solid circles…

plot(iris$Sepal.Length, 
     iris$Petal.Length, 
     col = iris$Species, 
     pch = 16)

…that need to be bigger

plot(iris$Sepal.Length, 
     iris$Petal.Length, 
     col = iris$Species, 
     pch = 16,
     cex = 2)

Boxplots

We’ll continue to use the Iris dataset for this section

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Lets first examine the distribution of Sepal Length for each species

boxplot(iris$Sepal.Length ~ iris$Species)

f you wish to compare the medians of the boxplot, you can use the function “notch”. If the notches of two plots do not overlap, this is ‘strong evidence’ that the two medians differ.

boxplot(iris$Sepal.Length ~ iris$Species, 
        notch = T)

You may have noticed that the y-axis labels are always orientated to be perpendicular to the axis. We can rotate all axis labels using the las option.

boxplot(iris$Sepal.Length ~ iris$Species, 
        notch = T,
        las = 1) 

Now we can add in all the axis and plot labels…

boxplot(iris$Sepal.Length ~ iris$Species, 
        notch = T,
        las = 1,
        xlab = "Species", 
        ylab = "Sepal Length", 
        main = "Sepal Length by Species in Iris") 

…and change sizes

boxplot(iris$Sepal.Length ~ iris$Species, 
        notch = T,
        las = 1,
        xlab = "Species", 
        ylab = "Sepal Length", 
        main = "Sepal Length by Species in Iris",
        cex.lab = 1.5,
        cex.axis = 1.5,
        cex.main = 2) 

Plot Options

Rendering more than one plot in a window

# number of rows, number of columns
par(mfrow=c(1,2))

plot(
     # x variable, y variable
     iris$Sepal.Length, iris$Petal.Length, 
     
      # color by species
     col = iris$Species,     
     
     # plot title
     main = "Sepal vs Petal Length in Iris"
     )      

plot(
    # x variable, y variable
     iris$Sepal.Length, iris$Petal.Length, 
     
      # color by species
     col = iris$Species,     
     
     # plot title
     main = "Sepal vs Petal Length in Iris"
     )        

Adding the following sets the plot window back to normal

par(mfrow=c(1,1))    

plot(
     # x variable, y variable
     iris$Sepal.Length, iris$Petal.Length, 
     
      # color by species
     col = iris$Species,     
     
     # plot title
     main = "Sepal vs Petal Length in Iris"
     )   

or you can just clear the plot history by running

dev.off()       
## null device 
##           1

  1. Remember Algebra?↩︎

  2. Now with 22% more linear regression↩︎