1 Introduction

This RMarkdown document is part of the Generic Skills Component (GSK) of the Foundation Studies Programme at Srishti Manipal Institute of Art, Design, and Technology, Bangalore, India. The material is based on A Layered Grammar of Graphics by Hadley Wickham. The course is meant for First Year students pursuing a Degree in Art and Design.

The intent of this GSK part is to build skill in coding in R, and also to appreciate R as a way to metaphorically visualize information of various kinds, using predominantly geometric figures and structures.

All RMarkdown files combine code, text, web-images, and figures developed using code. Everything is text; code chunks are enclosed in fences (```)

2 Goals

  • Understand different kinds of data variables
  • Appreciate how they can be identified based on the Interrogative Pronouns they answer to
  • Understand how each kind of variable lends itself to a specific geometric aspect in the data visualization.
  • Understand how to ask Questions of Data to develop Visualizations

3 Pedagogical Note

The method followed will be based on PRIMM:

  • PREDICT: Inspect the code and guess what it might do; write down your predictions.
  • RUN: Run the code provided and check what happens.
  • INFER: Work out what the parameters of the code do and write comments to explain. What bells and whistles can you see?
  • MODIFY: Change the parameters of the code provided to understand the options available. Write comments to show what you have aimed for and achieved.
  • MAKE: Take an idea/concept of your own, and graph it.

3.1 Set Up

The setup code chunk below brings into our coding session R packages that provide specific computational abilities and also datasets which we can use.

To reiterate: Packages and datasets are not the same thing!! Packages are (small) collections of programs. Datasets are just… information.

4 Packages needed

knitr::opts_chunk$set(echo = TRUE,warning = TRUE)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(palmerpenguins)

5 Introduction

In this RMarkdown document, we try to connect story-making questions with two ideas:

  1. a Variable in a dataset
  2. A computed Quantity / Descriptive Statistic or a Visual, based on one or more Variables

So: a question identifies a variable, and a question also leads to a Computation or a Data Visualization. The idea is to build intuition about data: iteratively ask questions, form hypotheses, and perform Exploratory Data Analysis (EDA) using graphs and charts in R.

At some point we may find that the data is not adequate to prove/disprove a particular hypothesis, and that we need to get into further research / experimental design. It is possible to design the research experiments in R as well, but we may cover that much later.

In the following:

Wherever you see YOUR TURN, please respond with explanations, more questions and, if you are already confident, code chunks to create new calculations and graphs. This will be one of your submissions for this module, on Teams!

6 Interrogative Pronouns for Data Variables

So how do we ask questions? In English we usually ask them with interrogative pronouns: What? Who? Where? Which? What Kind? How? and so on.

6.1 The penguins dataset

names(penguins) # Column, i.e. Variable names
## [1] "species"           "island"            "bill_length_mm"   
## [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
## [7] "sex"               "year"
head(penguins) # first six rows
tail(penguins) # Last six rows
dim(penguins) # Size of dataset
## [1] 344   8
# Check for missing data
any(is.na(penguins)) # is.na() already returns TRUE/FALSE, so no "== TRUE" is needed
## [1] TRUE
  1. What are the variable names()?
  2. What would be the Question you might have asked to obtain each of the variables?
  3. What further questions/meta questions would you ask to “process” that variable? (Hint: Add another word after any of the Interrogative Pronouns, e.g. How…MANY?)
  4. Where might the answers take your story?

6.1.1 YOUR TURN-1

State a few questions after discussion with your friend and state possible variables, or what you could DO with the variables, as an answer.
E.g. Q. How many penguins? A. We need to count…rows?
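
As a minimal sketch of the example question above: counting rows is one way to answer "How many penguins?". It assumes the palmerpenguins package is installed and that each row stands for one penguin; the result agrees with the dim() output shown earlier.

```r
library(palmerpenguins)

# Each row is one penguin, so the number of rows answers "How many penguins?"
nrow(penguins)
## [1] 344
```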

6.2 Pronouns and Variables

In the Table below, we have a rough mapping of interrogative pronouns to the kinds of variables in the data:

| Pronoun | Answer | Variable / Scale | Example | What Operations? |
|---|---|---|---|---|
| What, Who, Where, Whom, Which | Name, Place, Animal, Thing | Qualitative / Nominal | Name | Count no. of cases; Mode |
| How, What Kind, What Sort | A Manner / Method, Type or Attribute from a list, with list items in some "order" (e.g. good, better, improved, best..) | Qualitative / Ordinal | Socioeconomic status ("low-income", "middle-income", "high-income"); Education level ("high school", "BS", "MS", "PhD"); Income level ("less than 50K", "50K-100K", "over 100K"); Satisfaction rating ("extremely dislike", "dislike", "neutral", "like", "extremely like") | Median; Percentiles |
| How Many / Much / Heavy? Few? Seldom? Often? When? | Quantities with Scale. Differences are meaningful, but not products or ratios | Quantitative / Interval | pH; SAT score (200-800); Credit score (300-850); Year of Starting in College | Mean; Standard Deviation |
| How Many / Much / Heavy? Few? Seldom? Often? When? | Quantities, with Scale and a Zero Value. Differences and Ratios / Products are meaningful (e.g. Weight) | Quantitative / Ratio | Weight; Length; Height; Temperature in Kelvin; Enzyme activity, dose amount, reaction rate, flow rate, concentration; Pulse; Survival time | Correlation; Coeff of Variation |

As you go from Qualitative to Quantitative data types in the table, I hope you can detect a movement from fuzzy groups/categories to more and more crystallized numbers. Each variable/scale can be subjected to the operations of the previous group. In the words of S.S. Stevens (https://psychology.okstate.edu/faculty/jgrice/psyc3214/Stevens_FourScales_1946.pdf):

the basic operations needed to create each type of scale is cumulative: to an operation listed opposite a particular scale must be added all those operations preceding it.

Do think about this as you work with data.
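
The four kinds of scale can be sketched in R with a few made-up values. This is only an illustration: the vectors below are invented for the example, and the Celsius readings are an assumed stand-in for an interval scale (the table itself lists Kelvin as a ratio scale).

```r
# Nominal: categories with no inherent order; counting cases is meaningful
island <- factor(c("Biscoe", "Dream", "Torgersen", "Dream"))
table(island)

# Ordinal: categories with an order, declared through `levels` and `ordered`
education <- factor(c("BS", "PhD", "high school", "MS"),
                    levels = c("high school", "BS", "MS", "PhD"),
                    ordered = TRUE)
education[1] < education[2]   # BS < PhD, so comparisons make sense
## [1] TRUE
max(education)                # the highest level present

# Interval vs ratio: differences work on both, ratios only on a ratio scale
temp_c <- c(10, 20)           # Celsius: zero is arbitrary (interval)
temp_k <- temp_c + 273.15     # Kelvin: zero is absolute (ratio)
diff(temp_c)                  # a 10 degree difference on either scale
temp_c[2] / temp_c[1]         # gives 2, but 20 C is not "twice as hot" as 10 C
temp_k[2] / temp_k[1]         # the physically meaningful ratio, about 1.035
```

Note how each step down the list keeps the operations of the step before it: ordered factors can still be counted with table(), and numbers can still be compared with <.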

Do take a look at these references:

  1. https://stats.idre.ucla.edu/other/mult-pkg/whatstat/what-is-the-difference-between-categorical-ordinal-and-interval-variables/
  2. https://www.freecodecamp.org/news/types-of-data-in-statistics-nominal-ordinal-interval-and-ratio-data-types-explained-with-examples/

6.3 The mpg dataset

names(mpg) # Column, i.e. Variable names
##  [1] "manufacturer" "model"        "displ"        "year"         "cyl"         
##  [6] "trans"        "drv"          "cty"          "hwy"          "fl"          
## [11] "class"
head(mpg) # first six rows
tail(mpg) # Last six rows
dim(mpg) # Size of dataset
## [1] 234  11
# Check for missing data
any(is.na(mpg)) # is.na() already returns TRUE/FALSE, so no "== TRUE" is needed
## [1] FALSE

6.3.1 YOUR TURN-2

Look carefully at the variables here. How would you interpret, say, the cyl variable? Is it a number and therefore Quantitative, or could it be something else?
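
One possible way to start exploring this question (a sketch, not the answer) is to look at which values cyl actually takes and to try treating it as a category with factor(). The mpg dataset comes with ggplot2, which library(tidyverse) loads.

```r
library(tidyverse)

# Which distinct values does cyl take?
sort(unique(mpg$cyl))
## [1] 4 5 6 8

# How many cars have each number of cylinders?
mpg %>% count(cyl)

# Treating cyl as a category: one bar per distinct value
ggplot(mpg) + geom_bar(aes(x = factor(cyl)))
```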

7 Interrogations and Graphs

We can also respond to (more complex) Questions with not just a variable, but one of two things:

  • A calculation, shown in a table
  • a data visualization. This visualization can even involve more than one variable, as we will see.

What sort of calculations and visual charts can we create with different kinds of variables, taken singly or together? Let us write some simple English descriptions of measures and visuals and see what commands they use in R.

Here we will use the Grammar of a package called ggplot2 (loaded as part of the tidyverse), which we will encounter in Lab:04. Let us go with our intuition with the code in the following sections.

Note: since we saw a couple of missing entries in the penguins dataset, let us remove them for now.

penguins <- penguins %>% drop_na()
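
To see what drop_na() does here, the sketch below counts missing values per column and compares row counts before and after. It reads the data as palmerpenguins::penguins so that it works even if penguins has already been overwritten in your session; raw and penguins_clean are hypothetical names used only for this illustration.

```r
library(tidyverse)
library(palmerpenguins)

# Take the untouched copy straight from the package
raw <- palmerpenguins::penguins

# Number of missing values in each column
colSums(is.na(raw))

# drop_na() removes every row that has at least one missing value
penguins_clean <- drop_na(raw)

nrow(raw)             # rows before
## [1] 344
nrow(penguins_clean)  # rows after
## [1] 333
```

Because a whole row is dropped whenever any single column is missing, the number of rows lost can be larger than the number of missing entries in any one column.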

7.1 Single Qualitative/Categorical/ Nominal Variable

  1. Questions: Which? What Kind? How? How many of each Kind?
    • Island (Which island?)
    • Species (Which Species?)
  2. Calculations: No. of levels / Counts for each level
    • count / tally of no. of penguins on each island or in each species
    • sort and order by island or species
  3. Charts: Bar Chart / Pie Chart / Tree Map
    • geom_bar / geom_bar + coord_polar() / Find out!!

penguins %>% count(species)
ggplot(penguins) + geom_bar(aes(x = island))

ggplot(penguins) + geom_bar(aes(x = sex))
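
The "sort and order" calculation and the geom_bar + coord_polar() pie chart listed above could look like the sketch below. The name penguins_clean is hypothetical, and the pie chart recipe (one stacked bar wrapped around a circle) is one common approach rather than the only one.

```r
library(tidyverse)
library(palmerpenguins)

penguins_clean <- drop_na(penguins)  # hypothetical name for the cleaned data

# Counts per island, largest first
island_counts <- penguins_clean %>% count(island, sort = TRUE)
island_counts

# Pie chart: a single stacked bar (x = ""), bent into a circle by coord_polar()
ggplot(penguins_clean) +
  geom_bar(aes(x = "", fill = island), width = 1) +
  coord_polar(theta = "y")
```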

7.1.1 YOUR TURN-3

7.2 Single Quantitative Variable

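  1. Questions: How many? How few? How often? How much?
  2. Calculations: max / min / mean / mode / (units)
    • max(), min(), range(), mean(), summary()
    • Note: base R's mode() reports how an object is stored (e.g. "numeric"), not its most frequent value; the statistical mode has to be computed, for instance from a table() of counts.
  3. Charts: Bar Chart / Histogram / Density
    • geom_histogram() / geom_density()
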
max(penguins$bill_length_mm)
## [1] 59.6
range(penguins$bill_length_mm, na.rm = TRUE)
## [1] 32.1 59.6
summary(penguins$flipper_length_mm)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     172     190     197     201     213     231
ggplot(penguins) + geom_density(aes(bill_length_mm))

ggplot(penguins) + geom_histogram(aes(x = bill_length_mm))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
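
Two follow-ups to the code above, as a hedged sketch. First, stat_mode() is a hypothetical helper (not part of base R or the tidyverse) that returns the most frequent value. Second, the stat_bin() message goes away once binwidth is chosen explicitly; the 1 mm width used here is just an assumption to experiment with.

```r
library(tidyverse)
library(palmerpenguins)

penguins_clean <- drop_na(penguins)  # hypothetical name for the cleaned data

# Hypothetical helper: the most frequent value (statistical mode).
# If several values tie, the one that appears first in x is returned.
stat_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

stat_mode(c(1, 2, 2, 3))
## [1] 2
stat_mode(penguins_clean$flipper_length_mm)

# Histogram with an explicit bin width of 1 mm
ggplot(penguins_clean) +
  geom_histogram(aes(x = bill_length_mm), binwidth = 1)
```

Try a few different binwidth values: very narrow bins show noise, very wide bins hide the shape.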

7.2.1 YOUR TURN-4

Are all the above Quantitative variables ratio variables? Justify.

7.3 Two Variables: Quantitative vs Quantitative

We can easily extend our intuition about one quantitative variable, to a pair of them. What Questions can we ask?

  1. Questions: How many of this vs How many of that? Does this depend upon that? How are they related? (Remember \(y = mx + c\) and friends?)

  2. Calculations: Correlation / Covariance / T-test for Two Means / Chi-Square Test etc. We won’t go into this here!

  3. Charts: Scatter Plot / Line Plot / Regression i.e. best fit lines

cor(penguins$bill_length_mm, penguins$bill_depth_mm)
## [1] -0.2286256
ggplot(penguins) +
  geom_point(aes(x = flipper_length_mm,
                 y = body_mass_g))

ggplot(penguins) +
  geom_point(aes(x = flipper_length_mm, 
                 y = bill_length_mm))
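
The "best fit line" idea (the y = mx + c mentioned above) can be sketched as follows. Here lm() estimates the intercept c and slope m, and geom_smooth(method = "lm") draws the same straight line over the scatter plot. The choice of flipper length and body mass, and the names penguins_clean and fit, are assumptions for the example.

```r
library(tidyverse)
library(palmerpenguins)

penguins_clean <- drop_na(penguins)  # hypothetical name for the cleaned data

# Straight-line model: body_mass_g = c + m * flipper_length_mm
fit <- lm(body_mass_g ~ flipper_length_mm, data = penguins_clean)
coef(fit)  # first value is the intercept (c), second is the slope (m)

# Scatter plot with the fitted straight line drawn on top
ggplot(penguins_clean, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
```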

7.3.1 YOUR TURN-5

7.4 Two Variables: Categorical vs Categorical

What sort of question could we ask that involves two categorical variables?

  1. Questions: How Many of this Kind (~x) are How Many of that Kind (~y)?

  2. Calculations: Counts and Tallies sliced by Category

    • count, tally
  3. Charts: Stacked Bar Charts / Grouped Bar Charts / Segmented Bar Chart / Mosaic Chart

    • geom_bar()
    • Use the second Categorical variable to modify fill, color.
    • Also try to vary the parameter position of the bars.

ggplot(penguins) + geom_bar(aes(x = island, 
                                fill = species),
                            position = "stack")

Storyline: Three penguins. And you are three too (Oh never mind!)
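
As a sketch of the two suggestions above, the code below tallies penguins by island and species, then redraws the bar chart with two other values of position. The name penguins_clean is hypothetical.

```r
library(tidyverse)
library(palmerpenguins)

penguins_clean <- drop_na(penguins)  # hypothetical name for the cleaned data

# Counts sliced by two categories: one row per island/species pair that occurs
counts <- penguins_clean %>% count(island, species)
counts

# Grouped bars: species side by side within each island
ggplot(penguins_clean) +
  geom_bar(aes(x = island, fill = species), position = "dodge")

# Segmented bars: each island scaled to 1, showing proportions instead of counts
ggplot(penguins_clean) +
  geom_bar(aes(x = island, fill = species), position = "fill")
```

The table has only five rows, not nine, because not every species lives on every island.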

7.4.1 YOUR TURN-6

7.5 Two Variables: Quantitative vs Qualitative

Finally, what if we want to look at Quant variables and Qual variables together? What questions could we ask?

  1. Questions: How much of this is Which Kind of that? How many vs Which? How many vs How?

  2. Calculations: Counts, Means, Ranges etc., grouped by Categorical variable.

  3. Charts: Bar Chart using group / density plots by group / violin plots by group / box plots by group
    • geom_bar / geom_density / geom_violin / geom_boxplot using Categorical Variable for grouping

ggplot(penguins) + 
    geom_density(aes(x = body_mass_g, 
                 color = island, 
                 fill = island), 
                 alpha = 0.3)

ggplot(penguins) + 
  geom_histogram(aes(x = flipper_length_mm,
                 fill = sex))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
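
The grouped calculations mentioned above (counts, means, ranges by category) and one more of the listed chart types can be sketched like this. The names penguins_clean and mass_by_island are hypothetical, and body mass by island is just one possible pairing of a Quant and a Qual variable.

```r
library(tidyverse)
library(palmerpenguins)

penguins_clean <- drop_na(penguins)  # hypothetical name for the cleaned data

# Count, mean and range of body mass, computed separately for each island
mass_by_island <- penguins_clean %>%
  group_by(island) %>%
  summarise(n = n(),
            mean_mass = mean(body_mass_g),
            min_mass = min(body_mass_g),
            max_mass = max(body_mass_g),
            .groups = "drop")
mass_by_island

# Box plots by group: one box per island
ggplot(penguins_clean) +
  geom_boxplot(aes(x = island, y = body_mass_g, fill = island))
```

Swapping geom_boxplot() for geom_violin() with the same aes() gives the violin plot version.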

7.5.1 YOUR TURN-7

7.5.2 Time to Play

  1. Create a fresh RMarkdown document and similarly analyse two of the following datasets

8 References

  1. Robert Kabacoff, Data Visualization with R (Good crisp descriptions of many kinds of graphs; a no-nonsense book. Available free on the web.)
  2. Wickham and Grolemund, R for Data Science (R Bible. Available free on the web.)
  3. Hans Rosling, The best stats you’ve ever seen

Ask me for help any time!

