Abstract
Part of my online course R for Artists and Designers, to teach R using Metaphors and Code.
At the end of this Lab session, we will be able to:
- use the tsibble data format
TBW. To be written up.
Let us inspect what datasets are available in the package timetk. Type data(package = "timetk") in your Console to see them.
Let us choose the Walmart Sales dataset. See here for more details: Walmart Recruiting - Store Sales Forecasting | Kaggle
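The glimpse output below can be produced with a few lines; this is a minimal sketch, assuming the dataset in use here is walmart_sales_weekly, which ships with timetk:

```r
library(timetk)
library(dplyr)

# Load the Walmart weekly sales dataset shipped with timetk
data("walmart_sales_weekly")

# Quick look at the structure of the data
glimpse(walmart_sales_weekly)
```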
## Rows: 1,001
## Columns: 17
## $ id <fct> 1_1, 1_1, 1_1, 1_1, 1_1, 1_1, 1_1, 1_1, 1_1, 1_1, 1_1, 1_…
## $ Store <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Dept <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Date <date> 2010-02-05, 2010-02-12, 2010-02-19, 2010-02-26, 2010-03-…
## $ Weekly_Sales <dbl> 24924.50, 46039.49, 41595.55, 19403.54, 21827.90, 21043.3…
## $ IsHoliday <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
## $ Type <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A…
## $ Size <dbl> 151315, 151315, 151315, 151315, 151315, 151315, 151315, 1…
## $ Temperature <dbl> 42.31, 38.51, 39.93, 46.63, 46.50, 57.79, 54.58, 51.45, 6…
## $ Fuel_Price <dbl> 2.572, 2.548, 2.514, 2.561, 2.625, 2.667, 2.720, 2.732, 2…
## $ MarkDown1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ MarkDown2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ MarkDown3 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ MarkDown4 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ MarkDown5 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ CPI <dbl> 211.0964, 211.2422, 211.2891, 211.3196, 211.3501, 211.380…
## $ Unemployment <dbl> 8.106, 8.106, 8.106, 8.106, 8.106, 8.106, 8.106, 8.106, 7…
The dataset is described in more detail in the package documentation; see ?walmart_sales_weekly.
NOTE:
1. This is still a data.frame, with a time-oriented variable of course, and not yet a time-series object.
2. The Date column has repeated entries, one set for each Dept! To deal with this repetition, we will always need to group or split the Weekly_Sales by the Dept column before we plot or analyze.
Since our sales are weekly, we will convert Date to yearweek format:
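A minimal sketch of that conversion, with Dept as the key (the object name walmart_tsibble is just an illustrative choice):

```r
library(dplyr)
library(tsibble)

walmart_tsibble <- walmart_sales_weekly %>%
  mutate(Date = yearweek(Date)) %>%       # convert daily dates to year-week periods
  as_tsibble(index = Date, key = Dept)    # one weekly series per Dept
```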
The easiest way is to use autoplot from the feasts package. You may need to specify the actual measured variable, if there is more than one numerical column:
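For example, assuming the walmart_tsibble object created above:

```r
library(feasts)
library(ggplot2)

# autoplot() on a tsibble; name the measured variable explicitly
walmart_tsibble %>%
  autoplot(Weekly_Sales)
```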
The R package timetk gives us interactive plots that may be more evocative than the static plot above. The basic plotting function in timetk is plot_time_series(). There are arguments for the date variable, the value you want to plot, colours, groupings, etc.
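A minimal call might look like this (a sketch, not the exact code used for the plot above):

```r
library(timetk)

walmart_sales_weekly %>%
  plot_time_series(.date_var = Date,
                   .value    = Weekly_Sales,
                   .interactive = TRUE)
```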
Let us explore this dataset using timetk, using our trusted method of asking Questions:
Q.1 How are the weekly sales different for each Department?
There are only a handful of Departments in this subset of the data, so we should be fine plotting them all together and also facetting by them, as we will see in a bit:
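One way to do this (a sketch; the grouping and facet choices are assumptions):

```r
library(dplyr)
library(timetk)

walmart_sales_weekly %>%
  group_by(Dept) %>%                      # one panel per Department
  plot_time_series(Date, Weekly_Sales,
                   .facet_ncol = 2,
                   .interactive = TRUE)
```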
Q.2. What do the sales per Dept look like during the month of December (Christmas time) in 2012? Show the individual Depts as facets.
We can of course zoom into the interactive plot above, but if we were to plot it anyway:
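A sketch of one way to do that, using timetk::filter_by_time() (the exact date window is an assumption based on the question; adjust it to the date range present in your data):

```r
library(dplyr)
library(timetk)

walmart_sales_weekly %>%
  # keep only the December weeks of the chosen year
  filter_by_time(.date_var   = Date,
                 .start_date = "2012-12-01",
                 .end_date   = "2012-12-31") %>%
  group_by(Dept) %>%
  plot_time_series(Date, Weekly_Sales,
                   .facet_ncol = 2,
                   .interactive = FALSE)
```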
Clearly the “unfortunate” Dept #13 has seen something of a Christmas drop in sales, as has Dept #38! For the rest, all seems well…
Too much noise? How about some averaging?
Q.3 How do we smooth out some of the variations in the time series to be able to understand it better?
Sometimes there is too much noise in the time series observations and we want to take what is called a rolling average. For this we will use the function timetk::slidify() to create an averaging function of our choice, and then apply it to the time series using a regular dplyr::mutate(). Let's take the average of Sales for each month in each Department. Our function will be named rolling_avg_month:
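A minimal sketch of such a function (the window length of 4 weeks as a stand-in for one month is an assumption):

```r
library(timetk)

# slidify() converts an ordinary function (here, mean) into a
# rolling-window version of itself.
rolling_avg_month <- slidify(.f      = mean,
                             .period = 4,         # ~4 weekly observations per month
                             .align  = "center",
                             .partial = TRUE)
```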
OK, slidify creates a function! Let’s apply it to the Walmart Sales time series…
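Applying it with mutate() might look like this (a sketch, with no grouping before the averaging yet; see the note below):

```r
library(dplyr)
library(timetk)

walmart_rolled <- walmart_sales_weekly %>%
  mutate(Rolling_Sales = rolling_avg_month(Weekly_Sales))

# Plot the smoothed series, one panel per Dept
walmart_rolled %>%
  group_by(Dept) %>%
  plot_time_series(Date, Rolling_Sales,
                   .facet_ncol = 2,
                   .interactive = FALSE)
```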
The graphs are smoother now. We need to check whether the averaging was done on a per-Dept basis… should we have had a group_by(Dept) before the averaging, and an ungroup() before plotting? Try it!!
Each data point (\(Y_t\)) at time \(t\) in a Time Series can be expressed as either a sum or a product of 4 components, namely Seasonality (\(S_t\)), Trend (\(T_t\)), Cyclic (\(C_t\)), and Error (\(e_t\)) (a.k.a. White Noise).
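In symbols, the two standard forms are:

\[ Y_t = S_t + T_t + C_t + e_t \quad \text{(additive)} \]
\[ Y_t = S_t \times T_t \times C_t \times e_t \quad \text{(multiplicative)} \]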
Decomposing non-seasonal data means breaking it up into trend and irregular components. To estimate the trend component of a non-seasonal time series that can be described using an additive model, it is common to use a smoothing method, such as calculating the simple moving average of the time series.
timetk has the ability to achieve this: let us plot the trend, seasonal, cyclic, and irregular aspects of Weekly_Sales for Dept 38:
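One way is timetk::plot_stl_diagnostics(); a sketch, assuming we filter to Dept 38 first:

```r
library(dplyr)
library(timetk)

walmart_sales_weekly %>%
  filter(Dept == 38) %>%
  plot_stl_diagnostics(.date_var = Date,
                       .value    = Weekly_Sales,
                       .interactive = FALSE)
```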
We can do this for all Depts using fable and fabletools:
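A sketch using an STL decomposition (STL() comes from feasts, with model() and components() from fabletools), applied to the tsibble created earlier (walmart_tsibble is the assumed object name):

```r
library(fabletools)
library(feasts)

walmart_tsibble %>%
  model(stl = STL(Weekly_Sales)) %>%   # one STL model per Dept (the key)
  components() %>%
  autoplot()
```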
Let us try the flights dataset from the package nycflights13. Try data(package = "nycflights13") in your Console.
We have the following datasets in nycflights13:
Let us analyze the flights data:
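The structure shown below can be obtained with a quick glimpse:

```r
library(nycflights13)
library(dplyr)

glimpse(flights)
```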
## Rows: 336,776
## Columns: 19
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
We have time-related columns: apart from year, month, and day, we have time_hour. We also have time-event numerical data such as arr_delay (arrival delay) and dep_delay (departure delay), and categorical data such as carrier, origin, dest, flight, and the tailnum of the aircraft. It is also a large dataset, containing about 336K entries. Enough to play with!!
Let us replace the NAs in arr_delay and dep_delay with zeroes for now, and convert it into a time-series object with tsibble:
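A sketch of that step (the key columns chosen below are an assumption; the key + index combination must uniquely identify rows, so you may need to adjust it):

```r
library(dplyr)
library(tidyr)
library(tsibble)
library(nycflights13)

flights_ts <- flights %>%
  replace_na(list(arr_delay = 0, dep_delay = 0)) %>%
  as_tsibble(index = time_hour,
             key = c(carrier, flight, origin, dest, tailnum),
             regular = FALSE)

# If as_tsibble() complains about duplicate rows, inspect them with
# tsibble::duplicates() and refine the key accordingly.
```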
Let us proceed with our questions:
Q.1. Plot the monthly average arrival delay by carrier
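A sketch of one approach (names such as delay_by_month are illustrative):

```r
library(dplyr)
library(lubridate)
library(ggplot2)
library(nycflights13)

delay_by_month <- flights %>%
  mutate(month = floor_date(time_hour, unit = "month")) %>%
  group_by(carrier, month) %>%
  summarise(mean_arr_delay = mean(arr_delay, na.rm = TRUE),
            .groups = "drop")

delay_by_month %>%
  ggplot(aes(x = month, y = mean_arr_delay, colour = carrier)) +
  geom_line() +
  labs(title = "Monthly average arrival delay by carrier",
       x = NULL, y = "Mean arrival delay (minutes)")
```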
Q.2. Plot a candlestick chart for total flight delays by month for each carrier
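One possible sketch, building monthly open/high/low/close values from daily total delays and drawing them with plotly's candlestick trace (the aggregation choices and the monthly_ohlc name are assumptions, not the only way to frame this question):

```r
library(dplyr)
library(lubridate)
library(plotly)
library(nycflights13)

# Daily total delay per carrier, then monthly open/high/low/close values
monthly_ohlc <- flights %>%
  mutate(date = as.Date(time_hour)) %>%
  group_by(carrier, date) %>%
  summarise(total_delay = sum(dep_delay + arr_delay, na.rm = TRUE),
            .groups = "drop") %>%
  mutate(month = floor_date(date, unit = "month")) %>%
  group_by(carrier, month) %>%
  summarise(open  = first(total_delay),
            high  = max(total_delay),
            low   = min(total_delay),
            close = last(total_delay),
            .groups = "drop")

# One carrier at a time keeps the candlestick readable
monthly_ohlc %>%
  filter(carrier == "UA") %>%
  plot_ly(x = ~month, type = "candlestick",
          open = ~open, high = ~high, low = ~low, close = ~close)
```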