Author

Lim Li Ying

Visual Statistical Analysis with ggstatsplot

Getting Started

Installing and loading R packages

In this exercise, other than tidyverse, ggstatsplot will also be used. ggstatsplot is an extension of the ggplot2 package and allows details from statistical tests to be added to the information-rich plots.

Click to show/hide the code
pacman::p_load(tidyverse, ggstatsplot)

Loading the data

Click to show/hide the code
exam_data <- read_csv("data/Exam_data.csv")
Rows: 322 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): ID, CLASS, GENDER, RACE
dbl (3): ENGLISH, MATHS, SCIENCE

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Click to show/hide the code
exam_data
# A tibble: 322 × 7
   ID         CLASS GENDER RACE    ENGLISH MATHS SCIENCE
   <chr>      <chr> <chr>  <chr>     <dbl> <dbl>   <dbl>
 1 Student321 3I    Male   Malay        21     9      15
 2 Student305 3I    Female Malay        24    22      16
 3 Student289 3H    Male   Chinese      26    16      16
 4 Student227 3F    Male   Chinese      27    77      31
 5 Student318 3I    Male   Malay        27    11      25
 6 Student306 3I    Female Malay        31    16      16
 7 Student313 3I    Male   Chinese      31    21      25
 8 Student316 3I    Male   Malay        31    18      27
 9 Student312 3I    Male   Malay        33    19      15
10 Student297 3H    Male   Indian       34    49      37
# ℹ 312 more rows

One-sample test using gghistostats()

The code chunk below uses gghistostats() to create a visual of a one-sample test on English scores.

Click to show/hide the code
set.seed(1234)

gghistostats(
  data = exam_data,
  x = ENGLISH,
  type = "bayes",
  test.value = 60,
  xlab = "English scores"
)

Note

Set seed is important especially when doing variance statistics in order to ensure reproducibility.

Two-sample mean test using ggbetweenstats()

The code chunk below uses ggbetweenstats() to create a visual of a two-sample mean test of Maths scores by gender.

Click to show/hide the code
ggbetweenstats(
  data = exam_data,
  x = GENDER,
  y = MATHS,
  type = "np",
  messages = FALSE
)

Note

There are 4 types of statistical approach:

  1. p <- parametric

  2. np <- nonparametric

  3. r <- robust

  4. b <- bayes

Oneway ANOVA test using ggbetweenstats()

The code chunk below uses ggbetweenstats() to create a visual of a oneway ANOVA test of English scores by race.

Click to show/hide the code
ggbetweenstats(
  data = exam_data,
  x = RACE,
  y = ENGLISH,
  type = "p",
  mean.ci = TRUE,
  pairwise.comparisons = TRUE,
  pairwise.display = "s",
  p.adjust.method = "fdr",
  messages = FALSE
)

Note

For pairwise display,

  • s <- displays only significant pairwise comparisons

  • ns <- displays only non-significant pairwise comparisons

  • all <- displays all pairwise comparisons

Significant test of correlation using ggscatterstats()

The code chunk below uses ggscatterstats() to create a visual of a Significant Test of Correlation between Maths and English scores.

Click to show/hide the code
ggscatterstats(
  data = exam_data,
  x = MATHS,
  y = ENGLISH,
  marginal = FALSE
)

Significant test of association (dependence) using ggbarstats()

Firstly, the Maths scores are binned into 4 groups using cut().

Click to show/hide the code
exam_data1 <- exam_data %>%
  mutate(MATHS_bins =
  cut(MATHS, breaks = c(0,60,75,85,100))
)

The code chunk below uses ggbarstats() to create a visual of a Significant Test of Association between Maths scores and gender.

Click to show/hide the code
ggbarstats(
  data = exam_data1,
  x = MATHS_bins,
  y = GENDER
)

Visualizing Models

Getting Started

Installing and loading R packages

The code chunk below shows the R packages to be installed and loaded into the R environment for this exercise.

Click to show/hide the code
pacman::p_load(readxl, performance, parameters, see)

Loading the data

To import the excel worksheet ToyotaCorolla.xls, read_xls() of the readxl package is used, as seen in the code chunk below.

Click to show/hide the code
car_resale <- read_xls("data/ToyotaCorolla.xls")
car_resale
# A tibble: 1,436 × 38
      Id Model    Price Age_08_04 Mfg_Month Mfg_Year     KM Quarterly_Tax Weight
   <dbl> <chr>    <dbl>     <dbl>     <dbl>    <dbl>  <dbl>         <dbl>  <dbl>
 1    81 TOYOTA … 18950        25         8     2002  20019           100   1180
 2     1 TOYOTA … 13500        23        10     2002  46986           210   1165
 3     2 TOYOTA … 13750        23        10     2002  72937           210   1165
 4     3  TOYOTA… 13950        24         9     2002  41711           210   1165
 5     4 TOYOTA … 14950        26         7     2002  48000           210   1165
 6     5 TOYOTA … 13750        30         3     2002  38500           210   1170
 7     6 TOYOTA … 12950        32         1     2002  61000           210   1170
 8     7  TOYOTA… 16900        27         6     2002  94612           210   1245
 9     8 TOYOTA … 18600        30         3     2002  75889           210   1245
10    44 TOYOTA … 16950        27         6     2002 110404           234   1255
# ℹ 1,426 more rows
# ℹ 29 more variables: Guarantee_Period <dbl>, HP_Bin <chr>, CC_bin <chr>,
#   Doors <dbl>, Gears <dbl>, Cylinders <dbl>, Fuel_Type <chr>, Color <chr>,
#   Met_Color <dbl>, Automatic <dbl>, Mfr_Guarantee <dbl>,
#   BOVAG_Guarantee <dbl>, ABS <dbl>, Airbag_1 <dbl>, Airbag_2 <dbl>,
#   Airco <dbl>, Automatic_airco <dbl>, Boardcomputer <dbl>, CD_Player <dbl>,
#   Central_Lock <dbl>, Powered_Windows <dbl>, Power_Steering <dbl>, …

Multiple linear regression modelling using lm()

The following code chunk is uses lm() to build a multiple linear regression model.

Click to show/hide the code
model <- lm(Price ~ Age_08_04 + Mfg_Year + KM + 
              Weight + Guarantee_Period, data = car_resale)
model

Call:
lm(formula = Price ~ Age_08_04 + Mfg_Year + KM + Weight + Guarantee_Period, 
    data = car_resale)

Coefficients:
     (Intercept)         Age_08_04          Mfg_Year                KM  
      -2.637e+06        -1.409e+01         1.315e+03        -2.323e-02  
          Weight  Guarantee_Period  
       1.903e+01         2.770e+01  

Model diagnostics: checking for multicollinearity

Multicollinearity can be checked using check_collinearity() of the performance package, as seen in the code chunk below.

Click to show/hide the code
check_collinearity(model)
# Check for Multicollinearity

Low Correlation

             Term  VIF     VIF 95% CI Increased SE Tolerance Tolerance 95% CI
               KM 1.46 [ 1.37,  1.57]         1.21      0.68     [0.64, 0.73]
           Weight 1.41 [ 1.32,  1.51]         1.19      0.71     [0.66, 0.76]
 Guarantee_Period 1.04 [ 1.01,  1.17]         1.02      0.97     [0.86, 0.99]

High Correlation

      Term   VIF     VIF 95% CI Increased SE Tolerance Tolerance 95% CI
 Age_08_04 31.07 [28.08, 34.38]         5.57      0.03     [0.03, 0.04]
  Mfg_Year 31.16 [28.16, 34.48]         5.58      0.03     [0.03, 0.04]
Note

Variables with a Variance Inflation Factor (VIF) score of >10 are considered to be highly correlated. In this case, age and manufacturing year have high correlation. Therefore one of the variables will need to be removed from subsequent models.

Click to show/hide the code
check_c <- check_collinearity(model)
plot(check_c)
Variable `Component` is not in your data frame :/

Model diagnostic: checking normality assumption

The normality assumption can be checked using check_normality() of the performance package, as seen in the code chunk below.

Click to show/hide the code
# build another model after removing "Mfg_Year"
model1 <- lm(Price ~ Age_08_04 + KM + 
              Weight + Guarantee_Period, data = car_resale)
check_n <- check_normality(model1)
plot(check_n)

Model diagnostic: checking for homogeneity of variances

Homogeneity of variances can be checked using check_heteroscedasticity() of the performance package, as seen in the code chunk below.

Click to show/hide the code
check_h <- check_heteroscedasticity(model1)
plot(check_h)

Model diagnostic: complete check

We can also perform a complete check using check_model().

Click to show/hide the code
check_model(model1)

Visualizing regression parameters using see

Click to show/hide the code
plot(parameters(model1))

Visualizing regression parameters using ggcoefstats()

Click to show/hide the code
ggcoefstats(model1,
            output = "plot")