class: center, middle, inverse, title-slide

# Programming Tools in Data Science
## Lecture #10: R package
### Samuel Orso
### 22 November 2022

---

# Why make an R package?

* Distributing code to other users.
* Forces you to follow strict coding conventions and work processes.
* Stability of the code with longer-term maintenance and testing.
* Ease of use when accumulating many functions.

---

# Setup

* You will need (at least) the following packages:

```r
install.packages(c("devtools", "knitr", "pkgdown", "roxygen2", "testthat"))
```

* Make sure your system is ready!

```r
devtools::has_devel()
```

```
## Your system is ready to build packages!
```

(otherwise visit <https://r-pkgs.org/setup.html>)

---
class: sydney-blue, center, middle

# Demo

---

# DESCRIPTION file

* DESCRIPTION contains the metadata of your package (authors, description, dependencies, contact, ...)
* It should look like

```r
Package: pkgtest
Type: Package
Title: What the Package Does (Title Case)
Version: 0.1.0
Authors@R: person("John", "Doe", email = "john.doe@example.com", role = c("aut", "cre"))
Maintainer: The package maintainer <yourself@somewhere.net>
Description: More about what it does (maybe more than one line)
    Use four spaces when indenting paragraphs within the Description.
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
URL: https://github.com/ptds2022/pkgtest
BugReports: https://github.com/ptds2022/pkgtest/issues
RoxygenNote: 7.1.2
```

---

* Use the `person` function for `Authors@R`; roles include:
    a. `"cre"`: (creator) the package maintainer;
    b. `"aut"`: (author) those who made substantial contributions to the package;
    c. `"ctb"`: (contributor) those who made smaller contributions;
    d. `"cph"`: (copyright holder) the legal name of an institution or corporate body.
* `License`: since the point of a package is to be distributed to others, you need to [choose a licence](https://choosealicense.com/licenses/).
For example, [MIT](https://choosealicense.com/licenses/mit/) is permissive and can be set up with

```r
usethis::use_mit_license()
```

---

# Dependencies

* DESCRIPTION lists all the packages needed for your package to work.
* `Depends` specifies the required version of `R`; e.g.

```r
Depends: R (>= 4.0.0) # don't forget the space!
```

* `Imports` lists the packages that must be present (best practice is to write `pkg::fct()`); for example, suppose you need `ggplot2` and `dplyr`

```r
Imports:
    dplyr (>= 1.0.7),
    ggplot2 (>= 3.3.5)
```

Version constraints ensure that users have a recent enough version of each package.

* `Suggests` lists packages that can be used (for vignettes, tests, datasets, ...) but are not required.

---

# Documenting your package

* Documentation appears in the `man/` (manual) subfolder as `*.Rd` files.
* We will generate documentation automatically using `roxygen2`.
* You can either use `devtools::document()` or, maybe simpler,

<img src="images/roxygen2.png" width="1439" style="display: block; margin: auto;" />

---

* `roxygen2` uses comments starting with `#'` and tags starting with `@`, placed right before the function, e.g.

```r
#' @title hello world function
#' @return print a message
#' @export
hello <- function() {
  print("Hello, world!")
}
```

* The main tags for functions are `@title`, `@param`, `@author`, `@seealso`, `@details`, `@examples`, `@return` (click [here](https://r-pkgs.org/man.html) for more details)
* **All** functions should be documented. **Some** should be exported (`#' @export`)
* **Do repeat yourself**

---

.pull-left[
<img src="images/pkgtest_hello_world.png" width="791" style="display: block; margin: auto;" />
]

.pull-right[
```r
#' @title hello world function
#' @author John Doe
#' @details
#' A super fancy function to print Hello World!
#' @return print a message
#' @examples
#' \dontrun{hello()}
#' @export
hello <- function() {
  print("Hello, world!")
}
```
]

---

# Adding data

* It is common to add data to a package.
* Data should be placed in the `data/` folder.
* It is recommended to add data as `*.rda` files.
* The easiest way to achieve that is the command `usethis::use_data()`.

---

# Data preparation

* Most of the data you will want to add is not in the `*.rda` format.
* You may have some raw data that requires some manipulation before you obtain the final clean data that will be made available to users.
* It is highly recommended to keep the raw data and the code used for data wrangling.
* The easiest way to achieve that is the command `usethis::use_data_raw()`: it creates a new folder `data-raw/`, which is added to `.Rbuildignore`.

---

# R Packages, Hadley Wickham and Jenny Bryan

[8.2.1 Preserve the origin story of package data](https://r-pkgs.org/data.html#sec-data-data-raw)

> ggplot2: A cautionary tale
>
> We have a confession to make: the origins of many of ggplot2's example datasets has been lost in the sands of time. In the grand scheme of things, this is not a huge problem, but maintenance is certainly more pleasant when a package's assets can be reconstructed de novo and easily updated as necessary.

---

# A simple example

* Suppose you want to make the `supermarket_sales.csv` data available to the users.
* One straightforward way to achieve that is with this code:

```r
## code to prepare `supermarket_sales.csv` dataset
supermarket <- read.csv(file = "data-raw/supermarket_sales.csv")
usethis::use_data(supermarket, overwrite = TRUE)
```

* The code is placed in the `data-raw/` folder and kept for future use (but omitted when the package is built).

---

# Documenting dataset

* There are two tags useful for documenting a dataset:
    * `@format` provides an overview of the dataset,
    * `@source` gives details on where the data was obtained.

```r
#' Supermarket sales data from Kaggle
#'
#' @format ## `supermarket`
#' A data frame with 1,000 rows and 17 columns:
#' \describe{
#'   \item{Invoice.ID}{Computer generated sales slip invoice identification number}
#'   ...
#' }
#' @source <https://www.kaggle.com/datasets/aungpyaeap/supermarket-sales>
"supermarket"
```

---

# `.Rbuildignore`

* `.Rbuildignore` is the analog of `.gitignore` for an `R` package: it is where you specify files and folders that should be ignored when building the package.

```r
^.*\.Rproj$
^\.Rproj\.user$
^LICENSE\.md$
^\.github$
^data-raw$
```

---

# Vignettes

* A vignette is an R Markdown document that provides more insights into your package.
* Simply call `usethis::use_vignette("my-vignette")` to create `my-vignette`.
* Add required packages in DESCRIPTION under `Suggests`.

---

# Namespace

> Writing R Extensions, [Sec. 1.5](https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Package-namespaces)
>
> The namespace controls the search strategy for variables used by functions in the package. If not found locally, R searches the package namespace first, then the imports, then the base namespace and then the normal search path (so the base namespace precedes the normal search rather than being at the end of it).

* NAMESPACE is generated automatically by `roxygen2`

---

# Testing with examples

* Testing ensures that your code works as intended and pays off in the long run.
* Examples are a good way to make sure the functions work, and they are displayed to the user.
* You can put more complex examples in `inst/examples/my_example.R` and include them with `@example inst/examples/my_example.R`, so that they are run when the package is checked.

---

# Example

In `R/`

```r
#' @title Compute regression coefficients
#' @param x design \code{matrix}
#' @param y \code{vector} of responses
#' @details
#' Compute the regression coefficients using \link[stats]{lm}.
#' @importFrom stats lm coef
#' @seealso \code{\link[stats]{lm}}, \code{\link[stats]{coef}}
#' @example inst/examples/eg_reg_coef.R
#' @export
`%c%` <- function(y, x) {
  fit <- lm(y ~ x)
  coef(fit)
}
```

In `inst/examples/eg_reg_coef.R`

```r
## linear regression
cars$speed %c% cars$dist
```

---

If you click on `check`

<img src="images/pkg_check.png" width="606" height="505" />

---

Now suppose there is a mistake in the code, for instance in `inst/examples/eg_reg_coef.R`

```r
## linear regression
cars$speed %c% cars
```

<img src="images/pkg_check2.png" width="600" height="500" />

---

# Testing with `testthat`

* Examples help to detect errors in the code, but their primary goal is to inform users.
* Examples are displayed to the users and concern only the final, user-facing functions.
* It is good practice to have broader and automated tests.
* We are going to use `testthat`. Simply call `usethis::use_testthat()`.
* When should you test a function?

> Whenever you are tempted to type something into a print statement or a debugger expression, write it as a test instead.
>
> Martin Fowler

---

# Structure of `testthat`

* `testthat` is organised hierarchically:
    1. An **expectation**: a single check using an `expect_some_fct` function; these functions test an expression and throw an error if the result disagrees with what was expected.
    2. A **test**: groups one or several **expectations** and is created with `test_that`.
    3. A **test file**: groups one or several **tests**. It is an `R` file whose name and location follow this example: `tests/testthat/test_something.R`.

---

For example, the file `tests/testthat/test_reg_coef.R`

```r
test_that("regression coefficient input check",{
  expect_error(cars$speed %c% cars)
})

test_that("regression coefficient output",{
  expect_type(cars$speed %c% cars$dist, "double")
})
```

---

# Automated checking

* Just because you and your team do not encounter any bugs does not mean that everything is okay.
* `R` users have different configurations and different operating systems.
* It is good practice to use GitHub Actions: every time you push changes to the main repo, GitHub runs the actions you have specified.
* To begin with, use `usethis::use_github_action_check_standard()` (or, in recent versions of `usethis`, `usethis::use_github_action("check-standard")`)
* More examples are displayed at <https://github.com/r-lib/actions/tree/master/examples>

---

and if everything passes

<img src="images/github_action.png" width="1413" />

---

# pkgdown

<img src="images/pkgdown.png" style="height:150px; width:150px; position:absolute; top:7%; right:5%;"/>

* It is a quick and automated way to create a website around your package.
* Building your first website is as simple as

```r
# Run once to configure your package to use pkgdown
usethis::use_pkgdown()
# Run to build the website
pkgdown::build_site()
```

* It is also a good idea to add a GitHub action:

```r
usethis::use_github_action("pkgdown")
```

* Check out <https://pkgdown.r-lib.org/> for more details.

---

<img src="images/pkgdow_pkgtest.png" width="2273" style="display: block; margin: auto;" />

---

Find all the code presented here: <https://github.com/ptds2022/pkgtest>

---

# To go further

* More details and examples in the book [An Introduction to Statistical Programming Methods with R](https://smac-group.github.io/ds/section-r-packages.html)
* More material and details in [R Packages](https://r-pkgs.org/).
* A lot of details (really!) in [Writing R Extensions](https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Creating-R-packages)

---
class: sydney-blue, center, middle

# Questions?
.pull-down[ <a href="https://ptds.samorso.ch/"> .white[<svg viewBox="0 0 384 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M369.9 97.9L286 14C277 5 264.8-.1 252.1-.1H48C21.5 0 0 21.5 0 48v416c0 26.5 21.5 48 48 48h288c26.5 0 48-21.5 48-48V131.9c0-12.7-5.1-25-14.1-34zM332.1 128H256V51.9l76.1 76.1zM48 464V48h160v104c0 13.3 10.7 24 24 24h104v288H48z"></path></svg> website] </a> <a href="https://github.com/ptds2021/"> .white[<svg viewBox="0 0 496 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"></path></svg> GitHub] </a> ] --- # Exercise * Create a 
package from a new RStudio project with the following function:

```r
`%c%` <- function(y, x) {
  fit <- lm(y ~ x)
  coef(fit)
}
```

* Modify the DESCRIPTION: add an author, a license, dependencies, ...
* Document the function using roxygen2 (verify your `Build tools` options).
* Add the `supermarket` dataset from HW2 to the package (keep the raw data, create a `.rda` file, document the dataset).
* Construct a vignette.
* Add examples on how to use the function.
* Add tests with `testthat`.

---

* Add automated checks with a GitHub action.
* Create a website with `pkgdown` and add a GitHub action to build the website.
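---

Before writing the examples and `testthat` tests for the exercise, it can help to check interactively what `%c%` returns and how it fails. The following is only a minimal sketch in base `R`, assuming the built-in `cars` dataset used earlier in these slides; the object names `beta` and `res` are illustrative and not part of `pkgtest`.

```r
# Operator from the exercise: regress y on x and return the coefficients.
`%c%` <- function(y, x) {
  fit <- lm(y ~ x)
  coef(fit)
}

# Valid call on the built-in `cars` data: speed regressed on dist.
beta <- cars$speed %c% cars$dist
print(beta)

# Invalid call: a whole data frame on the right-hand side makes lm() fail.
# tryCatch() captures the error condition instead of stopping the script.
res <- tryCatch(cars$speed %c% cars, error = function(e) e)
print(inherits(res, "error"))
```

`beta` should be a named numeric vector of length two (intercept and slope) and `res` an error condition; these are the behaviours that `expect_type()` and `expect_error()` check in `tests/testthat/test_reg_coef.R`.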