Sunday 7 April 2019

Writing R documentation, simplified

A massive part of statistical software development is the documentation. Good documentation is more than just a help file: it serves as commentary on how the software works, includes use cases, and cites any relevant sources.

One cool thing about R documentation is that it uses a system that allows it to be put into a variety of different formats while only needing to be written once.

Getting Started

Templates for datasets and functions are available in RStudio under New File -> R Documentation.

Once you have a template, and it is saved somewhere, you can preview it.
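
If you are not working in RStudio, a similar skeleton can be generated from the R console. The sketch below uses promptData() from the utils package; the small ducks data frame is a hypothetical stand-in, not the real dataset, and the file goes to a temporary directory only to keep the example tidy.

```r
# Hypothetical stand-in for the ducks data; the real dataset is not reproduced here.
ducks <- data.frame(nest_id = factor(c(1, 1, 2)),
                    weight  = c(41.2, 39.8, 44.1))

# Write a skeleton ducks.Rd into a temporary directory.
rd_file <- file.path(tempdir(), "ducks.Rd")
utils::promptData(ducks, filename = rd_file)

# Look at the first few lines of the generated template.
skeleton <- readLines(rd_file)
writeLines(head(skeleton, 8))
```

The skeleton still has placeholder text in most fields, so it needs editing before it is useful. For functions, prompt() plays the same role.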

Now how do we actually write the duc... documentation?

Documenting Datasets

First, let’s look at the documentation for datasets, because it’s simpler than that of functions. Below is an example for the synthetic data on baby duck behaviour.

\name{ducks}
\docType{data}
\alias{ducks}
\alias{babyducks}
\title{Baby Duck Dataset}
\description{
This synthetic dataset describes a total of 
160 baby ducks in a total of 25 nests. 
Variables include 
nest ID (factor), 
nest size (count), 
sex (factor),
weight in grams (numeric), 
brightness in an arbitrary reflective index (numeric), 
noise events per hour (numeric), and 
bravery in mean millimetres from mother (numeric).
}
\usage{ducks}
\format{A data frame of 160 cases on 7 variables.}
\source{Synthetic}
\references{
  Davis, J. (2019) \emph{Writing for Statisticians}.
  New York: Wiley.
}
\keyword{datasets}

Let’s break this down field by field.


\name{ducks} defines the base name of the .Rd file. There can be only one name and it has to be plain text.

\docType{data} tells R to expect the fields specific to a dataset. (Functions are the default and do not need a docType to be defined. Documenting an entire package DOES need the docType field.)

\alias{ducks} and \alias{babyducks} define the searchable topics that this documentation file covers. You can have more than one of these.

\title{Baby Duck Dataset} defines what shows up at the top of the page.

\description{} This is the full, detailed description of the dataset. This can include some limited LaTeX code like \emph{}.

\usage{ducks} This is where the detailed intended use of the function or dataset goes. This is not the same as example code. For functions, this is much more detailed. For datasets it is always just the name of the dataset.

\format{} Is a dataset specific field that defines the format (e.g. vector, matrix, time object, data frame), as well as the size of the data.

\source{} The physical source of the data if it’s real. This dataset is made up, so the source is merely ‘synthetic’.

\references{} The bibliographic source of the data, if there is one. Note the \emph{} LaTeX code for emphasis.

\keyword{datasets} For meta-data purposes. This is always ‘datasets’ for datasets, but is more detailed in functions.
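
One way to confirm which fields a .Rd file actually defines is to parse it from R. The sketch below writes a shortened version of the ducks file (its contents are an assumption pieced together from the fields described above) and lists the top-level tags using parse_Rd() from the tools package.

```r
# Assumed contents of ducks.Rd, pieced together from the fields described above.
rd_text <- r"---(\name{ducks}
\docType{data}
\alias{ducks}
\alias{babyducks}
\title{Baby Duck Dataset}
\description{Synthetic data on 160 baby ducks in 25 nests.}
\usage{ducks}
\format{A data frame of 160 cases on 7 variables.}
\source{Synthetic}
\keyword{datasets}
)---"

rd_file <- file.path(tempdir(), "ducks.Rd")
writeLines(rd_text, rd_file)

# Parse the file and pull out the tag of each top-level element.
rd <- tools::parse_Rd(rd_file)
tags <- vapply(rd, function(x) attr(x, "Rd_tag"), character(1))

# Drop the whitespace between fields and show what is left.
fields <- tags[tags != "TEXT"]
print(fields)
```

If a field you expected is missing from the printed list, it is usually a sign of a typo in the field name or an unbalanced brace.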

Documenting Functions

Now, let’s consider the documentation for functions. Here’s the documentation for a couple of related functions I used for extracting continuous data information from binned data.


\name{get.KL.div}
\title{Compute Kullback-Leibler Divergence from Binned Data}
\description{
  Takes the parameters of a continuous distribution, such as the exponential or the log-normal, and a binned distribution, and estimates the Kullback-Leibler divergence between the two distributions.
}
\arguments{
  \item{x}{A vector of the parameter values of the continuous distribution, in the order that they appear in their respective pdist() functions.}
  \item{cutoffs}{A vector of length (b+1) boundaries that define the b bins.}
  \item{group_prob}{A vector of length b defining the proportion of observations in each bin.}
}



## Input some observed frequencies and bins.
group_freq <- c(114,76,58,51,140,107,77,124,42)
cutoffs <- c(0,25,50,75,100,250,500,1000,5000,20000)
group_prob <- group_freq / sum(group_freq)

## Get initial estimates for exponential distribution
midpoints <- 1/2 * (cutoffs[-1] + cutoffs[-length(cutoffs)])
init_par_exp <- sum(midpoints * group_prob)

## Find the KL Divergence at the initial estimates
get.KL.div.exp(init_par_exp, cutoffs, group_prob)

## Find optimal parameter values
## Which will have the lowest KL Divergence
result_exp <- optim(par = init_par_exp,
               fn = get.KL.div.exp,
               method = "Brent", lower=1, upper=10000,
               cutoffs=cutoffs, group_prob=group_prob)
result_exp$par ## This is the parameter for the distribution
result_exp$value ## This is the KL Divergence at the best fit

Let’s break this down field by field again.

\name{get.KL.div} defines the base name of the .Rd file.

\alias{} defines the searchable topics, in this case the three related functions.

\title{} self-explanatory.

\description{} Describes what the function(s) does/do in human-readable terms.

\usage{} The intended use of the function(s). If there were defaults to any of the arguments, they would be specified as ‘argument = default’ (example: x=5). If there was a limited set of options, they would be shown in the argument list as ‘argument = c("Option 1", "Option 2", "Option 3")’ (example: distribution = c("Exponential", "Log-Normal", "Gamma")), where the first option listed is the default.
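
To see why the first listed option counts as the default, here is a minimal sketch of a function written in that style. The function describe_fit() is hypothetical and exists only for illustration; match.arg() is the base R helper that usually sits behind this pattern.

```r
# Hypothetical function, only to illustrate how defaults appear in a signature.
# Its \usage entry would read:
#   describe_fit(x = 5, distribution = c("Exponential", "Log-Normal", "Gamma"))
describe_fit <- function(x = 5,
                         distribution = c("Exponential", "Log-Normal", "Gamma")) {
  # match.arg() picks the first option when the caller supplies none,
  # and rejects anything outside the listed options.
  distribution <- match.arg(distribution)
  paste0(distribution, " with x = ", x)
}

default_call <- describe_fit()
custom_call  <- describe_fit(x = 2, distribution = "Gamma")
print(default_call)
print(custom_call)
```

Calling it with an option that is not in the list, such as distribution = "Weibull", stops with an error, which is worth mentioning in the \arguments{} description.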

\arguments{} Describes in human-readable terms what each of the arguments does. Note that they are put into a list format with the \item{argument} commands.

\examples{} Gives at least one self-contained example of how the function is used, if such an example is appropriate. Note that the raw data used by the function is defined in the example. In this example, most of the code is also put in a \dontrun{} wrapper which allows the code to be displayed in the documentation and readily copy-pasted, but prevents the code from running if someone were to call it with the example() function.

\keyword{bins} This describes the general category for this function. Keywords are not required for functions, and more than one keyword is allowed.

For much more information on writing R documentation, please see the original CRAN guide: https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Writing-R-documentation-files

Saving your Results

RStudio can make previews, but it’s a little harder to save the R documentation files as anything you can view outside of R. (You can call them in R with the ?[name] command, but they need to be installed to the right location first.)

To turn your .Rd file into something you can share, run code like this:

system("R CMD Rdconv --type=html --output=ducks.html ducks.Rd")

This tells the system “run Rd conversion, convert to HTML, as a file ducks.html, using the file ducks.Rd”. The .Rd file needs to be in the directory that you have set with setwd() (or set manually with point and click), and the .html file is written to that same directory.

To see the details of what the R system command “Rdconv” can do, run ‘help’.

system("R CMD Rdconv --help")

Here, it shows the other options for ‘type’, such as txt for plain text, latex for LaTeX output (for including formulae), and example if you just want the examples from your documentation.
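
The same conversions can also be done without a system call, using the Rd2HTML() and Rd2txt() functions from the tools package. This is a minimal sketch rather than a replacement for Rdconv; the tiny .Rd file below is an assumed stand-in so that the example runs on its own.

```r
# A tiny stand-in .Rd file (assumed content) so the example is self-contained.
rd_text <- r"---(\name{ducks}
\alias{ducks}
\title{Baby Duck Dataset}
\description{Synthetic data on baby duck behaviour.}
)---"
rd_file <- file.path(tempdir(), "ducks.Rd")
writeLines(rd_text, rd_file)

# Convert to HTML and to plain text without leaving the R session.
html_file <- file.path(tempdir(), "ducks.html")
txt_file  <- file.path(tempdir(), "ducks.txt")
tools::Rd2HTML(rd_file, out = html_file)
tools::Rd2txt(rd_file, out = txt_file)
```

Both functions write to the console when out is left at its default, which is handy for a quick look before saving anything.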

There is one other format, PDF, which is generated using Rd2pdf instead of Rdconv. Rd2pdf also takes in multiple Rd files at once because it’s designed to provide documentation for an entire package. 
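
A call to Rd2pdf might look like the sketch below. The file names and the output name docs.pdf are assumptions for illustration, and the command only runs if those files exist in the working directory. It also needs a working LaTeX installation.

```r
# Assumed file names; replace with your own .Rd files.
rd_files <- c("ducks.Rd", "get.KL.div.Rd")

# --no-preview skips opening a viewer; --force overwrites an existing PDF.
cmd <- paste("R CMD Rd2pdf --no-preview --force --output=docs.pdf",
             paste(rd_files, collapse = " "))

# Only run when the files are present; this also needs a LaTeX installation.
if (all(file.exists(rd_files))) system(cmd)
```

As with Rdconv, running system("R CMD Rd2pdf --help") lists the remaining options.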

See this R webpage for more information.

Style guidelines

There are further style and compatibility guidelines here:  https://developer.r-project.org/Rds.html
But a few that you should be especially aware of are…

  • Titles should be plain text, without markup (formatting code like bold)
  • Every argument that the function (or functions) uses should be described in the argument list, including the … for any arguments that will be passed on to other functions inside the function you’re documenting.
  • Use British spelling. (things that end in -ise or -our)
  • Avoid ampersands ( & ) unless quoting.
  • Don’t use tabs to indent (for compatibility across different systems)
  • Use \dQuote instead of quotation marks (to prevent R parsing errors)
  • Make your examples as self-contained as possible. Don’t depend on internet access or certain operating systems.
  • Use TRUE and FALSE, not T and F
  • Use <- , not =
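
The TRUE and FALSE guideline is easy to justify with a short sketch: T and F are ordinary variables that any script can overwrite, while TRUE and FALSE are reserved words.

```r
# T is just a variable that happens to hold TRUE; it can be reassigned.
T <- 0
shadowed <- isTRUE(T)     # no longer TRUE, because T now holds 0
reserved <- isTRUE(TRUE)  # TRUE cannot be reassigned, so this is safe

print(shadowed)
print(reserved)

# Remove our copy so T refers to the built-in value again.
rm(T)
```

Example code that relies on T or F can therefore break in a session where someone has used those names for their own data.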

Finally, Chica can explain how her SnugR function works.
