Wednesday, 10 May 2017

Statistical Thesaurus update, and generating cross-references in R

This post is about an expansion of the statistics thesaurus that was started in an earlier post. The entries in The Oxford Dictionary of Statistical Terms, 6th edition, edited by Yadolah Dodge, were an inspiration to update the thesaurus, and were used as a to-do list of terms to include. Compared to the Oxford Dictionary, this thesaurus is much less technical and more heavily cross-referenced. Perhaps with time, it could be a workable companion volume.

Because the thesaurus is digital, it makes sense that the cross-references include hyperlinks. Doing this manually is problematic, however. If I only include links to existing entries as I write them, then early entries will either be missing links or will need to be checked and reread as more entries are added. If I include links to entries that don't yet exist in order to save time, then I risk leaving dead links to entries I neglect to fill in later.

Instead, I've written this cute little R program that scans the explanations in my entries for mentions of names of other entries, and creates the appropriate hyperlinks automatically.

For example, the entry under "average" includes mentions of "mean" and "median", both of which are also entries. This program identifies these cross-references and updates the entry for "average" with HTML code to make "mean" and "median" hyperlinks to anchors at their respective entries. Furthermore, when making links, terms that include other terms in their names take priority, so a mention of "negative binomial distribution" will correctly link to that, and not to "binomial distribution" or "distribution".

Before:
Average: Colloquial term for "mean". Can also refer to any measure of centrality, such as the "median", or "trimmed mean", but much less often.

After:
<a name="AVERAGE">Average</a>: Colloquial term for "<a href="#MEAN">mean</a>". Can also refer to any measure of centrality, such as the "<a href="#MEDIAN">median</a>", or "<a href="#TRIMMED MEAN">trimmed mean</a>", but much less often.
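
The core idea can be sketched in a few lines of R. This is a simplified illustration, not the program linked below: the function names `link_entry` and `build_entry` are hypothetical, entry names are assumed to contain no regular-expression metacharacters, and an entry is assumed never to link to itself. Sorting the entry names longest-first and matching them as one alternation is one way to get the "longer term takes priority" behaviour, because a Perl-style regex tries alternatives in order at each position.

```r
# Hypothetical helper: wrap mentions of other entry names in anchor links.
# Assumes entry names contain no regex metacharacters.
link_entry <- function(text, own_term, terms) {
  # Skip the entry's own name (an assumption about the desired behaviour).
  targets <- setdiff(terms, own_term)
  if (length(targets) == 0) return(text)

  # Longest names first, so "trimmed mean" is tried before "mean".
  targets <- targets[order(nchar(targets), decreasing = TRUE)]
  pattern <- paste(targets, collapse = "|")

  # Case-insensitive search; each match is consumed once, so the "mean"
  # inside "trimmed mean" is never linked a second time.
  hits <- gregexpr(pattern, text, ignore.case = TRUE, perl = TRUE)
  regmatches(text, hits) <- lapply(regmatches(text, hits), function(found) {
    sprintf('<a href="#%s">%s</a>', toupper(found), found)
  })
  text
}

# Hypothetical helper: add the named anchor for the entry itself.
build_entry <- function(term, explanation, terms) {
  display <- paste0(toupper(substring(term, 1, 1)), substring(term, 2))
  paste0('<a name="', toupper(term), '">', display, '</a>: ',
         link_entry(explanation, term, terms))
}

terms <- c("average", "mean", "median", "trimmed mean")
explanation <- paste0(
  'Colloquial term for "mean". Can also refer to any measure of ',
  'centrality, such as the "median", or "trimmed mean", but much less often.'
)
cat(build_entry("average", explanation, terms), "\n")
```

Run on the "average" entry, this sketch should reproduce the "After" line shown above. Because it matches plain substrings, it also shares the false-positive weakness of any approach that ignores word boundaries.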

It's not a perfect system. It fails to make a hyperlink if the source entry text and the destination entry disagree on hyphenation or on Canadian/American/British spelling. Common words like "meaning" will produce false positives because they contain the entry name "mean" (although it's good practice to avoid words like "meaning" when talking about statistics anyway). The program is, however, case-insensitive in its reference discovery. So far it's catching nearly every reference and linking correctly, but there are only about 80 entries at this point.
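
Two of these limitations could plausibly be reduced with a stricter pattern. The sketch below is a possible variation, not something the current program does: `find_mentions` is a hypothetical name, `\b` word boundaries stop "mean" from matching inside "meaning", and each space or hyphen in an entry name is allowed to match either character in the text. Spelling differences are not handled, and entry names are again assumed to be free of regex metacharacters.

```r
# Hypothetical matcher: whole words only, tolerant of hyphen/space differences.
# Returns the canonical entry names, labelled with the text that matched them.
find_mentions <- function(text, terms) {
  ordered <- terms[order(nchar(terms), decreasing = TRUE)]

  # "chi-squared test" becomes "chi[- ]squared[- ]test".
  flexible <- gsub("[- ]", "[- ]", ordered)
  pattern <- paste0("\\b(?:", paste(flexible, collapse = "|"), ")\\b")

  hits <- gregexpr(pattern, text, ignore.case = TRUE, perl = TRUE)
  found <- regmatches(text, hits)[[1]]

  # Map each match back to its entry name by ignoring case and hyphens.
  normalise <- function(x) gsub("-", " ", tolower(x))
  canonical <- ordered[match(normalise(found), normalise(ordered))]
  setNames(canonical, found)
}

mentions <- find_mentions(
  "The meaning of a chi squared test differs from the mean.",
  c("mean", "chi-squared test")
)
print(mentions)
```

The trade-off is that whole-word matching would also skip plurals such as "means", so any such change would need checking against the existing entries.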

Here are the Google Drive files of the original text, the resulting HTML page and the R code.

Here is an older example of the HTML page that is output for these 80 entries.

Disagreements on my thesaurus entries are encouraged. Please tell me if you think something should be added, removed, or changed. I would much rather be wrong temporarily than permanently.