## Monday, 12 November 2018

### Four OJS manuscript reviews, 2015-2018

Here is a dump of the remaining reviews I made for Scirp's Open Journal of Statistics from 2015 to 2018. For reasons explained in The Last Review I'll Ever Do For OJS, I won't provide additional linking information. These reviews are here as how-to examples.

#### Paper 1 - An applications paper on facial recognition machine learning.

I enjoyed this paper and found it to be of at least the quality and within the scope of methodological work that OJS has published in the past. I recommend that it be accepted for publication with only a few minor edits, as listed below.
The mathematics seems watertight. The flow was logical and complete, and I failed to recognize any errors. As such, most of my comments pertain to the writing and results.

Concerns:

My main concern is the limited scope of the training set. How well does ZNMF compare to CNMF for larger sets of faces? Does the required information of ZNMF begin to creep up to that of CNMF as n increases? Does the average recognition rate drop? Is 644 pixels a realistic resolution for modern images of faces?

Using another database is mentioned in the discussion, and I encourage the authors to explore this further in another paper.

Also, there appears to be a mismatch with your references. In several places, reference [4] is called 'Wang et al.', and is mentioned in reference to resolution bounds. However, in my copy of the paper, [4] has Huang, X. as the first author, and pertains to discussion messages, not images. [3] and [5], however, do fit these criteria. Please go through each of your references to ensure that they match their stated numbers.

Edits:
• Spelling mistake between (9) and (10): 'updats'.
• Between (22) and (23), I think it's 'benchmark', not 'bench mark'.
• I would recommend rewriting the sentence "The Cambridge ORL database consists of 400 gray-scale facial images from 40 predominantly male subjects (10 images per subject)" as "The Cambridge ORL database consists of 10 gray-scale facial images each of XX male and XX female subjects.", which is shorter and more precise.
• The "size" of your training dataset is ambiguous. Please consider changing the sentence portion "subjects resulting in a training dataset of size 200" to "subjects resulting in a training dataset of 200 images of 644 pixels each".
• In "ZNMF simulations, were than held constant", 'than' should be 'then'.

#### Paper 2 - A compression method that used an 'expanded alphabet'.

The 'expanded alphabet' was the author's term for n-grams. This paper was well written, but the author had mistakenly thought they were discovering something new in the idea of conditional information.

The premise of this paper is that a message can be compressed beyond the Shannon entropy of that message.

However, the definition of entropy, as given in the proof in Equation 3 of the manuscript, is the unconditional entropy. That is, it only works as a minimum length for messages where the probability of one symbol does not depend on the symbols before it. In cases where one letter informs other letters, as in natural language, the fully conditional entropy is the lower bound.
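
To make the distinction concrete, here is a minimal Python sketch. It is my own illustration, not the author's code: the function names and the toy message are assumptions. It compares the unconditional (order-0) entropy of a message with the entropy of each symbol given the one before it.

```python
import math
from collections import Counter


def order0_entropy(text):
    """Unconditional entropy in bits per symbol (symbols treated as independent)."""
    counts = Counter(text)
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def conditional_entropy(text):
    """Entropy of a symbol given the previous symbol, in bits per symbol."""
    pairs = Counter(zip(text, text[1:]))   # counts of (previous, current) pairs
    firsts = Counter(text[:-1])            # counts of the 'previous' symbol
    n = len(text) - 1
    h = 0.0
    for (prev, _cur), c in pairs.items():
        h -= c / n * math.log2(c / firsts[prev])
    return h


# Toy message (hypothetical): each symbol is fully determined by the one before it.
message = "ab" * 20
h0 = order0_entropy(message)
h1 = conditional_entropy(message)
print(f"order-0: {h0:.3f} bits/symbol, conditional: {h1:.3f} bits/symbol")
```

In this toy case the order-0 figure suggests one bit per symbol is needed, while the conditional figure shows that almost nothing is needed once the previous symbol is known. Natural language is far less extreme, but the gap runs in the same direction.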

Using the author's own reference [9, Information Theory, Inference, and Learning Algorithms by D. MacKay], this is shown in the definition of the joint entropy in equation 2.38 on page 33. Alternatively, pp. 16-18 of Elements of Information Theory, 2nd ed., by Cover and Thomas also explain this.

This is demonstrated simply in practice: Consider the file that the author uses, Alice29.txt of the Canterbury Corpus. The author claims that "the optimal message encoding length [is approximately] 671214 bits", and that he can reduce this to 437611 bits. The .zip version of the Alice29.txt file available on the Canterbury Corpus is a compressed 435168 bits. To me this implies that the author has recreated a compression method that is already in common use by existing software.
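
The same comparison can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample text is a made-up, highly repetitive stand-in (not Alice29.txt), so the gap is exaggerated, and `order0_bits` is a hypothetical helper name. It compares the order-0 entropy "bound" with the size produced by zlib's DEFLATE, the kind of method typically used inside .zip files.

```python
import math
import zlib
from collections import Counter


def order0_bits(data: bytes) -> float:
    """Total bits implied by the unconditional (order-0) entropy of the bytes."""
    counts = Counter(data)
    n = len(data)
    return -sum(c * math.log2(c / n) for c in counts.values())


# Hypothetical stand-in text; a real test would read the corpus file instead.
sample = b"the quick brown fox jumps over the lazy dog. " * 200

bound_bits = order0_bits(sample)
deflate_bits = 8 * len(zlib.compress(sample, 9))

print(f"order-0 entropy 'bound': {bound_bits:.0f} bits")
print(f"zlib (DEFLATE) output:   {deflate_bits} bits")
```

Any compressor that exploits dependence between symbols can land below the order-0 figure, which is why beating that figure is not, by itself, evidence of a new method.
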
I would like to offer some words of encouragement for future research, however: This manuscript is very well written, is clear, complete, and has very few grammatical issues. Now that the author has the source code written for a compression method that is already good enough for commercial software, there is hope for some novel improvements. Specifically, the author mentions in the discussion section a dictionary approach that would allow for encoding of whole words (perhaps including the spaces after them). If a message can be detected as being natural language, this could bring the substantial improvements advertised. I would recommend seeing the British National Corpus for the relative frequencies of words and working from there.

#### Paper 3 - An applications paper on meteorology.

This paper wasn't a statistics paper. It was a meteorology paper that happened to use some statistics, and not very well.

Prediction of Daily Reservoir Inflow using Atmospheric Predictors supplies a novel way to predict the inflow into a water reservoir on a daily basis, rather than a seasonal one, using weather information.
However, I feel that there is not enough statistical rigour or novelty to warrant publication in the Open Journal of Statistics. I would encourage the authors to submit this manuscript to the Open Journal of Modern Hydrology, also under the Scientific Research Publishing administration, at http://www.scirp.org/journal/ojmh/ .

Details on issues, corrections, and improvements follow:

In the introduction, a GLM is described as 'a classical structure weather generator within a…'. A Generalized Linear Model is NOT a weather generator, simulated or otherwise. It is a generalization of a linear regression to allow it to be used for binary or count data.

In equation (6), you have described the log-likelihood, not the likelihood. Also, the log-likelihood should be written as

ln( f(y; theta) ) = sum(i=1..n) ln( f_i(y; theta) )

… note that the ln() function carries through into the sum.

This doesn't change anything else that you're doing, because maximizing the log-likelihood is the same as maximizing the likelihood. I would describe your statistical model as such:

First, the determination of the day as wet or dry is a binary response, so you have applied a logistic regression (that's the name for a GLM for a yes/no variable). Second, you have modelled the amount of rain as coming from an exponential distribution.

You describe the amount of rainfall as a gamma distribution, but the gamma has two parameters, a mean parameter mu, which you have, and another. Is the other parameter fixed, or is it also estimated? See https://en.wikipedia.org/wiki/Gamma_distribution

Is it the exponential distribution, a special case of the gamma, that you are fitting? See: https://en.wikipedia.org/wiki/Exponential_distribution
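
A small Python sketch may help with the questions above. It rests on assumptions of my own: the rainfall amounts are made up, the gamma is parameterized by a shape and a mean, and none of the function names come from the manuscript. It shows that a gamma with its shape fixed at 1 is the exponential, that the log-likelihood is a sum of log-densities, and that maximizing the log-likelihood picks the same parameter as maximizing the likelihood.

```python
import math


def gamma_logpdf(y, shape, mean):
    """Log-density of a gamma distribution parameterized by shape and mean."""
    scale = mean / shape
    return ((shape - 1) * math.log(y) - y / scale
            - shape * math.log(scale) - math.lgamma(shape))


def exponential_logpdf(y, mean):
    """Log-density of an exponential distribution with the given mean."""
    return -math.log(mean) - y / mean


def log_likelihood(ys, shape, mean):
    """The ln() carries into the sum: one log-density term per observation."""
    return sum(gamma_logpdf(y, shape, mean) for y in ys)


rain_mm = [0.4, 1.2, 3.5, 0.9, 7.1, 2.2]  # made-up wet-day amounts

# Shape fixed at 1: the gamma log-likelihood matches the exponential one.
ll_gamma_shape1 = log_likelihood(rain_mm, 1.0, 2.0)
ll_exponential = sum(exponential_logpdf(y, 2.0) for y in rain_mm)

# Likelihood (a product) versus log-likelihood (a sum), here with shape 2.
likelihood = math.prod(math.exp(gamma_logpdf(y, 2.0, 2.0)) for y in rain_mm)

# A coarse grid search over the mean gives the same answer on either scale.
grid = [0.5 * k for k in range(2, 11)]  # candidate means 1.0, 1.5, ..., 5.0
best_by_loglik = max(grid, key=lambda m: log_likelihood(rain_mm, 1.0, m))
best_by_lik = max(grid, key=lambda m: math.exp(log_likelihood(rain_mm, 1.0, m)))
print(best_by_loglik, best_by_lik)
```

If the second gamma parameter is estimated rather than fixed, the shape would simply become another argument to optimize over alongside the mean.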

In Table 1, would this information be better represented by a Q-Q plot? When you are trying to test if two variables have the same distribution, you can also use a goodness-of-fit test such as Anderson-Darling. Have a look at the R package gof, or anything derived from it that's more up-to-date. See:  http://www.inside-r.org/packages/cran/gof
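
For readers who want to see what goes into a Q-Q plot, here is a minimal sketch in Python using only the standard library. The data are made up and `qq_pairs` is a hypothetical helper, not part of any package mentioned above. It pairs up matching quantiles of the observed and simulated values; plotting one column against the other gives the Q-Q plot.

```python
import statistics


def qq_pairs(observed, simulated, n_quantiles=10):
    """Return (observed quantile, simulated quantile) pairs.

    Points close to the line y = x suggest the two samples share a distribution.
    """
    q_obs = statistics.quantiles(observed, n=n_quantiles, method="inclusive")
    q_sim = statistics.quantiles(simulated, n=n_quantiles, method="inclusive")
    return list(zip(q_obs, q_sim))


# Made-up daily rainfall amounts (mm) for illustration only.
observed = [0.2, 0.5, 0.9, 1.4, 2.0, 2.7, 3.9, 5.1, 7.4, 11.0]
simulated = [0.3, 0.4, 1.1, 1.3, 2.2, 2.5, 4.2, 5.0, 6.8, 12.5]

for q_obs, q_sim in qq_pairs(observed, simulated):
    print(f"{q_obs:6.2f}  {q_sim:6.2f}")
```

A formal goodness-of-fit test would put a number on the visual impression, but the quantile pairs alone often show where the two distributions part ways, for example in the upper tail.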

In Figure 4, it is unclear which line represents the observed rainfall. Can you remake this Figure 4 in R, like Figures 2 and 3, and show the observed rainfall as a line, and, say, the range of the middle 90% or 95% of your simulations as a band?
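
The band itself is simple to compute. Here is a sketch in Python (the request above was for R; this is only to show the arithmetic). It assumes each simulation is a list with one value per day; the simulated values and the name `simulation_band` are made up for illustration.

```python
import random
import statistics


def simulation_band(simulations, lower_pct=5, upper_pct=95):
    """For each day, return the (lower, upper) percentiles across simulations.

    With the defaults this is the middle 90% of the simulated values.
    """
    band = []
    for day_values in zip(*simulations):  # regroup: all runs for one day
        cuts = statistics.quantiles(day_values, n=100, method="inclusive")
        band.append((cuts[lower_pct - 1], cuts[upper_pct - 1]))
    return band


# Made-up example: 200 simulated runs of 7 days each.
rng = random.Random(0)
simulated_runs = [[rng.expovariate(1 / 3.0) for _day in range(7)]
                  for _run in range(200)]

for day, (low, high) in enumerate(simulation_band(simulated_runs), start=1):
    print(f"day {day}: {low:.2f} to {high:.2f} mm")
```

Drawing the observed series as a line on top of this band makes it immediately visible on which days the observations fall outside what the simulations consider plausible.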

I wish you the best of luck in future submissions.

#### Paper 4 - Applications Paper on Online Marketing

A paper with solid methodology, and with some intriguing but unrealized potential.

There are a few points that confuse me, and it may be an oversight or an error in the typesetting.
Should log_2 = P(A,B) / P(A)P(B) be something like I = log_2( P(A,B) / (P(A)P(B)) ), where I is the information measure?

It cannot possibly be the formula that is shown, because log_2 on its own isn't a meaningful value. Log_2 of what?
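
For reference, the form I am suggesting can be sketched as follows in Python. This is my reading of what the formula should be, not something confirmed by the manuscript, and the probabilities are made up.

```python
import math


def pointwise_mutual_information(p_ab, p_a, p_b):
    """I = log2( P(A,B) / (P(A) P(B)) ), measured in bits."""
    return math.log2(p_ab / (p_a * p_b))


# Independent events: the joint probability equals the product, so I = 0.
print(pointwise_mutual_information(0.25, 0.5, 0.5))

# Events that always occur together: the joint exceeds the product, so I > 0.
print(pointwise_mutual_information(0.5, 0.5, 0.5))
```

Written this way, the measure has a clear reading: zero when A and B are independent, positive when they co-occur more often than independence would predict, and negative when they co-occur less often.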

“The evaluation data . . . 1471 . . . 1800”. What evaluation metric is being used? AIC, or log-likelihood, for example, will give a larger number for a larger dataset, so the number itself doesn't tell much without context. What do those numbers mean?
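
To illustrate why such a number needs context, here is a small Python sketch. The scores, the normal model, and its parameters are all made up and have nothing to do with the manuscript's data. It evaluates the same model on a dataset and on that dataset repeated twice.

```python
import math


def normal_log_likelihood(ys, mean, sd):
    """Log-likelihood of independent normal observations."""
    return sum(-0.5 * math.log(2 * math.pi * sd ** 2)
               - (y - mean) ** 2 / (2 * sd ** 2) for y in ys)


def aic(log_lik, n_params):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_lik


scores = [2.1, 1.9, 2.4, 2.0, 1.6]  # made-up evaluation data

ll_small = normal_log_likelihood(scores, 2.0, 0.5)
ll_large = normal_log_likelihood(scores * 2, 2.0, 0.5)  # same data, twice over

print(f"n = 5:  log-likelihood {ll_small:.3f}, AIC {aic(ll_small, 2):.3f}")
print(f"n = 10: log-likelihood {ll_large:.3f}, AIC {aic(ll_large, 2):.3f}")
```

The fit is no better or worse in the second case, yet the magnitude of both numbers grows simply because there are more observations. That is why values like 1471 and 1800 cannot be compared without knowing the metric and the sample sizes behind them.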

Likewise, in Table 1, an emotional evaluation score is given for each store. Is a higher score good?

Does the possible score range from 0 to 1000? Is 500 neutral? Are the differences in the values of the two stores statistically significant?

I would have been interested in seeing an example of the original Chinese text being broken down and analyzed. How does text analysis parse Chinese characters? To me, the most valuable part of this paper is that sentiment analysis research is being applied to Chinese text.

In general, more definitions and clarifications are necessary. What does TD-IDF stand for and why does it work?

Overall, I think the research itself is good and fit to be published.