## Friday, 21 September 2018

### Open Reviews 2 - Open Journal of Statistics 2012

These were the first two papers I reviewed for Scientific Research Publishing's Open Journal of Statistics, back in 2012. The first is 'An Exceptional Generalization of the Poisson Distribution', and the second is 'A Proposed Statistical Method to Explore Quality of Quantitative Data'.

It was both the first year that OJS was publishing and the first year that I was reviewing under my own name. (Like many graduate students, I had previously conducted a couple of reviews for my supervisor at the time.) I didn't know exactly what to focus on, so these first reviews are more like copy-editing notes than typical academic reviews. I really tried to add value to the manuscripts.

My criterion essentially was 'is this ready to be published at all?' rather than 'is this good enough and relevant enough to be at this journal?'.

An Exceptional Generalization of the Poisson Distribution
Per-Erik Hagmark, Open Journal of Statistics, 2012, 2, 313-318. DOI: 10.4236/ojs.2012.23039
https://www.scirp.org/Journal/PaperInformation.aspx?PaperID=20668

This manuscript was a theoretical extension to the Poisson distribution for continuous data. I recommended to the editor 'accept/minor revisions'.

------------------

Abstract:

- What exactly does ‘natural-shaped’ mean?  Fitting natural phenomena?

- In “any generally possible…”, “generally possible” seems vague. Would “any possible” suffice?

- “is an immediate application topic” could be shortened to “is possible.” The application is implied, I feel.

1. Introduction and the main result

- Good job putting the main result early; I thought that was a good choice.

- By “different discrepancies” I think you mean “a variety of discrepancies”. ‘Different discrepancies’ implies that the Poisson distribution has its own discrepancies, but real data give different ones.

- References like  typically can be injected without building grammar around them.  For example “ , .” should be “.”

- “process: see e.g. [2,3,4,5].” should be  “process [2,3,4,5].”

- Equation 1: On page 5 you have this equation but in the opposite order, consider having it start with “sigma^2 < “ like it does at the start of Sec. 5.  Also, it should end in a comma.

- “parameter values . 7” to “parameter values .”

- “Mathematically inherent count model”  I don’t know what this means. Do you mean “mathematically coherent”?

- I keep thinking of N(mu,beta) as something from the normal distribution, but I see that it’s meant to emphasize that its values belong in the naturals. I can’t think of a better symbol or letter than N, but I wanted to mention the initial confusion.

(page 2)

- Equation 2: Pr(N(mu,beta < n)) should be Pr(N(mu,beta)< n)

- Equation 2 should end in a comma.

- “where … are the gamma probability density resp. cumulative distribution” to “where … are the gamma probability density and cumulative distribution functions, respectively.”

- “Our presentation begins with” to “We begin with a”

- The last sentence of Section 1 could be changed to “In the last section, we compare N(mu,beta) against well-established Poisson generalizations.”

2. Derivation of two fundamental inequalities

- Commas after every line of math here except the last, which already has a period.

- “see e.g. .” Can simply be “.”

- “these simple observations adopts” to “these observations imply”.

(page 3)

- “These invariants … significantly:” possibly to “With additional work, we can imply the stronger convergences” (no colon at end).

- “Indeed, integration by parts, the basic function equations” to “Integration by parts, the equations”.

- “formula (3), l’Hospital’s rule, and other routines allow” to “formula (3) and l’Hospital’s rule allow”.  (If the routine or method isn’t important enough to have a name, it’s probably not worth mentioning at all. Save yourself a bit of word count.)

3. A mean-preserving discretization

- I was confused by equation 8 until I realized that you were defining the discretization and that Pr(N < n) is not Pr(X < n).  If other people have this confusion, I recommend changing “with the cumulative probabilities” to “with the cumulative probabilities equal to the mean of the CDF F on (n,n+1)”, or some other plain-language justification for the discretization.
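
As a plain-language check on the discretization comment above, here is a minimal sketch of why averaging the CDF over unit intervals preserves the mean. It assumes the cumulative probabilities are defined as Pr(N <= n) = the mean of F over (n, n+1); the direction of the inequality, the exponential example, and all function names are my assumptions, not taken from the manuscript.

```python
import math

def simpson(f, a, b, steps=200):
    """Composite Simpson's rule on [a, b]; steps must be even."""
    h = (b - a) / steps
    total = f(a) + f(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

def discretize(cdf, n_max):
    """Return [Pr(N <= n) for n = 0..n_max], each the mean of cdf over (n, n+1)."""
    return [simpson(cdf, n, n + 1) for n in range(n_max + 1)]

def mean_from_cumulative(cum):
    """E[N] as the sum over n >= 0 of Pr(N > n), truncated at len(cum) terms."""
    return sum(1.0 - c for c in cum)

# Hypothetical example: an exponential X with mean 2.5 (not from the manuscript).
MEAN_X = 2.5

def exp_cdf(x):
    return 1.0 - math.exp(-x / MEAN_X)

cum = discretize(exp_cdf, n_max=200)
mean_n = mean_from_cumulative(cum)
print(mean_n, MEAN_X)  # the two values should agree closely
```

Under that assumption the sum of Pr(N > n) telescopes into the integral of 1 - F over the positive half-line, which is E[X] for a non-negative X. That one sentence may be all the justification a reader needs.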

4. A generalization of the Poisson model

- “Indeed, one can first” to “We”

- “then Chebyshev’s inequality” to “then apply Chebyshev’s inequality”.

- By “x is both the mean and the variance” do you mean “x is IN both the mean and variance”?  Because I think the mean of the gamma distribution is tx and the variance is tx^2.

- Equation 13: I would prefer to see “&” as “and” so at a glance there is a break in the symbols and I can tell more easily that equation (13) is two equations.

- Instead of writing “H’(x,u) > 0 (x-derivative)” you may want to try a delta/delta x before “H(x,u)” instead (or a d/dx; the idea is that you get more precision for less space).

(page 5)

- Change equation 17 to end in a comma and “This proves (i)” to “proving (i)”.
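
On the gamma question in the Section 4 notes above: with shape t and scale x, the gamma distribution has mean tx and variance tx^2, while with shape x and scale 1 the mean and the variance are both x. Which parameterization the manuscript uses is not something these notes settle, so the sketch below simply checks both cases by simulation. The parameter values, sample size, and helper name are illustrative assumptions.

```python
import random

def gamma_sample_moments(shape, scale, n=100_000, seed=0):
    """Sample mean and (population-style) variance of n gamma(shape, scale) draws."""
    rng = random.Random(seed)
    draws = [rng.gammavariate(shape, scale) for _ in range(n)]
    mean = sum(draws) / n
    variance = sum((d - mean) ** 2 for d in draws) / n
    return mean, variance

# Shape t = 3, scale x = 2: theory gives mean t*x = 6 and variance t*x**2 = 12.
mean_a, var_a = gamma_sample_moments(shape=3.0, scale=2.0)

# Shape x = 4, scale 1: theory gives mean = variance = x = 4.
mean_b, var_b = gamma_sample_moments(shape=4.0, scale=1.0)

print(round(mean_a, 2), round(var_a, 2))
print(round(mean_b, 2), round(var_b, 2))
```

If the manuscript intends the second parameterization, “x is both the mean and the variance” is correct as written, and stating the parameterization explicitly would remove the ambiguity.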

5. Full dispersion flexibility

- “We are left with … : Given” to “Property (iii), Sec. 1 remains to be proved. Given”

- I don’t understand Figure 2. All I see are two graphs with vertical lines at 1 and different scales. Are there lines or points mu and sigma min that aren’t showing up on my pdf? Could the mu and sigma information be shown in a table?  Is “Beeta” supposed to be “Beta”?  This could be a computer issue on my part.

- Equation 21 should end in a comma.

- “and changing variable” (z = Bt) to either “making the change of variable z = Bt” or “changing the variable z to Bt”.

6. Computing and applications.

- The text before equation (23) could be shortened to “When working with N(mu,beta), the following numbers are useful: “

- Equation 23 should end in a period.

- “(13); note“ to “(13). Note”

- “software offer” to “software offers”

- “shortens basic formulas:” to “shortens the basic formulas.”

- Equations 24 and 25 should end in commas.

- I would prefer Table to be Table 1, even though there is only 1 table. Just for parallel structure.

(page 8)

- In the remark, aren’t mu and sigma the population mean and variance, not the “sample mean, standard variance”?

- “Perhaps needless to mention, but of course” to either “Recall that” or “Note that”

- “generally does not” to “typically does not”, because of what ‘generalized’ often implies.

7. Discussion and further research.

- First sentence to “The distribution N(mu, beta) has more applications than those mentioned.”

- “explored more” to “further explored”.

- “Here we share … reader” to “Two problems to be addressed in the future follow.”, unless I’m misunderstanding your intent here.

- Quality (c) could be made more exact. Could “simple” be replaced with “intuitive”, or “no curved”?

- Quality (d) seems subjective, but I admit to not understanding what ‘natural-shaped’ means.

- “add a new parameter” to “introduce a new parameter”

- I would remove “We are aware of only one respectable exception:”.

- “: the general Poisson law , which…  …both lack (a): the” to “The general Poisson law  meets (b,c,d), but neither it nor the COM-Poisson distribution meet (a). The”

- Regarding the last sentence about the mean being a special function, is this true of the general Poisson, the COM-Poisson, or both?

- “It may also be mentioned” to “Also note”.

----------------------------------------

A Proposed Statistical Method to Explore Quality of Quantitative Data
Y.M. Ibrahim and A.M. Moussa, submitted to the Open Journal of Statistics in 2012.
https://www.scirp.org/journal/OJS/

This manuscript was a computational method for... something? The authors were pretty opaque about what exactly was being done, and what was being accomplished. I recommended a 'reject/major revisions'. I was skeptical that OJS would actually take my recommendation seriously, as they were probably having a hard time finding submissions, and I wasn't yet sure whether they were a predatory journal.

General:

I like the ideas here; it’s a clever and flexible way to assess goodness of fit and identify problems.  Its simplicity is a merit and it appears to lend itself well to software development.  However, I feel this research is not yet ready for publication.

The research methods feel hidden to the reader.

Rather than using five hand-picked models and describing two of them in detail, your method would garner more confidence if you had a method of generating models and provided a way for the reader to recreate the models that you used in R.

This sort of transparency is especially important because these results rely entirely on simulation rather than proof.

Somewhere, the manuscript should state explicitly why the MSE is important. Something like this:

"You have a measure of data quality based on knowing the true values, and a low MSE shows that the quality indicators based only on the observed (training) data are close to the quality indicator that has the true value information. We only have the measured values in reality, so we need to find a combination of distance measures and nearest neighbours rules that will let us determine data quality much as we would if we had the true values.

We will simulate data modeled after common distributions, such as multivariate normal and multiple linear regression, find the best set of rules for these situations, and apply those rules to assess data quality when applying those models."

Without something like this, it's very easy to lose why the MSE is so important in these simulations.
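
To make the role of the MSE concrete, here is a minimal sketch of the comparison described above. Every number and name in it is invented for illustration and none comes from the manuscript; only the comparison logic matters: score each candidate rule set by how closely its observed-data quality scores track the scores computed with the true values.

```python
def mse(estimated, reference):
    """Mean squared error between two equal-length sequences of scores."""
    if len(estimated) != len(reference):
        raise ValueError("sequences must have the same length")
    return sum((e - r) ** 2 for e, r in zip(estimated, reference)) / len(reference)

# Hypothetical quality scores for five records, computed with the true values known.
quality_true = [0.90, 0.75, 0.60, 0.95, 0.40]

# Hypothetical scores from two rule sets that only see the observed values.
quality_by_rule = {
    "rule_set_a": [0.88, 0.70, 0.65, 0.93, 0.45],
    "rule_set_b": [0.70, 0.90, 0.40, 0.80, 0.60],
}

# Lower MSE means the observed-data scores track the true-value scores more closely.
errors = {name: mse(scores, quality_true) for name, scores in quality_by_rule.items()}
best_rule = min(errors, key=errors.get)
print(best_rule, errors)
```

In a simulation study the true-value scores are available by construction, which is why the MSE can serve as the selection criterion there even though it cannot be computed on real data.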

In the applications section, you show that the quality score is negatively correlated with measurement error. I would have liked to see a lot more detail about the application, and the possibility of using correlation with measurement-error size as an alternative rubric to MSE, especially since the correlations in the application seemed weak.

Also, have you considered a factorial or fractional factorial design to find the best combination of distance measures (Euclid, Manhattan, Chebyshev)/nearest neighbours/dij measures/quality indicators? You could be missing interaction effects or hitting local optima without one.
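
As a rough sketch of what the full factorial suggestion could look like, the code below enumerates every combination of factor levels. The three distance measures are the ones named above; the neighbour counts and indicator labels are hypothetical placeholders, since the manuscript's actual levels are not given here.

```python
import math
from itertools import product

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    """Largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))

# Factor levels. Neighbour counts and indicator labels are placeholders.
factors = {
    "distance": [euclidean, manhattan, chebyshev],
    "neighbours": [1, 3, 5],
    "indicator": ["indicator_1", "indicator_2"],
}

# Full factorial design: one run per combination of levels.
design = [dict(zip(factors, levels)) for levels in product(*factors.values())]

for run in design[:3]:
    print(run["distance"].__name__, run["neighbours"], run["indicator"])
print(len(design))
```

Each run would then be scored (for example by MSE) and the scores analysed with interaction terms included. A fractional factorial design would use a structured subset of these runs when the full grid is too expensive.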