Friday 28 September 2018

Open Reviews 3 - Open Journal of Statistics 2015

 This was the third paper I reviewed for the predatory publisher Scientific Research Publishing's Open Journal of Statistics.

The manuscript was a simulation study of a new computational method. It was the first paper I had reviewed in three years, as I had been otherwise swamped in coursework. You can see from the extensive writing feedback I gave that I still wasn't clear on the difference between the roles of copy editor and reviewer. By word count, the review was a third as long as the manuscript itself.




Open Reviews 1: Two Meta-Psychology Papers
Open Reviews 2: Open Journal of Statistics 2012

Nonnegative Matrix Factorization with Zellner Penalty (OJS)
Matthew A. Corsetti, Ernest Fokoué, Submitted 2015.
Recommended: Accept with minor revisions. Accepted.


Abstract
Please consider replacing “not much statistical tools” with “not many statistical tools”.
Keywords

May I suggest “goodness of fit”?  How about “outlier detection”?

1.       Introduction

Page 1
Consider replacing “would not be reliable and it will be misleading” with “can be unreliable or misleading.” (Like most of my replacement suggestions, this says the same thing in fewer words.) Also consider replacing “The concept of Data …. in different fields.” with “Data Quality is a topic of growing interest to researchers in many fields.”

Page 2
First two paragraphs seem obvious. The definition of Data Quality is good for formalism, but almost all your readers will already feel that data quality is important and don’t need to be convinced of this.

Either replace “Mahalanobis distance, is often” with “Mahalanobis distance, for example, is often”, or get rid of the comma.

“As discussed on Olsen [9], the…” feels like it should be its own paragraph.
Should “the data rule analysis puts rules” end in “puts IN rules” or “DEFINES rules”?  As is, it’s hard to read.

Please replace “especially the medical one” with “especially the medical field”.
In “aspect the data represent”, you need either “aspects” or “represents” for good grammar.
Please replace “time consuming especially” with “time consuming, especially”


Page 3
Please replace “Besides, usually in applications… observations again.” With “Besides, researchers often are confronted with data already gathered. They have no chance to re-measure the observations.”

Please replace “it seems more practical to study the data … auxiliary information.”
With “it is more practical to study the data quality based only on available information.”
Is the last sentence about motivation necessary?

2.       A proposed indicator of data quality
Consider replacing “the greater the difference … quality of the observation.” With “quality decreases with increasing distance between Xij and X*ij.”

Page 4
If you make a note near equation (2.3) that qi is the sample mean of qij for a given i, it would clarify things, and you might not even need (2.3).

In the example given in (2.2) and (2.3), a data set with many small errors would be given a higher quality than a data set with a massive error. Is this intentional?

I don’t understand the sentence “It makes sense that in … between variables [8].”
Please replace “the quality of say” with “the quality of a value such as”.  It’s more formal this way.

For “It would be expected now that”,  drop the ‘now’

And for “be somewhat close to”, drop the ‘somewhat’.  ‘Somewhat’ is a vague word and shouldn’t appear anywhere in the final publication.

For the three distances, instead of introducing a new vector (y1, y2), use the Xij values already defined. That way you spare the trouble of double notation and only defining two-dimensional distance measures.
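
To illustrate the point about notation, here is a minimal sketch of distance measures written for rows of any length rather than for a two-dimensional (y1, y2) pair. It assumes the three distances are Euclidean, Manhattan, and Chebyshev; only Chebyshev is named elsewhere in this review, so the other two are my guess, and the function names and example rows are hypothetical.

```python
import math


def _check_lengths(x, y):
    # Both rows must have the same number of variables.
    if len(x) != len(y):
        raise ValueError("rows must have the same length")


def euclidean(x, y):
    """Square root of the summed squared differences."""
    _check_lengths(x, y)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Sum of the absolute differences."""
    _check_lengths(x, y)
    return sum(abs(a - b) for a, b in zip(x, y))


def chebyshev(x, y):
    """Largest absolute difference over all variables."""
    _check_lengths(x, y)
    return max(abs(a - b) for a, b in zip(x, y))


# Hypothetical rows with three variables each (any length works).
row_i = [1.0, 2.0, 3.0]
row_k = [4.0, 6.0, 3.0]

print(euclidean(row_i, row_k))  # 5.0
print(manhattan(row_i, row_k))  # 7.0
print(chebyshev(row_i, row_k))  # 4.0
```

Because each function loops over the paired entries of two rows, the same definition covers two variables or twenty, which is the saving in notation I was suggesting.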

Page 5

Replace EVERYTHING from  “Several methods…” to “… where p = 20%, 30%, 40%, or 60%” with this:

“Using any of these distance measures, a ‘Nearest Neighbour’ to i can be considered to be any value with distance to i less than the…

1) Mean

2) Median or other quantile

3) Median – k*s  for some 0.5 < k < 2 and standard deviation s
…of the distances of all observations from i.”

It says everything from that section in far fewer words, which will save your reader time and you money.
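
The three cutoff rules in that suggested wording can be sketched in a few lines. This is only an illustration of my rewording, not the authors' code: the function names, the linear-interpolation quantile, and the example distances are all assumptions of mine.

```python
import statistics


def quantile(values, q):
    """Quantile by linear interpolation; q = 0.5 gives the median."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def neighbour_cutoff(distances, rule="mean", q=0.5, k=1.0):
    """Cutoff below which an observation counts as a neighbour of i."""
    if rule == "mean":
        return statistics.mean(distances)
    if rule == "quantile":
        return quantile(distances, q)
    if rule == "median_minus_ks":
        # s is the sample standard deviation of the distances.
        return statistics.median(distances) - k * statistics.stdev(distances)
    raise ValueError(f"unknown rule: {rule}")


def neighbours(distances, cutoff):
    """Indices of the observations whose distance to i is below the cutoff."""
    return [idx for idx, d in enumerate(distances) if d < cutoff]


# Hypothetical distances from observation i to five other observations.
dist_from_i = [2.0, 4.0, 5.0, 6.0, 13.0]

print(neighbours(dist_from_i, neighbour_cutoff(dist_from_i, "mean")))
print(neighbours(dist_from_i, neighbour_cutoff(dist_from_i, "quantile", q=0.5)))
print(neighbours(dist_from_i, neighbour_cutoff(dist_from_i, "median_minus_ks", k=0.6)))
```

With these made-up distances the mean rule keeps three neighbours, the median rule keeps two, and the median minus 0.6 standard deviations keeps one, so the choice of rule visibly changes the neighbourhood.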

Consider replacing “Now after” with “After”.

When you have something like “between Xij”, it helps to put a reminder of what that symbol is. Here it’s “between the observed value Xij”.

Regarding “there is a need to calculate a distance measure”, this was confusing because we’d just defined distance measures. Could you emphasize the difference with something like “calculate a univariate distance measure”?

When naming a list like (Dij, Dij1, Dij2, …) I recommend starting with 1, making the list (Dij1, Dij2, Dij3, … or qi1, qi2, qi3,…), otherwise the first distance measure is set aside from the rest of the list for some reason.

Page 6
 Is equation (2.11) necessary? You’ve already told us it’s the mean.

3.       Multivariate Normal Case - Page 7
Where did the variance covariance matrix come from (capital Sigma)? Is it randomly generated?

Replace “ ~U6 ~(0,40) ” with “U6 ~ N(0,40) . ”, note the period.

Please replace “and qi*” with “and quality indicator qi*”
Regarding “R version 2.13.0”, is your code available for replication by others? If so, does it have a set.seed() command so that others will get exactly the same numbers as you did?
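
For readers unfamiliar with why the seed matters, here is a small sketch of the idea. The manuscript used R, where set.seed() does this job; the Python below is only an analogue, and the function name, seed values, and simulation sizes are hypothetical.

```python
import random


def simulate_means(seed, n_datasets=5, n_obs=100):
    """Return the sample mean of each simulated data set.

    A private generator seeded here keeps the run independent of any
    other code that touches the global random state.
    """
    rng = random.Random(seed)
    return [
        sum(rng.gauss(0.0, 1.0) for _ in range(n_obs)) / n_obs
        for _ in range(n_datasets)
    ]


run_a = simulate_means(seed=2015)
run_b = simulate_means(seed=2015)
run_c = simulate_means(seed=2016)

print(run_a == run_b)  # same seed, identical numbers
print(run_a == run_c)  # different seed, different numbers
```

Publishing the seed alongside the code is what lets another researcher reproduce the tables digit for digit instead of only approximately.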

Page 8

For the entries in Table 1, was a new batch of 2000 data sets built for each data set, or are these values describing the same batch of datasets?

Try replacing ”(2.10). Also this method was investigated to see . . . .  (2.13) till (2.17).” with
“(2.10), or by changing the quality indicator from (2.12) to one of (2.13) through (2.17).”

3.2   Multiple Linear Regression case. Page 10.

Please replace “For this case . . . . Models 3,4, and 5.” With “Three regression models, Models 3,4, and 5, were considered.”

Page 11

In the final draft, consider emphasizing or bolding the best value or notable values in large tables like Table 5. Also, consider dropping anything beyond three or four significant figures.  The reader gains nothing from knowing the best MSE is 0.01788029 that they wouldn’t from seeing it at 0.01788.
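
Trimming to a fixed number of significant figures is easy to automate before building a table. A minimal sketch follows; the helper name is hypothetical, and the only number used is the MSE quoted above.

```python
def round_sig(value, sig=4):
    """Round a number to `sig` significant figures."""
    # The 'g' format keeps `sig` significant digits.
    return float(f"{value:.{sig}g}")


print(round_sig(0.01788029))  # 0.01788
```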

Regarding “distance measure given by (2.9)”, I think your equation references are off. It should be (2.7).

Same with “indicator given by (2.14)” a few lines later.

Page 12
Chebyshev has a few cases where its MSE is thousands of times larger than any other case. Is this because Chebyshev will get quality indicators that are extremely far off from the rest on rare occasions?

Page 13
“for this last case was calculated for n=100 and it was…” Is this for 2000 replications again?
3.4   Summary of the results of the simulation study

Page 14
Where did section 3.3 go? This section starts with “Section (3.3) presented a…”, but I think it means Section (3.2).
You have a table of the best methods for Models 2, 4, and 5, but the reader has almost no knowledge of what these models are, besides linear regressions.

Page 15
What is total error? Is it the sum distance between the measured value Xij* and the true value Xij?

The first paragraph (after Table 12 and before 4. Application) confuses me. Are you saying that the quality indicators rarely exceed one and could be set to have a maximum of one for application purposes?  Try “Also, in simulation studies, the proposed quality indicator was between 0 and 1 in ____ of 2000 runs, and never exceeded 1.1.  If it aids interpretation, values of the quality indicator greater than one can be set to one.”

If so, it seems unorthodox. If there is no justification for this truncation, I recommend dropping this part.

Page 16

Are you using the recommendations of Table 11 for the data in Table 13?  Did you try other versions to see if higher correlation coefficients could be found?  Is a linear correlation appropriate given that quality scores are a transformation? (Spearman’s correlation is a good move here)  Would finding a positive correlation with distance, ignoring quality indications completely, be easier?  That would make this more like other assessment values like the AIC and MSE in which lower is better with a bottoming out at zero.
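
To show why the linear-correlation question matters, here is a sketch comparing Pearson's and Spearman's coefficients when one variable is a monotone but non-linear transformation of the other. The exp(-d) transformation and the distance values are purely hypothetical stand-ins; the manuscript's actual quality formula is not reproduced in this review.

```python
import math


def pearson(x, y):
    """Pearson's linear correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            result[order[j]] = average_rank
        pos = end + 1
    return result


def spearman(x, y):
    """Spearman's correlation: Pearson's coefficient computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical distances and a hypothetical monotone quality score.
distance = [0.1, 0.5, 1.0, 2.0, 4.0]
quality = [math.exp(-d) for d in distance]

print(pearson(distance, quality))   # between -1 and 0: the relation is not linear
print(spearman(distance, quality))  # -1.0: the ordering is perfectly reversed
```

Spearman's coefficient depends only on the ordering, so it is unchanged by any monotone transformation of the scores, whereas Pearson's coefficient is pulled toward zero by the curvature.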

Should be “5. Conclusions”, not “4. Conclusions”.  The rest of the conclusion looks good.

Page 17

Who are the three colleagues at Cairo University Educational Hospitals?

Finally, this is useful-looking work, and I want to see it in a journal, just not without some changes first. Good luck in future rounds.
