Statistics et al.: When to use "the" or "a" in scientific writing

A.K.A.: "The", the definitive definite article article.

"The", while making up about 7% of all written and spoken English words, is the hardest word to get right. The rules surrounding "the" are so difficult to define that comprehensive dictionaries can spend 5 or more pages trying...

This article is meant to be part of the preliminary reading in a 'Writing for Statistics and Data Science' course or 'Scientific Writing' course. Omitting 'the' or 'a', or putting one in the wrong place, is the most common error I've seen when grading written assignments or even when copyediting scientific journal articles.

As a teacher, it's a tough call whether to put this as an optional 'week 0' reading, or as part of a week 1 lecture. Optional work rarely gets done (although in week 0-1 maybe things are quiet enough), but as a core part of the course it's a pretty bland first impression.

Article starts after the sprawling puppy.

"The" and "a/an" are parts of speech called articles. "The" is English's definite article, which means it's used when there is a specific object being referred to, generally either because only one exists, or because one example of the item is more important than all the others.

"A" and "an" are indefinite articles. They are used to refer to any one of multiple similar items. "An" is used when the next word starts with a vowel sound (as in "an analysis", and also before a silent H, as in "an hour"), and "a" is used for everything else, including words that are spelled with a leading vowel but start with a consonant sound (as in "a unit"). Special attention is needed when referring to letters.

"The restaurant got a D grade" (D is pronounced "dee", which starts with a consonant)

"The restaurant got an F grade" (F is pronounced "eff", which starts with a vowel)

----
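For readers who think better in code, the "a" versus "an" choice can be sketched as a small Python function. This is only a rough illustration of the sound-based rule, not a complete solution: the function name `indefinite_article` and the two tiny exception lists are my own assumptions, and a real tool would need a pronunciation dictionary rather than spelling.

```python
# Letters whose spoken names start with a vowel sound ("eff", "aitch", "ell", ...).
VOWEL_SOUND_LETTERS = set("AEFHILMNORSX")

# Hypothetical, deliberately incomplete exception lists (assumptions for this sketch).
SILENT_H_WORDS = {"hour", "honest", "honor", "heir"}         # start with a vowel sound
CONSONANT_SOUND_WORDS = {"university", "unit", "user", "one"}  # start with a consonant sound


def indefinite_article(next_word: str) -> str:
    """Return "a" or "an" based on the first sound of the next word."""
    word = next_word.strip()
    if not word:
        raise ValueError("next_word must not be empty")
    # Single letters are judged by how the letter's name is pronounced.
    if len(word) == 1 and word.isalpha():
        return "an" if word.upper() in VOWEL_SOUND_LETTERS else "a"
    lower = word.lower()
    if lower in SILENT_H_WORDS:
        return "an"
    if lower in CONSONANT_SOUND_WORDS:
        return "a"
    # Fallback: guess from spelling, which is right most of the time but not always.
    return "an" if lower[0] in "aeiou" else "a"


for grade in ["D", "F"]:
    print(f"The restaurant got {indefinite_article(grade)} {grade} grade")
```

The spelling fallback on the last line of the function is the weak point; words outside the two exception lists can still be misjudged, which is exactly why the rule is stated in terms of sounds.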

Settling the difference between "a" and "an" is straightforward, but what about "a" versus "the"? Consider:

"Here is A glass of water" (This sentence implies there are multiple glasses of water, and this is just one of them.)

"Here is THE glass of water." (This implies that there is only one relevant glass of water)

"Here is THE glass of water you requested." (This says nothing about glasses of water in general. It does imply that you requested only one glass.)

General rule: Once something has already been mentioned, there is only one relevant instance of that thing, so "the" is used.

----

I ran a (1) regression tree analysis to find which variables were important. The (2) analysis showed that "age" was the (3) most important variable.

For (1), many regression tree analyses could be done and have been done, and this is just one of those many, so we use "a".

For (2), we're talking about the same analysis as in (1), but at this point it's been established that there is one relevant analysis, so we use "the".

For (3), there is only one 'most important variable', so we use "the".

General rule: "the" can refer to multiple objects, as long as it's all the objects in that collection.

After generating a (1) random forest, we found the (2) most important variables. Age was an (3) important variable, but so was income.

For (1), this random forest is just one of many that could be generated.

For (2), although there are multiple variables being referenced, we are referring to all of them, so we use "the" to help show that there are no others.

For (3), age is only one of the multiple important variables, so we use "an".

General rule: If you want to refer to multiple objects that make up only part of a collection, you can do so by prepending some indication that you're only looking at a part.

----

Half of the (1) data points are positive, and a few of the (2) points are outliers.

For (1), "the data points" refers to all of the points, and "Half of" modifies this.

For (2), "the points" is a shorter version of "the data points", and "a few of" modifies this. We have already established that they're data points, so we can shorten this to "points".

----

Most of the (1) temperature values are negative, but a few of them (2) are above 30.

For (1), there are multiple temperature values; "the temperature values" would refer to all of the values, and "most of" modifies that.

For (2), "them" is a shortcut to referring to "temperature values", which have already been established as the subject of the paragraph.

----

Some of the (1) time, I wonder about rounding errors. Sometimes, I think they (2) are overlooked.

For (1), "the time" is an abstraction, and it always needs a modifier, even if that modifier is "all of". We could shorten "Some of the time" to "Sometimes" like we did in the second sentence.

For (2), "they" is a shortcut to referring to "rounding errors". "They" refers to a plural, and there is only one plural it could refer to.

General rule: We can use "the" and "a" to imply things that would not otherwise be obvious in the writing.

----

The (1) woman who wrote this (2) R package is a (3) genius.

For (1), "The woman" tells the reader that there is only one woman who wrote this R package, even if it hasn't yet been established who that is.

For (2), "this R package" assumes that the reader knows what R package is being referred to.

For (3), "a genius" implies that more than one genius exists, and the woman is one of them.

----

In the (1) third step of the (2) optimization, the (3) parameters are updated to a (4) better fit to the (5) data.

For (1), there may be multiple steps in the optimization, but only one of them is the third one.

For (2), there may be many optimizations, but by the rest of the context, it's clear that one particular optimization is being described. We could also write "this optimization".

For (3), "the parameters" refers to all the parameters being used.

For (4), "a better fit" implies that there could be other better fits too. If it were the best fit, it would have to be "the best fit" instead of "a best fit" because "best" implies only one.

For (5), "the data" refers to all of the relevant data.

General rule: If something is obviously unique, or if something is being referred to in the abstract, usually we don't need "the" or "a".

----

Linear regression (1) is an (2) unbiased method to fit data (3); therefore, the (4) mean of the (5) residuals is zero.

For (1), "Linear regression" is a name for a specific method; it's already clear that there is only one method called "linear regression" without using "the".

For (2), there is more than one "unbiased method", so use "an".

For (3), "data" is being referred to abstractly. If we were referring to some specific set of data, we would use either "a set of data" (because there are many sets, and this is only one), or "the data" (because, while there are many data points, we are referring to all of them as a set).

For (4), there is only one mean of these residuals.

For (5), all of the residuals are being referred to.

----

Google (1) used to recruit employees (2) by anonymously posting puzzles in public locations (3).

For (1), "Google" is a proper name for a company, so we don't need to clarify that it's unique.

For (2), it's already clear that "employees" is referring to Google's employees, so, while we could say "Google's employees" or "their employees", we don't need to.

For (3), "public locations" refers to locations in general rather than to any specific ones, so no article is needed.

----

New York (1) is the city (2) that hosts IBM's main headquarters.

For (1), "New York" (or "New York City") is a proper name of a city. Even if there are multiple places called New York, we wouldn't say "a New York"; we would clarify with additional information like the US state (New York, New York).

For (2), there is only one city that hosts IBM's main HQ. We write "the" because we're now adding new information beyond simply the name of the city.

(A much cleaner way to write this whole sentence would be "New York hosts IBM's main headquarters")

Some more specific examples and rules can be found in "The definite article" by EF Education First

https://www.ef.com/wwen/english-resources/english-grammar/definite-article/

Mini exercise: Fix the following sentences.

(Answers after the tiger lily)

Q1: It's not possible to describe whole story

Q2: Instead of training a simple model directly from training set,

Q3: simple model outperforms the same model trained directly from training set.

Q4: On each leaf node, samples are split into training set and testing set.

Q5: Figure 1 shows that false positive rate is larger.

Q6: We used the string-manipulation functions in stringr package.

Q7: proof of this theorem is found in appendix.

Q8: In next section, we discuss alternate to our method.

Q9: By equation (1), together with above two implied integrals, we have a complete proof.

Q10: As a result, new method selects very few explanatory variables, which causes poor model fits.

Q11: ensemble approach is based on combination of many models.

Q12: Kernel density estimation is based on mean density of smoothed individual observations.

Q13: We assume that there is no missing data in each of datasets.

Q14: Consider simplified problem, in which N=5.

Q15: Consider simplest problem, in which N=1.

Q16: Note that the minimum value for the test statistic is zero, and that upper bound does not exist.

Q17: Thus, only marginal distribution is known.

Q18: We obtain the following mixture of k densities as the aggregate density for k-th class under the following conditions:

A1: It's not possible to describe [THE] whole story

A2: Instead of training a simple model directly from [A] training set,

A3: [THE] simple model outperforms the same model trained directly from [A] training set.

A4: On each leaf node, samples are split into [A] training set and [A] testing set.

A5: Figure 1 shows that [THE] false positive rate is larger.

A6: We used the string-manipulation functions in [THE] stringr package.

A7: [A] proof of this theorem is found in [THE] appendix.

A8: In [THE] next section, we discuss [AN] alternate to our method.

A9: By equation (1), together with [THE] above two implied integrals, we have a complete proof.

A10: As a result, [THE] new method selects very few explanatory variables, which causes poor model fits.

A11: [THE] ensemble approach is based on [A] combination of many models.

A12: Kernel density estimation is based on [THE] mean density of [THE] smoothed individual observations.

A13: We assume that there is no missing data in each of [THE] datasets.

A14: Consider [A] simplified problem, in which N=5.

A15: Consider [THE] simplest problem, in which N=1.

A16: Note that the minimum value for the test statistic is zero, and that [AN] upper bound does not exist.

A17: Thus, only [THE] marginal distribution is known.

A18: We obtain the following mixture of k densities as the aggregate density for [THE] k-th class under the following conditions:

----

Scientific writing doesn't get the attention it deserves, but the work of Dr. Sandy Littletree's Indigenous Library and Information Services deserves a lot more attention. Library work is more than just shelving books; it's a hefty task to organize the world's information, doubly so for indigenous info. Go 'check out' Dr. Littletree's projects here!


Friday, 26 June 2020
