A.K.A.: "The", the definitive definite article article.

"The", while making up about 7% of all written and
spoken English words, is the hardest word to get right. The rules
surrounding "the" are so difficult to define that comprehensive
dictionaries can spend five or more pages trying...

*This article is meant to be part of the preliminary reading in a 'Writing for Statistics and Data Science' course or a 'Scientific Writing' course. Omitting 'the' or 'a', or putting one in the wrong place, is the most common error I've seen when grading written assignments, and even when copyediting scientific journal articles.*

*As a teacher, it's a tough call whether to put this as an optional 'week 0' reading, or as part of a week 1 lecture. Optional work rarely gets done (although in week 0-1 maybe things are quiet enough), but as a core part of the course it's a pretty bland first impression.*


"The" and "a/an" are parts of speech
called articles. "The" is English's definite article, which means
it's used when there is a specific object being referred to, generally either
because only one exists, or because one example of the item is more important
than all the others.

"A" and "an" are indefinite articles.
They are used to refer to any one of multiple similar items. "An" is
used when the next word starts with a vowel sound (typically A, E, I, O, U, and
a silent H, as in "an hour"), and "a" is used for everything else. What matters
is the sound, not the spelling, so special attention is needed when referring to letters.
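
For single letters read out by name, the choice depends on how the letter's name is pronounced, not on whether the letter itself is a vowel. Here is a minimal Python sketch of that idea, assuming standard English letter names (H as "aitch", U as "you"). The function name `article_for_letter` and the letter set are illustrative assumptions, not part of any library.

```python
# Letters whose spoken names begin with a vowel sound,
# e.g. F = "eff", M = "em", X = "ex". This set is an assumption based on
# standard English letter names ("aitch" for H, not "haitch").
VOWEL_SOUND_LETTERS = set("AEFHILMNORSX")


def article_for_letter(letter: str) -> str:
    """Return "a" or "an" for a single letter that is read out by name."""
    if len(letter) != 1 or not (letter.isascii() and letter.isalpha()):
        raise ValueError("expected a single ASCII letter")
    return "an" if letter.upper() in VOWEL_SOUND_LETTERS else "a"


# Print one example sentence per grade letter.
for grade in "ABCDF":
    print(f"The restaurant got {article_for_letter(grade)} {grade} grade.")
```

Note that U is not in the set even though it is a vowel, because its name starts with a "y" sound.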

"The restaurant got a D grade" (D is pronounced
"dee", which starts with a consonant)

"The restaurant got an F grade" (F is pronounced
"eff", which starts with a vowel)

----

Settling the difference between "a" and
"an" is straightforward, but what about "a" versus
"the"? Consider:

"Here is A glass of water" (This sentence implies
there are multiple glasses of water, and this is just one of them.)

"Here is THE glass of water." (This implies that
there is only one relevant glass of water)

"Here is THE glass of water you requested." (Does
not imply glasses of water in general. It does imply that you requested only
one glass.)

**General rule:** Once something has already been mentioned, there is only one relevant instance of that thing, so "the" is used.

----

*I ran a (1) regression tree analysis to find which variables were important. The (2) analysis showed that "age" was the (3) most important variable.*

For (1), many regression tree analyses could be done and
have been done, and this is just one of those many, so we use "a".

For (2), we're talking about the same analysis as in (1),
but at this point it's been established that there is one relevant analysis, so
we use "the".

For (3), there is only one 'most important variable', so we
use "the".

**General rule:** "The" can refer to multiple objects, as long as it refers to all the objects in that collection.

*After generating a (1) random forest, we found the (2) most important variables. Age was an (3) important variable, but so was income.*

For (1), this random forest is just one of many that could
be generated.

For (2), although there are multiple variables being
referenced, we are referring to all of them, so we use "the" to help
show that there are no others.

For (3), age is only one of the multiple important
variables, so we use "an".

**General rule:** If you want to refer to multiple objects that are only part of a collection, you can do so by prepending some indication that you're only looking at a part.

----

*Half of the (1) data points are positive, and a few of the (2) points are outliers.*

For (1), "the data points" refers to all of the
points, and "Half of" modifies this.

For (2), "the points" is a shorter version of
"the data points", and "a few of" modifies this. We have
already established that they're data points, so we can shorten this to
"points".

----

*Most of the (1) temperature values are negative, but a few of them (2) are above 30.*

For (1), there are multiple temperature values; "the
temperature values" would refer to all of the values, and "most of"
modifies that.

For (2), "them" is a shortcut for referring to
"temperature values", which have already been established as the subject of
the paragraph.

----

*Some of the (1) time, I wonder about rounding errors. Sometimes, I think they (2) are overlooked.*

For (1), "the time" is an abstraction, and it
always needs a modifier, even if that modifier is "all of". We
could shorten "Some of the time" to "Sometimes", as we did
in the second sentence.

For (2), "they" is a shortcut for referring to
"rounding errors". "They" refers to a plural, and there is
only one plural it could refer to.

**General rule:** We can use "the" and "a" to imply things that would not otherwise be obvious in the writing.

----

*The (1) woman who wrote this (2) R package is a (3) genius.*

For (1), "The woman" tells the reader that there
is only one woman who wrote this R package, even if it hasn't yet been
established who that is.

For (2), "this R package" assumes that the reader
knows what R package is being referred to.

For (3), "a genius" implies that more than one
genius exists, and the woman is one of them.

----

*In the (1) third step of the (2) optimization, the (3) parameters are updated to a (4) better fit to the (5) data.*

For (1), there may be multiple steps in the optimization,
but only one of them is the third one.

For (2), there may be many optimizations, but by the rest of
the context, it's clear that one particular optimization is being described. We
could also write "this optimization".

For (3), "the parameters" refers to all the
parameters being used.

For (4), "a better fit" implies that there could
be other better fits too. If it were the best fit, it would have to be "the best
fit" instead of "a best fit", because "best" implies
only one.

For (5), "the data" refers to all of the relevant
data.

**General rule:** If something is obviously unique, or if something is being referred to in the abstract, we usually don't need "the" or "a".

----

*Linear regression (1) is an (2) unbiased method to fit data (3); therefore, the (4) mean of the (5) residuals is zero.*

For (1), "Linear regression" is a name for a
specific method; it's already clear that there is only one method called
"linear regression" without using "the".

For (2), there is more than one "unbiased method",
so use "an".

For (3), "data" is being referred to abstractly. If we
were referring to some specific set of data, we would use either "a set of
data" (because there are many sets, and this is only one), or "the
data" (because, while there are many points of data, we are referring to
all of them as a set).

For (4), there is only one mean of these residuals.

For (5), all of the residuals are being referred to.

----

*Google (1) used to recruit employees (2) by anonymously posting puzzles in public locations (3).*

For (1), "Google" is a proper name for a company,
so we don't need to clarify that it's unique.

For (2), it's already clear that "employees" is
referring to Google's employees, so, while we could say "Google's
employees" or "their employees", we don't need to.

For (3), "public locations" refers to public locations in general
rather than to any particular ones, so no article is needed.

----

*New York (1) is the city (2) that hosts IBM's main headquarters.*

For (1), "New York" (or "New York City")
is a proper name of a city. Even if there are multiple places called New York,
we wouldn't say "a New York"; we would clarify with additional
information like the US state (New York, New York).

For (2), there is only one city that hosts IBM's main HQ. We
write "the" because we're now adding new information beyond simply
the name of the city.

(A much cleaner way to write this whole sentence would be
"New York hosts IBM's main headquarters")

Some more specific examples and rules can be found in "The
definite article" by EF Education First.

**Mini exercise: Fix the following sentences.**

*(Answers follow the questions.)*

Q1: It's not possible to describe whole story

Q2: Instead of training a simple model directly from
training set,

Q3: simple model outperforms the same model trained directly
from training set.

Q4: On each leaf node, samples are split into training set
and testing set.

Q5: Figure 1 shows that false positive rate is larger.

Q6: We used the string-manipulation functions in stringr
package.

Q7: proof of this theorem is found in appendix.

Q8: In next section, we discuss alternate to our method.

Q9: By equation (1), together with above two implied
integrals, we have a complete proof.

Q10: As a result, new method selects very few explanatory
variables, which causes poor model fits.

Q11: ensemble approach is based on combination of many
models.

Q12: Kernel density estimation is based on mean density of
smoothed individual observations.

Q13: We assume that there is no missing data in each of
datasets.

Q14: Consider simplified problem, in which N=5.

Q15: Consider simplest problem, in which N=1.

Q16: Note that the minimum value for the test statistic is
zero, and that upper bound does not exist.

Q17: Thus, only marginal distribution is known.

Q18: We obtain the following mixture of k densities as the
aggregate density for k-th class under the following conditions:

----

A1: It's not possible to describe [THE] whole story

A2: Instead of training a simple model directly from [A] training set,

A3: [THE] simple model outperforms the same model trained directly
from [A] training set.

A4: On each leaf node, samples are split into [A] training set and [A]
testing set.

A5: Figure 1 shows that [THE] false
positive rate is larger.

A6: We used the string-manipulation
functions in [THE] stringr package.

A7: [A] proof of this theorem is found
in [THE] appendix.

A8: In [THE] next section, we discuss
[AN] alternate to our method.

A9: By equation (1), together with
[THE] above two implied integrals, we have a complete proof.

A10: As a result, [THE] new method
selects very few explanatory variables, which causes poor model fits.

A11: [THE] ensemble approach is based
on [A] combination of many models.

A12: Kernel density estimation is
based on [THE] mean density of [THE] smoothed individual observations.

A13: We assume that there is no
missing data in each of [THE] datasets.

A14: Consider [A] simplified problem,
in which N=5.

A15: Consider [THE] simplest problem,
in which N=1.

A16: Note that the minimum value for
the test statistic is zero, and that [AN] upper bound does not exist.

A17: Thus, only [THE] marginal
distribution is known.

A18: We obtain the following mixture
of k densities as the aggregate density for [THE] k-th class under the
following conditions:
