
Friday, 20 May 2016

Biochar farming idea

Here's an idea I hope gets stolen if it's good, and shot down early if it's bad. Also, I claim no expertise in biology, apiary science, or soil ecology, so this could be drastically off the mark.



1. Pyrolyze massive amounts of organic waste to make biochar.
2. Use the biochar to make a soil that mimics land after a forest fire.
3. Grow high-value crops that grow best after a forest fire.
4. Incorporate secondary processes like beehives.

------------

First, pyrolysis is a technique for burning lignin-rich plant matter (e.g. wood, bamboo, corn stalks) in a low-oxygen environment. Pyrolysis can produce a stable, porous form of charcoal called biochar. Biochar is commonly applied to soil to improve its capacity to store water and nutrients. Soil with large amounts of biochar can approximate a very high-quality topsoil called terra preta.

Also, the process of creating biochar is carbon-negative. The carbon in the biochar is locked out of the atmosphere more permanently than it would be if the biological matter were left to rot, and it certainly keeps carbon out of the atmosphere more effectively than simple burning.

What happens, however, when the soil is mostly biochar, with just enough other parts (or a layer on top) to keep it from blowing away? I suspect that you would have a soil similar to what would be found on the ground after a forest fire.

Pioneer species are the first organisms to thrive after a forest fire. These include valuable species like morel mushrooms and, particularly, fireweed. I've heard that honey derived from fireweed is especially valuable. However, cultivating fireweed honey is difficult because it has to be done in areas of recent forest fires, and therefore can't be done in the same place for very long.

My idea is to make soil that mimics the ground after a recent fire, and maintain that state with frequent renewal of biochar. Then, I want to use that soil to cultivate pioneer species.

With enough biochar, my hope is to have some farmland that can grow fireweed every year, and have permanently installed beehives amongst the fireweed. Rather than move the hives to where the land is suitable, I will work to maintain the land in a suitable state. With luck, the practice will pay for itself in harvests of fireweed honey, morels, and excess biochar.

If this works, then it can become a business that is both long-term profitable and carbon negative. There is an established demand for morels, and with more consumer awareness there could be a large demand for fireweed honey, so there is room for lots of people to try this.

--------

One potential issue is that bees prefer some flowers to others, and growing fireweed near the beehives doesn't guarantee that fireweed is what the bees will visit. A honey whose nectar was part fireweed and part wildflowers may be acceptable, but even guaranteeing that blend would mean influencing an area much larger than the farmland.

Another issue is a sustainable supply of lignin-rich organic matter. Piles of slash and scrap wood are unreliable, one-time sources. Corn husks work nicely, but aren't available everywhere. Low-grade wood chips, called hog fuel, would also work, but those are already used by pulp mills for energy. The city of Vancouver is already collecting organic waste for industrial composting, so it may be possible to tap into that pipeline. It will take some research to find out what is regularly available from local food processing and agriculture, and how to use it.


--------

ADDENDUM: user Osageandrot on Reddit, who does research on biochar for soil restoration, was kind enough to critique this idea and give some input. It was enough to show that this idea needs a major rework before anything more goes forward.

First, any fresh biochar would need to be pre-aged by mixing it with existing soil, because fresh biochar contains polycyclic aromatic hydrocarbons (PAHs) at concentrations high enough to suppress microbial life.

Second, fresh applications of biochar may not be necessary anyway. Most pioneer species are simply fast growers that need slower-growing but ultimately taller competitors to be gone. Forest fires are not a necessary ingredient for a lot of these species; fires just happen to provide the right conditions, and clearcutting would do the same thing.

Tuesday, 17 May 2016

Course Notes on the PitchRx package

These are the notes I recently delivered as a guest lecturer for Simon Fraser University's course on Sports Analytics. It's a course for people with some experience with R, but not necessarily experts. As such, I made these notes with beginners in mind.

If you want to see what PitchRx can really do, I recommend the links below. However, if you want to get started and you don't have much familiarity with R, and possibly none with SQL, the following notes are for you.


Tuesday, 10 May 2016

Kepler - The Biggest of Deals

Astrobiology, the study of life as it pertains to outer space, is the most important, and among the least useful, of fields. It involves the beginning and probable end of life as we know it, but what we find is too large to be used by anyone, or even everyone.

On May 10, 2016, NASA released this image:


The blue circles represent planets that have been previously found (confirmed), mostly by the Kepler satellite in the last few years. The orange shaded circles are those found since NASA's last announcement on the matter.

The size of each circle is proportional to the (estimated) size of the planet, and the vertical position is essentially the brightness of the star* that the planet orbits. Near the top is our sun, a white** star, and further down are cooler, smaller, redder stars. The further to the right a planet is, the dimmer its star appears from that planet. Notice that Mars is to the right of Earth.

That green band down the middle of the chart is the habitable zone. Planets in that range are possibly the right temperature to support carbon-based life. That doesn't mean these planets can support life, just that the first two criteria, heat and radiation, are in the right ranges. Without that, terraforming for long-term carbon-based life is impossible.

Now that we understand the graph, some remarks.

This is amazing! When I graduated from high school, finding other planets meant speculating about a single planet beyond Pluto. Now, NASA confirms the existence of nearly 1300 newly found planets in the last year! Of those, nine new ones are in or near the habitable zone. These are all pointlessly far away, but it's a leap from nothing to something in our lifetimes.





We still don't know how relatively abundant these small, rocky, habitable-zone planets are, because larger, more massive gas giants are easier to find. It's also worth considering that the sun is a bit unusual in its high metal content (i.e. anything but hydrogen and helium) compared to other stars like it (i.e. its population). That this many rocky planets have been found is pretty marvelous.

Space is exciting!



Next, do you see how Earth is close to the too-hot edge of the habitable zone? It wasn't always that close. This has nothing to do with global warming on a human-history scale; main sequence stars get hotter over time.

A few billion years ago, when life forms were much simpler, Earth would have shown up further to the right and a little bit down from where it is now. Earth would have been a lot cooler than it is now were it not for the fact that its atmosphere was mostly carbon dioxide and its core had more radioactivity. Not only does Earth support life now, but its conditions have changed to offset the changes in the star it orbits, in such a way that life was continually sustainable long enough to develop its current complexity.

The theory that life developed on Earth from self-replicating proteins and lipids is called abiogenesis, as in 'creation from non-life'. However, a growing body of evidence suggests that Terran life is currently too complex to have developed in the available time if it started on Earth. A competing theory, called panspermia, 'life everywhere', suggests that some very simple life arrived inside a meteor, after being kicked into space by something Michael Bay dreamed up.

This early life could have come from anywhere, but Mars is a likely source. NASA has also recently found flowing water on Mars, though it tends to boil away quickly without any air pressure. There's substantial evidence to suggest that ancient Mars was much warmer, with a thicker atmosphere, and that the atmosphere slowly boiled away because the planet didn't have enough gravity or magnetic field to keep it.

So to get to our current level of life complexity, we needed not one, but two habitable planets in order to buy enough time to develop. We didn't do it with much room for error either.

Remember how Earth is near the too-hot edge of the habitable zone? Well, the sun is still getting hotter, and there's nothing within humanity's power to prevent that. In roughly half a billion years, Earth, assuming its orbit stays the same, will have an average temperature of 55 C, and all remaining carbon will be locked away in rocks and out of the atmosphere. Without that carbon, no plant life can exist, and neither can we. Nothing short of moving the Earth itself to a wider orbit can prevent that in the long term.

To put that in perspective, of the time that life can exist on Earth, that span is nearly 90% over, assuming the best-case scenario.

So...

1. We are lucky, insanely lucky, to exist.
2. Regarding the lack of contact from alien life, we could very well be past whatever stops most life from reaching any technological level - otherwise known as the Great Filter.
3. We can't stay home forever.

-------------------------

* assuming main sequence stars, like our sun.
** Yes, white. It only looks yellow through our atmosphere.

Friday, 22 April 2016

Academic Proofreading / Copy Editing Samples

With the PhD wrapping up this summer, I can't default to 'do a higher degree' and have to go find a real job.

One option I've been considering is work in scientific, academic publishing. As a job, or just as a source of supplemental income, it seems ideal to me. It's the kind of job where I could actually add value to research by making it more ready to disseminate. Also, I have research experience in statistics, health science, molecular biology, and education. I write habitually. I'm a native English speaker who can also check the mathematical, and especially the statistical, assertions in an academic paper for correctness before it goes to an editor, or to the public.

Copy editing work can be done without leaving Vancouver; in fact, it can be done from a houseboat, or a houseboat city. Reading technical reports and academic papers would keep me actively learning and discovering. The work can be done at any time of day, and the amount of work can be adjusted to fit other, more time-specific activities.

There are companies like ManuscriptEdit and Scribendi that dispatch editing work to their own academics on contract. Many of their editors are PhDs with established careers and long publication records. These companies, understandably, ask for a proven record of copy editing ability and writing experience. Blog posts probably don't count.

There are a couple of certification programs: one from the Board of Editors in the Life Sciences (BELS), which operates internationally, and one from the Editors' Association of Canada (EAC), whose scope is editing and proofreading in general. A certification would be great because it's shorthand for proof of ability. For the BELS exam, the most convenient sitting is this November in Florida, and I doubt I could be ready for it even if I could go. The EAC exams are probably doable locally, especially since their annual meeting is in Vancouver this summer, but certification is a multi-year process.

So, I tried something with a smaller commitment. I selected short articles from open access journals, specifically ones with grammatical mistakes in their abstracts. Then, I printed out these articles, copy edited them as if they had been given to me before publication, and sent the results to the journals' editors, each with a request to be considered for future contract work.

I copy edited four articles that I can share here. The first three are recently published open-access articles, for which I received two 'no' responses. The last is one of my own that was recently submitted, but I have permission from the other authors to share it here. Even though I didn't get a positive response, I wasn't expecting one, and getting any reply to two of three cold requests feels pretty encouraging; it means I'm getting attention. I also got an invitation to be a volunteer peer reviewer for future papers, so there's that for connections too.

I'm still reading about the copy editing process, so I think the last two are better than the first two.

Paper 1: Open Journal of Statistics - Predictive Modeling of Gas Production, Utilization, and Flaring in Nigeria...

Paper 2: American Journal of Computational Mathematics - Self Similarity Analysis of Web Users Arrival Pattern at Selected Web Centers

Paper 3: Journal of Data Analysis and Information Processing - Role of Feature Selection on Leaf Image Classification

Paper 4: Submitted - Tactics for Twenty20 Cricket.

nhlscrapr revisited

VanHAC, the Vancouver Hockey Analytics Conference, was April 9th, and I presented a tutorial there on the nhlscrapr package for R.

This post is excerpted from the code I presented and gave out at the tutorial. The full tutorial expands on my previous 'package spotlight' post on nhlscrapr. This post only includes the bare bones of downloading the raw games, examining the rate of goals scored and shots fired throughout the game, and making a basic player summary.

Also included is a patch I wrote for nhlscrapr that fixes a couple of functions ( full.game.database() , player.summary() ) that were throwing errors, and adds a function ( aggregate.roster.by.name() ) that aids in matching player summaries to the proper names.


Monday, 11 April 2016

Reflections / Postmortem on teaching Stat 302 1601

This was the second course for which I have been the lecturer, although I’ve had the bulk of the responsibility for several online courses as well. Every other course I’ve been responsible for had between 30 and 140 students. This one had 300.

Stat 302 is a course aimed at senior-undergraduate life and health science majors who have completed a couple of quantitative courses, including a similarly directed 200-level statistics course. It involved 3 hours of lectures per week for 14 weeks, a drop-in tutoring centre in lieu of a directed lab section, 4 assignments, 2 midterms, and a final exam. The topics largely surrounded ANOVA, regression, modeling in general, and an introduction to some practical concerns like causality.

The standard textbook for this course was Applied Regression Analysis and Other Multivariate Methods (5th ed), but I opted not to use it to allow for more focus on practical aspects (at the cost of mathematical depth), as well as to save my students a collective $60,000.

I delivered about 75% of the lectures as fill-in-the-blank notes, where I had written pdf slides and sent them out to the students, but removed key words in the pre-lecture version of the slides. After each lecture the filled slides were made available. The rest of the lectures were in a practice problem / case studies format, where I sent out problems to solve before class, and solved them on paper under a video camera, with written and verbal commentary, during class. Most of these were made available too.

Everything can be found at http://www.sfu.ca/~jackd/Stat302.html for now.




What worked:

1. Focusing on the practical aspects of the material. This was a big gamble because it was a break from previous offerings of the course, and meant I had a lot less external material to work from. It was worth the risk, and I’m proud of the course that was delivered.

I was able to introduce the theory of an ambitious range of topics, including logistic regression, with time to spare. The extra time was used for in-depth examples. This example time added a lot more value than an equal amount of time on formulae would have. It more closely reflected how these students will encounter data in future courses and projects, and the skills they will need to analyze that data.


The teaching assistants who talked to me about it had good things to say about the shift. The keener students asked questions of startling depth and insight. I feel there were only a few cases where understanding was less than it would have been had I given a more solid, proof-based explanation of some method or phenomenon, rather than the data-based demonstrations I relied upon.

Although making the notes for the class was doubly hard, because it was my first time and because I was breaking from the textbook, those notes will stand on their own in future offerings of Stat 302 and of similar courses. As a long-term investment, it will probably pay off. For this class, it probably hurt the attendance rate, because students knew the filled notes would be available without attending. My assumption about the non-attendees is that they would gain little from showing up that they couldn’t get from reading the notes and doing the assignments.




2. Using R. At the beginning of the semester, I polled the students about their experience with different statistical software, and the answers were diverse. A handful of students had done statistics with each of SPSS, JMP, SAS, Excel, and R, without much overlap. That meant that any software I chose would be new to most of the students, so I fell back to my personal default of R.

Using R meant that I could essentially do the computation for the students by providing the necessary code with their assignments. It saved some lecture time that would otherwise have been spent on a step-by-step of how to manage in a point-and-click environment. It also saved the lecture time and personal time that would have gone to the inevitable license and compatibility issues of using anything not open source.

Also, now the students have experience with an analysis tool that they can access after the class is over. Even though many students had no programming experience, I feel like they got over the barrier of programming easily enough. There were some major hiccups which can hopefully be avoided in the future.




3. Announcing, and keeping, a hard line on late assignments. In my class, hard copies of assignments were handed in to a drop box for a small army of teaching assistants to grade and return. Late assignments would have added a new layer of logistics to this distribution, so I announced on the first day that late assignments would not be graded at all. This also saved me a lot of grief in judging which excuses for lateness were ‘worthy’ of mercy, or in trying to verify them.


4. Using a photo-to-PDF app on my phone. It’s faster and more convenient than using a scanner. Once I started using one, posting keys and those case study style lecture notes became a lot easier.


5. Including additional readings in the assignments.  The readings provided the secondary voice to the material that would have otherwise been provided by the textbook. Since I've posted answers to the questions I wrote, I will need to make new questions in order to reuse the articles, but the discovery part is already done.

6. The Teaching Assistants. By breaking from the typical material, I was also putting an extra burden on the teaching assistants to also have knowledge beyond the typical Stat 302 offering. They kept this whole thing together, and they deserve some credit.



What I learned by making mistakes:


1. USE THE MICROPHONE. I have good lungs and a very strong voice, so even when a microphone is available, my preference has been to deliver lectures unaided. This approach worked up until one morning in Week 3 when I woke up mute and choking on my own uvula. Two hours of lectures had to be cancelled.


2. Use an online message board. For a large class, having a message board goes a long way. It allows you to answer a student’s question once or twice, rather than several times over e-mail. I had underestimated the number of times I would get the same question, and answering it in class didn’t seem to help because of the 45-60% attendance rate. Other than the classroom, my only option was to send out a mass email, which, aside from sending out lecture notes, I did sparingly.

A message board also serves the same purpose of a webpage as a repository of materials like course notes, datasets, and answer keys. 


3. Do whatever you can in advance. Had I spent more time writing rough drafts of lectures, or making or finding datasets, before the start of class in January, that time would have paid off more than one-to-one. How? Because I still had to do that work AND deal with the effects of lost sleep afterwards. There were a few weeks where my life was a cycle of making material for the class at the last minute, and recovering from working until dawn. Thank goodness I was only responsible for one course.


4. Distrust your own code. I have a lot of experience with writing R code on the fly, so I thought I could get away with minimal testing of the example code I wrote and gave out with assignments. Never again.

One of my assignments was a logistical disaster. First, a package dependency had recently changed, so even though on my system I could get away with a single library() call to load every function needed for the assignment, many students needed two. For others, the package couldn't be installed at all.

Also, when testing the code for a question, I had removed all the cases with missing data before running an analysis. I didn’t think it would make any difference because the regression function, lm() removes these cases automatically anyways. It turns out that missing data can seriously wreck the stepAIC() function, even if the individual lm() calls within the function handle it fine.

In the future, I will either take any necessary functions from packages and put them into a code file that can be loaded with source(), or I will provide the desired output along with the example code. This also ties back into working until dawn: quality suffers.
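For anyone who hasn't hit this, here's a minimal R sketch of the failure mode: the data frame and variable names are made up purely for illustration, but the pattern of removing incomplete cases once, before model selection, is the fix described above.

```r
library(MASS)  # for stepAIC()

# Illustrative data with one missing predictor value
dat <- data.frame(y  = c(1.2, 3.4, 2.2, 5.1, 4.0, 6.3),
                  x1 = c(1, 2, NA, 4, 5, 6),
                  x2 = c(2, 1, 3, 2, 4, 3))

# lm() silently drops the incomplete row, but stepAIC() compares models
# that can end up fitted to different numbers of rows once a variable
# containing NAs is dropped, making the AICs incomparable (or raising
# an error). Safer: remove incomplete cases once, up front.
dat_complete <- na.omit(dat)
full <- lm(y ~ x1 + x2, data = dat_complete)
best <- stepAIC(full, trace = FALSE)
```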



5. Give zero weight to assignments. The average score on my assignments was about 90%, with little variation. As a means of separating students by ability, the assignments failed completely. As a means of providing a low-stakes venue for learning without my supervision, I can’t really tell. The low variation and other factors in the class suggest a lot of copying or collusion. Identifying which students are copying each other, or are merely writing things verbatim from my notes, is infeasible, even with teaching assistants. The number of comparisons grows with the SQUARE of the number of students, and comparisons are already hard to judge fairly in statistics.
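To put a number on that quadratic growth, using this course's enrollment of 300:

```r
# Number of unordered pairs among n students: choose(n, 2) = n*(n-1)/2
n <- 300
choose(n, 2)  # 44850 pairwise comparisons to check for copying
```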

One professor here with a similar focus on practicality, Carl Schwarz, gives 0% grade weight to assignments in some classes. The assignments are still marked for feedback, I assume, but only exams are used for summative evaluation. This would be ideal for the next time I teach this course.

I would expect the honest and interested students to hand in work for practice and feedback and they would not be penalized grade-wise for not handing in a better, but copied, answer. I would expect the rest of the students to simply not hand anything in, which isn’t much worse for them than copying, and would save my teaching assistants time and effort.

Friday, 4 March 2016

Reading Assignments - Model Selection and Missing Data

These are two more readings that were incorporated into a 3rd year stats course geared towards life and health sciences. One is a model selection paper geared towards ecologists, and the other is a paper on missing data and imputation in the context of medicine and survival analysis.