Featured post

Textbook: Writing for Statistics and Data Science

If you are looking for my textbook Writing for Statistics and Data Science here it is for free in the Open Educational Resource Commons. Wri...

Saturday 18 April 2015

Using Sampling Order in Network Inference

In this previous post, I had talked about how the order of values were sampled, but I gave a pretty esoteric argument involving compression and loss of information in a sample of 8 values.

For my current thesis project, I'm developing an inference methodology for snowball samples. Snowball samples, or more generally respondent driven samples, are used to sample hard-to-reach populations or to explore network structure. The sampling protocol I'm working with is


  1. Select someone to interview from the population by simple random sampling.
  2. In the interview, ask that person the number of connections they have (where the definition of a  'connection' is defined by the study; it could be business partners, sexual partners, close friends, or fellow injection drug users.)
  3. For each connection the interviewed person has, give that person a recruitment coupon to invite that connected person to come in for an interview, often with an incentive.
    • If no connected person responds, go to Step 1 and repeat.
    • If one connected person responds, go to Step 2 and repeat.
    • If multiple people respond, go to Step 2 and interview each of them.
In reality, people are interviewed in the order they come in for an interview. In my simulation, people are interviewed in a breadth first search system, in which I assume connections to the first interviewed person are followed before any connections to those connections are followed. This may make a difference (which I will explore later), because of two limitations of the sample this protocol is modeled after:

  1. When the desired sample size has been attained, no more people are interviewed, even if additional people respond to the invitation. (Budget limitation)
  2. Connections that lead back to people already in the sample are ignored. Even the information about there being a connection is missing. (Confidentiality limitation)
The budget limitation implies that the interviewing order impacts the subjects that are included in the sample, which in turn affects any statistics measured, even those for which sampling order is ancillary.

The limitation on reported connections implies that the observed network structure is dependent upon the sampling order, even when controlling for the actual members of a sample. For example, if someone receives invitation coupons from multiple connections, they will only show up as directly connected to the first person whose coupon they use. The other connections may not even be observed to be the same network component, even though they must be by definition.

Furthermore, every sampled network will be a forest. That is, a collection of one or more trees - networks that have no cycles, connections 'back into themselves'. Only being able to observe these tree sub-networks of the actual network limits what inferences can be made about the network as a whole. Observed link density isn't very useful for well-connected graphs because the observed link density is maximal just shy of one link/node, in cases where the observed network is one huge component.

Alternate measures of density, like component diversity, ( the Shannon species diversity index, applied to the components that nodes are in), run into this problem as well. In Figure 1 below, component diversity is useless at estimating the average (responding) degree of the network for networks with more than 1.5 links/node.


Figure 1, a network measure based only on observed network structure.

However, we can also estimate things like dPr(used)/dt, the rate of change in the proportion of a sample's connections that are also included in the sample over time. Barring variation due to response rate, we would consider such a measure to be non-positive. 

For networks that are sparse in large populations, the rate of change should be near zero, because very few connections will lead back into nodes that were already explored. What few recruitments there are will be just as successful up until near the end of the sample when a few may be turned away. 

However, for well-connected networks, sampling is done much more often by recruitment rather than by simple random sample. As more a component network is explored, the chance that each successive connection followed leading to someone already in the sample increases. Therefore, for well connected graphs, we would expect dPr(used)/dt to be distinctly negative. 

As Figure 2 shows below, we can use the rate of change in successes to differentiate between populations up to 2.5 links/node, even though most of the networks that we observe from 1.5-2.5 links/node all show up as single large components. Also note how poorly dPr(used)/dt differentiates between networks in the region where the structure-based diversity index excels.

Figure 2, a network measure based on sampling order.


Measures like these, in combination, can be used to make estimates of the features of the population as a whole, but that's for a poster I'm working in the coming weeks.
 

Monday 13 April 2015

Temporary Blogging Slowdown, Future Plans

It was my original intention to publish four posts on this blog throughout each week. However, I am currently in the last few months of my PhD thesis in Statistics at Simon Fraser University.

Most of my writing effort is being funneled into that in the hopes of defending before the end of next semester. As such, any posts that appear here until the thesis is submitted will be less frequent and of lower effort, such as the laser tag ranking post instead of the goal difference average posts.

With luck, there will be a burst of high-quality material here in here in the late summer and fall to make up for the slack and bring the average back to 4+ good posts per month. I have a lot left to write about, including:

- Snowball sampling and the network analysis work I'm doing currently.

- Milaap, Kiva, Sunfunder, and lending for good and profit.

- Driving forces behind application specific integrated integrated circuits (ASICs), specifically Bitcoin and Super Nintendo.

- FPGAs , and a pipe dream to build an ASIC for large matrix operations.

- Dealing with scrap textiles, composting, crafting, and future industrial needs.

- Some arguments for landfills being the future of mining. Methane harvests, leachate harvests, necessary drilling advances, and the possible gamification of sorting trash.

- Mental health in academia, and mental health personally.

- Many more kudzu levels.

- Plans for a kudzu app, and a pipe dream of making it a carbon negative game through ad revenue and the European Carbon Exchange.

- An essay on how we all suffer from  diversity defecits in heavily gendered fields, as written by a white, Anglophone, heterosexual, cis, middle class man from a nuclear family in the suburbs. (Not a satire, despite the title)

- ongoing development of a senior undergrad course on data formatting and management,

- more R package spotlights, with priority on mice for multiple imputation, and on gologit for modelling ordinal data.

- ongoing development of the R packette for sums of variables.

- ongoing development of an R packette on "debinning", and the methodology surrounding the partial recovery of within-bin data from binned data and covariate information.

- A 'to do' list for humanity, and preparation for a world of automation.

- A commentary on the energy transistion, peak child, and how the tar sands, arctic drilling, and shale fields  cannot possibly be profitable in the long term.