## Saturday, 18 April 2015

### Using Sampling Order in Network Inference

In this previous post, I had talked about how the order of values were sampled, but I gave a pretty esoteric argument involving compression and loss of information in a sample of 8 values.

For my current thesis project, I'm developing an inference methodology for snowball samples. Snowball samples, or more generally respondent driven samples, are used to sample hard-to-reach populations or to explore network structure. The sampling protocol I'm working with is

1. Select someone to interview from the population by simple random sampling.
2. In the interview, ask that person the number of connections they have (where the definition of a  'connection' is defined by the study; it could be business partners, sexual partners, close friends, or fellow injection drug users.)
3. For each connection the interviewed person has, give that person a recruitment coupon to invite that connected person to come in for an interview, often with an incentive.
• If no connected person responds, go to Step 1 and repeat.
• If one connected person responds, go to Step 2 and repeat.
• If multiple people respond, go to Step 2 and interview each of them.
In reality, people are interviewed in the order they come in for an interview. In my simulation, people are interviewed in a breadth first search system, in which I assume connections to the first interviewed person are followed before any connections to those connections are followed. This may make a difference (which I will explore later), because of two limitations of the sample this protocol is modeled after:

1. When the desired sample size has been attained, no more people are interviewed, even if additional people respond to the invitation. (Budget limitation)
2. Connections that lead back to people already in the sample are ignored. Even the information about there being a connection is missing. (Confidentiality limitation)
The budget limitation implies that the interviewing order impacts the subjects that are included in the sample, which in turn affects any statistics measured, even those for which sampling order is ancillary.

The limitation on reported connections implies that the observed network structure is dependent upon the sampling order, even when controlling for the actual members of a sample. For example, if someone receives invitation coupons from multiple connections, they will only show up as directly connected to the first person whose coupon they use. The other connections may not even be observed to be the same network component, even though they must be by definition.

Furthermore, every sampled network will be a forest. That is, a collection of one or more trees - networks that have no cycles, connections 'back into themselves'. Only being able to observe these tree sub-networks of the actual network limits what inferences can be made about the network as a whole. Observed link density isn't very useful for well-connected graphs because the observed link density is maximal just shy of one link/node, in cases where the observed network is one huge component.

Alternate measures of density, like component diversity, ( the Shannon species diversity index, applied to the components that nodes are in), run into this problem as well. In Figure 1 below, component diversity is useless at estimating the average (responding) degree of the network for networks with more than 1.5 links/node.

 Figure 1, a network measure based only on observed network structure.

However, we can also estimate things like dPr(used)/dt, the rate of change in the proportion of a sample's connections that are also included in the sample over time. Barring variation due to response rate, we would consider such a measure to be non-positive.

For networks that are sparse in large populations, the rate of change should be near zero, because very few connections will lead back into nodes that were already explored. What few recruitments there are will be just as successful up until near the end of the sample when a few may be turned away.

However, for well-connected networks, sampling is done much more often by recruitment rather than by simple random sample. As more a component network is explored, the chance that each successive connection followed leading to someone already in the sample increases. Therefore, for well connected graphs, we would expect dPr(used)/dt to be distinctly negative.

As Figure 2 shows below, we can use the rate of change in successes to differentiate between populations up to 2.5 links/node, even though most of the networks that we observe from 1.5-2.5 links/node all show up as single large components. Also note how poorly dPr(used)/dt differentiates between networks in the region where the structure-based diversity index excels.

 Figure 2, a network measure based on sampling order.

Measures like these, in combination, can be used to make estimates of the features of the population as a whole, but that's for a poster I'm working in the coming weeks.