| Title: | Approximate Inclusion Probabilities for Survey Sampling | 
| Version: | 0.1.5 | 
| Date: | 2023-08-26 | 
| Description: | Approximate joint-inclusion probabilities in Unequal Probability Sampling, or compute Monte Carlo approximations of the first and second-order inclusion probabilities of a general sampling design as in Fattorini (2006) <doi:10.1093/biomet/93.2.269>. | 
| Depends: | R (≥ 4.0.0) | 
| License: | GPL-3 | 
| Encoding: | UTF-8 | 
| BugReports: | https://github.com/rhobis/jipApprox/issues | 
| RoxygenNote: | 7.2.3 | 
| Imports: | sampling | 
| NeedsCompilation: | no | 
| Packaged: | 2023-08-26 08:11:03 UTC; Roberto | 
| Author: | Roberto Sichera [aut, cre] | 
| Maintainer: | Roberto Sichera <rob.sichera@gmail.com> | 
| Repository: | CRAN | 
| Date/Publication: | 2023-08-26 08:40:02 UTC | 
jipApprox: Approximate inclusion probabilities for survey sampling
Description
Approximate joint-inclusion probabilities in Unequal Probability Sampling, or compute Monte Carlo approximations of the first and second-order inclusion probabilities of a general sampling design as in Fattorini (2006) <doi:10.1093/biomet/93.2.269>.
Approximation of Joint-inclusion probabilities
Function jip_approx provides a number of approximations of the
second-order inclusion probabilities that require only the first-order inclusion
probabilities. These approximations may be employed in unequal probability sampling
design with high entropy. A more flexible approximation may be obtained by using
function jip_MonteCarlo, which estimates inclusion probabilities
through a Monte Carlo simulation.
The variance of the Horvitz-Thompson total estimator may be then estimated by
plugging the approximated joint probabilities into the Horvitz-Thompson or
Sen-Yates-Grundy variance estimator using function HTvar.
Author(s)
Maintainer: Roberto Sichera rob.sichera@gmail.com
References
Matei, A.; Tillé, Y., 2005. Evaluation of variance approximations and estimators in maximum entropy sampling with unequal probability and fixed sample size. Journal of Official Statistics 21 (4), 543-570.
Haziza, D.; Mecatti, F.; Rao, J.N.K. 2008. Evaluation of some approximate variance estimators under the Rao-Sampford unequal probability sampling design. Metron LXVI (1), 91-108.
Fattorini, L. 2006. Applying the Horvitz-Thompson criterion in complex designs: A computer-intensive perspective for estimating inclusion probabilities. Biometrika 93 (2), 269-278
See Also
Useful links:
- Report bugs at https://github.com/rhobis/jipApprox/issues 
Variance of the Horvitz-Thompson estimator
Description
Compute or estimate the variance of the Horvitz-Thompson total estimator by the Horvitz-Thompson or Sen-Yates-Grundy variance estimators.
Usage
HTvar(y, pikl, sample = TRUE, method = "HT")
Arguments
| y | numeric vector representing the variable of interest | 
| pikl | matrix of second-order (joint) inclusion probabilities; the diagonal must contain the first-order inclusion probabilities. | 
| sample | logical value indicating whether sample or population values are provided.
If  | 
| method | string, indicating if the Horvitz-Thompson ( | 
Details
The Horvitz-Thompson variance is defined as
\sum_{i\in U}\sum_{j \in U} \frac{(\pi_{ij} - \pi_i\pi_j)}{\pi_i\pi_j} y_i y_j 
which is estimated by
\sum_{i\in U}\sum_{j \in U} \frac{(\pi_{ij} - \pi_i\pi_j)}{\pi_i\pi_j\pi_{ij}} y_i y_j 
The Sen-Yates-Grundy variance is obtained from the Horvitz-Thompson variance by conditioning on the sample size n, and is therefore only appliable to fixed size sampling designs:
\sum_{i\in U}\sum_{j > i} (\pi_i\pi_j - \pi_{ij}) \Biggl(\frac{y_i}{\pi_i} - \frac{y_j}{\pi_j} \Biggr)^2 
Its estimator is
\sum_{i\in U}\sum_{j > i} \frac{(\pi_i\pi_j - \pi_{ij})}{\pi_{ij}} \Biggl(\frac{y_i}{\pi_i} - \frac{y_j}{\pi_j} \Biggr)^2 
Examples
### Generate population data ---
N <- 500; n <- 50
set.seed(0)
x <- rgamma(500, scale=10, shape=5)
y <- abs( 2*x + 3.7*sqrt(x) * rnorm(N) )
pik  <- n * x/sum(x)
pikl <- jip_approx(pik, method='Hajek')
### Dummy sample ---
s   <- sample(N, n)
### Compute Variance ---
HTvar(y=y, pikl=pikl, sample=FALSE, method="HT")
HTvar(y=y, pikl=pikl, sample=FALSE, method="SYG")
### Estimate Variance ---
#' HTvar(y=y[s], pikl=pikl[s,s], sample=TRUE, method="HT")
#' HTvar(y=y[s], pikl=pikl[s,s], sample=TRUE, method="SYG")
Brewer sampling procedure ————————————————–
Description
Brewer sampling procedure ————————————————–
Usage
brewer(pik, n, N, s, list)
Arguments
| pik | vector of first-order inclusion probabilities | 
| n | sample size | 
| N | population size | 
| s | vector of length N, with 1s at the positions of self-selecting units | 
| list | vector with positions of self selcting units | 
Note
this function is a modified version of function UPbrewer,
from the sampling package.
Exclude self-selecting units
Description
Exclude self-selecting units and units with probability zero and returns a list with parameters needed to perform sampling
Usage
excludeSSU(pik, eps = 1e-06)
Arguments
| pik | vector of first-order inclusion probabilities | 
| eps | control value for pik | 
Note
the code is taken from package sampling
Check if a number is integer
Description
Check if x is an integer number, differently from is.integer,
which checks the type of the object x
Usage
is.wholenumber(x, tol = .Machine$double.eps^0.5)
Arguments
| x | a scalar or a numeric vector | 
| tol | a scalar, indicating the tolerance | 
Note
From the help page of function is.integer
Transform a Joint-Inclusion Probability data.frame to a matrix
Description
Transform a Joint-Inclusion Probability data.frame to a matrix
Usage
jipDFtoM(jip, symmetric = TRUE)
Arguments
| jip | vector or data.frame containing the joint-inclusion probabilities | 
| symmetric | boolean, if  | 
Value
a symmetric matrix of joint-inclusion probabilities if TRUE, otherwise,
an upper triangular matrix
Transform a matrix of Joint-Inclusion Probabilities to a data.frame
Description
Transform a matrix of Joint-Inclusion Probabilities to a data.frame
Usage
jipMtoDF(jip, id = NULL)
Arguments
| jip | a square matrix of joint-inclusion probabilities, symmetric or upper-triangular | 
| id | optional, vector of id labels, its length should be equal to
 | 
Brewer's joint-inclusion probability approximations
Description
Approximation of joint inclusion probabilities by one of the estimators proposed by Brewer and Donadio (2003)
Usage
jip_Brewer(pik, method)
Arguments
| pik | numeric vector of first-order inclusion probabilities for all population units. | 
| method | string representing one of the available approximation methods. | 
Details
"Brewer18" is the approximation showed in equation (18) of Brewer and Donadio (2003)
Hájek's joint-inclusion probability approximation
Description
Estimate joint-inclusion probabilities using Hájek (1964) equation
Usage
jip_Hajek(pik)
Arguments
| pik | numeric vector of first-order inclusion probabilities for all population units. | 
Hartley-Rao approximation of joint-inclusion probabilities
Description
Approximation of joint-inclusion probabilities with precision of order
O(N^{-4}) for the random systematic sampling design
by Hartley and Rao (1962), pag. 369 eq. 5.15
Usage
jip_HartleyRao(pik)
Arguments
| pik | numeric vector of first-order inclusion probabilities for all population units. | 
Approximate inclusion probabilities by Monte Carlo simulation
Description
Approximate first and second-order inclusion probabilities by means of Monte Carlo simulation. Estimates are obtained as proportion of the number of occurrences of each unit or couple of units over the total number of replications. One unit is added to both numerator and denominator to assure strict positivity of estimates (Fattorini, 2006).
Usage
jip_MonteCarlo(
  x,
  n,
  replications = 1e+06,
  design,
  units,
  seed = NULL,
  as_data_frame = FALSE,
  design_pars,
  write_on_file = FALSE,
  filename,
  path,
  by = NULL,
  progress_bar = TRUE
)
Arguments
| x | size measure or first-order inclusion probabilities, a vector or single-column data.frame | 
| n | sample size (for fixed-size designs), or expected sample size (for Poisson sampling) | 
| replications | numeric value, number of independent Monte Carlo replications | 
| design | sampling procedure to be used for sample selection. Either a string indicating the name of the sampling design or a function; see section "Details" for more information. | 
| units | indices of units for which probabilities have to be estimated. Optional, if missing, estimates are produced for the whole population | 
| seed | a valid seed value for reproducibility | 
| as_data_frame | logical, should output be in a data.frame form? if FALSE, a matrix is returned | 
| design_pars | only used when a function is passed to argument  | 
| write_on_file | logical, should output be written on a text file? | 
| filename | string indicating the name of the file to create on disk,
must include the  | 
| path | string indicating the path to the directory where the output file
should be created; only applies if  | 
| by | optional; integer scalar indicating every how many replications a partial output should be saved | 
| progress_bar | logical, indicating whether a progress bar is desired | 
Details
Argument design accepts either a string indicating the sampling design
to use to draw samples or a function.
Accepted designs are "brewer", "tille", "maxEntropy", "poisson",
"sampford", "systematic", "randomSystematic".
The user may also pass a function as argument; such function should take as input
the parameters passed to argument design_pars and return either a logical
vector or a vector of 0s and 1s,  where TRUE or 1 indicate sampled
units and FALSE or 0 indicate non-sample units.
The length of such vector must be equal to the length of x
if units is not specified, otherwise it must have the same length of units.
When write_on_file = TRUE, specifying a value for aurgument by
will produce intermediate files with approximate inclusion probabilities every
by number of replications. E.g., if replications=1e06 and by=5e05,
two output files will be created: one with estimates at 5e05
and one at 1e06 replications.
This option is particularly useful to assess convergence of the estimates.
Value
A matrix of estimated inclusion probabilities if as_data_frame=FALSE,
otherwise a data.frame with three columns: the first two indicate the ids of the
the couple of units, while the third one contains the joint-inclusion probability
values. Please, note that when as_data_frame=TRUE, first-order
inclusion probabilities are not returned.
References
Fattorini, L. 2006. Applying the Horvitz-Thompson criterion in complex designs: A computer-intensive perspective for estimating inclusion probabilities. Biometrika 93 (2), 269–278
Examples
### Generate population data ---
N <- 20; n<-5
set.seed(0)
x <- rgamma(N, scale=10, shape=5)
y <- abs( 2*x + 3.7*sqrt(x) * rnorm(N) )
pik  <- n * x/sum(x)
### Approximate joint-inclusion probabilities
pikl <- jip_MonteCarlo(x=pik, n = n, replications = 100, design = "brewer")
pikl <- jip_MonteCarlo(x=pik, n = n, replications = 100, design = "tille")
pikl <- jip_MonteCarlo(x=pik, n = n, replications = 100, design = "maxEntropy")
pikl <- jip_MonteCarlo(x=pik, n = n, replications = 100, design = "randomSystematic")
pikl <- jip_MonteCarlo(x=pik, n = n, replications = 100, design = "systematic")
pikl <- jip_MonteCarlo(x=pik, n = n, replications = 100, design = "sampford")
pikl <- jip_MonteCarlo(x=pik, n = n, replications = 100, design = "poisson")
#Use an external function to draw samples
pikl <- jip_MonteCarlo(x=pik, n=n, replications=100,
                       design = sampling::UPmidzuno, design_pars = list(pik=pik))
#Write output on file after 50 and 100 replications
pikl <- jip_MonteCarlo(x=pik, n = n, replications = 100, design = "brewer",
                       write_on_file = TRUE, filename="test.txt", path=tempdir(), by = 50 )
Tillé's approximation of joint-inclusion probabilities
Description
Compute the approximation of joint-inclusion probabilities by means of the Iterative Proportional Fitting Procedure (IPFP) proposed by Tillé (1996)
Usage
jip_Tille(pik, eps = 1e-06, maxIter = 1000)
Arguments
| pik | numeric vector of first-order inclusion probabilities for all population units. | 
| eps | tolerance value for the convergence of the fixed-point procedure | 
| maxIter | a scalar indicating the maximum number of iterations for the fixed-point procedure | 
Approximate Joint-Inclusion Probabilities
Description
Approximations of joint-inclusion probabilities by means of first-order inclusion probabilities.
Usage
jip_approx(pik, method)
Arguments
| pik | numeric vector of first-order inclusion probabilities for all population units. | 
| method | string representing one of the available approximation methods. | 
Details
Available methods are "Hajek", "HartleyRao", "Tille",
"Brewer1","Brewer2","Brewer3", and "Brewer4".
Note that these methods were derived for high-entropy sampling designs,
therefore they could have low performance under different designs.
Hájek (1964) approximation [method="Hajek"] is derived under Maximum Entropy sampling design
and is given by
\tilde{\pi}_{ij} = \pi_i\pi_j \frac{1 - (1-\pi_i)(1-\pi_j)}{d} 
where d = \sum_{i\in U} \pi_i(1-\pi_i) 
Hartley and Rao (1962) proposed the following approximation under
randomised systematic sampling [method="HartleyRao"]:
\tilde{\pi}_{ij} = \frac{n-1}{n} \pi_i\pi_j + \frac{n-1}{n^2} (\pi_i^2 \pi_j + \pi_i \pi_j^2)
      - \frac{n-1}{n^3}\pi_i\pi_j \sum_{i\in U} \pi_j^2
 + \frac{2(n-1)}{n^3} (\pi_i^3 \pi_j + \pi_i\pi_j^3 + \pi_i^2 \pi_j^2)
      - \frac{3(n-1)}{n^4} (\pi_i^2 \pi_j + \pi_i\pi_j^2) \sum_{i \in U}\pi_i^2
+ \frac{3(n-1)}{n^5} \pi_i\pi_j \biggl( \sum_{i\in U} \pi_i^2 \biggr)^2
      - \frac{2(n-1)}{n^4} \pi_i\pi_j \sum_{i \in U} \pi_j^3  
Tillé (1996) proposed the approximation \tilde{\pi}_{ij} = \beta_i\beta_j,
where the coefficients \beta_i are computed iteratively through the
following procedure [method="Tille"]:
-  \beta_i^{(0)} = \pi_i, \,\, \forall i\in U
-  \beta_i^{(2k-1)} = \frac{(n-1)\pi_i}{\beta^{(2k-2)} - \beta_i^{(2k-2)}}
-  \beta_i^{2k} = \beta_i^{(2k-1)} \Biggl( \frac{n(n-1)}{(\beta^(2k-1))^2 - \sum_{i\in U} (\beta_k^{(2k-1)})^2 } \Biggr)^(1/2)
with \beta^{(k)} = \sum_{i\in U} \beta_i^{i}, \,\, k=1,2,3, \dots 
Finally, Brewer (2002) and Brewer and Donadio (2003) proposed four approximations, which are defined by the general form
\tilde{\pi}_{ij} = \pi_i\pi_j (c_i + c_j)/2  
where the c_i determine the approximation used:
- Equation (9) [ - method="Brewer1"]:- c_i = (n-1) / (n-\pi_i)
- Equation (10) [ - method="Brewer2"]:- c_i = (n-1) / \Bigl(n- n^{-1}\sum_{i\in U}\pi_i^2 \Bigr)
- Equation (11) [ - method="Brewer3"]:- c_i = (n-1) / \Bigl(n - 2\pi_i + n^{-1}\sum_{i\in U}\pi_i^2 \Bigr)
- Equation (18) [ - method="Brewer4"]:- c_i = (n-1) / \Bigl(n - (2n-1)(n-1)^{-1}\pi_i + (n-1)^{-1}\sum_{i\in U}\pi_i^2 \Bigr)
Value
A symmetric matrix of inclusion probabilities, which diagonal is the vector of first-order inclusion probabilities.
References
Hartley, H.O.; Rao, J.N.K., 1962. Sampling With Unequal Probability and Without Replacement. The Annals of Mathematical Statistics 33 (2), 350-374.
Hájek, J., 1964. Asymptotic Theory of Rejective Sampling with Varying Probabilities from a Finite Population. The Annals of Mathematical Statistics 35 (4), 1491-1523.
Tillé, Y., 1996. Some Remarks on Unequal Probability Sampling Designs Without Replacement. Annals of Economics and Statistics 44, 177-189.
Brewer, K.R.W.; Donadio, M.E., 2003. The High Entropy Variance of the Horvitz-Thompson Estimator. Survey Methodology 29 (2), 189-196.
Examples
### Generate population data ---
N <- 20; n<-5
set.seed(0)
x <- rgamma(N, scale=10, shape=5)
y <- abs( 2*x + 3.7*sqrt(x) * rnorm(N) )
pik  <- n * x/sum(x)
### Approximate joint-inclusion probabilities ---
pikl <- jip_approx(pik, method='Hajek')
pikl <- jip_approx(pik, method='HartleyRao')
pikl <- jip_approx(pik, method='Tille')
pikl <- jip_approx(pik, method='Brewer1')
pikl <- jip_approx(pik, method='Brewer2')
pikl <- jip_approx(pik, method='Brewer3')
pikl <- jip_approx(pik, method='Brewer4')
Conditional Poisson Sampling (maximum entropy sampling)
Description
Draw a sample by means of Conditional Poisson Sampling
Usage
maxEntropy(pik, N, q)
Arguments
| pik | vector of first-order inclusion probabilities | 
| N | population size (excluding self-selecting units) | 
| q | matrix of selection probabilities, computed by means of function
 | 
Note
this functions is a modified version of function UPmaxentropy,
in the sampling package.
Conditional Poisson Sampling - compute selection probabilities
Description
Compute matrix of selection probabilities for Conditional Poisson Sampling
Usage
pre_CPS(pik)
Arguments
| pik | vector of first-order inclusion probabilities | 
Note
this functions is a modified version of function UPmaxentropy,
in the sampling package.
Tillé's elimination procedure - elimination probabilities
Description
Computes a matrix with elimination probabilities for each step of Tillé's elimination procedure
Usage
pre_tille(pik)
Arguments
| pik | vector of first-order inclusion probabilities | 
Rao-Sampford sampling
Description
Draw a sample by means of Rao-Sampford sampling
Usage
sampford(pik, n, N, s, list)
Arguments
| pik | vector of first-order inclusion probabilities | 
| n | sample size | 
| N | population size (excluding self-selecting units) | 
| s | vector of length N, with 1s at the positions of self-selecting units | 
| list | vector with positions of self selcting units | 
Note
this function is a modified version of function UPsampford,
in the sampling package.
Save partial results
Description
Write joint inclusion probability estimates on a file every by replications
Usage
save_output(
  iteration,
  design_name,
  counts,
  units,
  filename,
  path,
  status,
  as_data_frame
)
Arguments
| iteration | integer indicating the iterations the simulation is at | 
| design_name | string indicating the name of the sampling design to include in the filename | 
| counts | matrix with number of occurrences of couple of units up to current replication | 
| units | id of units for which output should be saved | 
| filename | name of the file to write on disk | 
| path | path to the directory where the file should be saved | 
| status | 
 | 
| as_data_frame | logical, should output be in form of a data frame? | 
Tillé's elimination procedure
Description
Draw a sample by means of Tillé's elimination procedure
Usage
tille(pik, n, N, s, list, pmat)
Arguments
| pik | vector of first-order inclusion probabilities | 
| n | sample size (excluding self-selecting units) | 
| N | population size (excluding self-selecting units) | 
| s | vector of length N, with 1s at the positions of self-selecting units | 
| list | vector with positions of self selcting units | 
| pmat | matrix of dimension $(N-n)xN, where each row has elimination probabilities for population units for one step of the procedure. |