---
title: "Duplication analysis"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Duplication analysis}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(scrutiny)
```

You can use scrutiny to detect duplicate values in any dataset. Checking for duplicates can go a long way in assessing the reliability of published research.

This vignette walks you through scrutiny's tools for detecting, counting, and summarizing duplicates. It uses the `pigs4` dataset as a simple example:

```{r}
pigs4
```

## Frequency tabulation with `duplicate_count()`

A good first step is to create a frequency table. To do so, use `duplicate_count()`:

```{r}
pigs4 %>% 
  duplicate_count()
```

It returns a tibble (data frame) that lists each unique `value`. The tibble is ordered by the `frequency` of values in the input data frame, so the values that appear most often are at the top. The `locations` are the names of all the columns in which a given value appears. They are counted by `locations_n`.

For example, `5.17` is the most frequent value in `pigs4`. It appears 3 times (`frequency`), namely in the `snout`, `tail`, and `wings` columns, so `locations_n` is also `3`. The next most frequent value is `4.22`, which appears twice. Both of these instances are in the `snout` column, so `locations_n` is `1`.
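To make the difference between `frequency` and `locations_n` concrete, here is a minimal base R sketch of the same kind of tabulation. It is not how scrutiny implements `duplicate_count()`. The `toy` data frame and all object names are hypothetical and only serve as an illustration:

```{r}
# Hypothetical toy data, not part of scrutiny
toy <- data.frame(
  a = c("1.5", "2.0", "2.0"),
  b = c("1.5", "3.1", "4.8"),
  stringsAsFactors = FALSE
)

# Flatten all cells into one vector, remembering each cell's column
values  <- unlist(toy, use.names = FALSE)
columns <- rep(names(toy), each = nrow(toy))

# How often each unique value occurs in the whole data frame
frequency <- table(values)

# In how many distinct columns each unique value occurs
locations_n <- tapply(columns, values, function(x) length(unique(x)))

freq_table <- data.frame(
  value = names(frequency),
  frequency = as.integer(frequency),
  locations_n = as.integer(locations_n[names(frequency)]),
  stringsAsFactors = FALSE
)

# Most frequent values first
freq_table <- freq_table[order(-freq_table$frequency), ]
freq_table
```

In this toy example, `"1.5"` and `"2.0"` both have a `frequency` of 2, but only `"1.5"` is spread across two columns. This is the same distinction as between `5.17` and `4.22` above.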

Run `audit()` after `duplicate_count()` to get summary statistics for the two numeric columns, `frequency` and `locations_n`:

```{r}
pigs4 %>% 
  duplicate_count() %>% 
  audit()
```

## Counting by column pair with `duplicate_count_colpair()`

Sometimes, a sequence of data may be repeated in multiple columns. `duplicate_count_colpair()` helps find such cases:

```{r}
pigs4 %>% 
  duplicate_count_colpair()
```

`x` and `y` represent all combinations of columns in `pigs4`. The `count` is the number of values that appear in both respective columns. `total_x` and `total_y` are the numbers of non-missing values in the original columns listed under `x` and `y`. Similarly, `rate_x` is the rate of `x` values that also appear in `y`, and `rate_y` is the rate of `y` values that also appear in `x`. If there are no missing values, `total_x` is the same as `total_y`, and `rate_x` is the same as `rate_y`.

Here, `snout` and `tail` are the column pair with the most overlap: 2 out of 5 values are the same, a duplication rate of 0.4.
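The idea of pairwise overlap can be sketched in base R as follows. This is only an illustration under stated assumptions: the `toy` data frame and the `overlap` object are hypothetical, and counting the distinct values shared by two columns is an assumed rule that may differ from scrutiny's own handling, for instance when a value repeats within a column:

```{r}
# Hypothetical toy data, not part of scrutiny
toy <- data.frame(
  a = c("1.5", "2.0", "3.3", "4.1"),
  b = c("2.0", "9.9", "1.5", "7.2"),
  c = c("6.0", "5.5", "2.0", "8.4"),
  stringsAsFactors = FALSE
)

# All unordered pairs of column names, one pair per column of the matrix
pairs <- combn(names(toy), 2)

overlap <- data.frame(
  x = pairs[1, ],
  y = pairs[2, ],
  stringsAsFactors = FALSE
)

# Number of distinct values found in both columns of a pair
# (an assumed counting rule for this sketch)
overlap$count <- mapply(
  function(x, y) length(intersect(toy[[x]], toy[[y]])),
  overlap$x, overlap$y,
  USE.NAMES = FALSE
)

# Non-missing values in the `x` column, and the share of them that overlap
overlap$total_x <- vapply(
  overlap$x,
  function(x) sum(!is.na(toy[[x]])),
  integer(1),
  USE.NAMES = FALSE
)
overlap$rate_x <- overlap$count / overlap$total_x

overlap
```

Here, columns `a` and `b` share two of their four values, so their rate is 0.5. The other two pairs share only one value each.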

Again, you can call `audit()` for summary statistics:

```{r}
pigs4 %>% 
  duplicate_count_colpair() %>% 
  audit()
```

## Counting by observation with `duplicate_tally()`

Unlike the other two functions, `duplicate_tally()` largely preserves the structure of the original data frame. It only adds a column ending in `_n` next to each original column. The new columns count how often the values to their left appear in the data frame as a whole:

```{r}
pigs4 %>% 
  duplicate_tally()
```

In `snout`, for example, `4.22` appears twice, so its entries in `snout_n` are `2`. Likewise, `8.13` appears once in `snout` and once in `tail`, so both observations are marked `2` in their respective `_n` columns.
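The key point is that the counts refer to the whole data frame, not to a single column. A minimal base R sketch of this logic might look as follows. It is not scrutiny's implementation, and the `toy` data frame and other object names are hypothetical:

```{r}
# Hypothetical toy data, not part of scrutiny
toy <- data.frame(
  a = c("1.5", "2.0", "2.0"),
  b = c("7.7", "1.5", "3.1"),
  stringsAsFactors = FALSE
)

# Frequency of every value across the entire data frame
frequency <- table(unlist(toy, use.names = FALSE))

# Look up each cell's frequency, column by column
tally <- lapply(toy, function(col) as.integer(frequency[col]))
names(tally) <- paste0(names(toy), "_n")

# Interleave the original columns with their `_n` counterparts
tallied <- cbind(toy, as.data.frame(tally))
tallied <- tallied[, as.vector(rbind(names(toy), names(tally)))]
tallied
```

In this sketch, `"1.5"` gets a count of 2 in both `a_n` and `b_n` even though it appears only once per column, because it is counted across all columns.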

Following up `duplicate_tally()` with `audit()` shows summary statistics for each `_n` column. The last row summarizes all of these columns together:

```{r}
pigs4 %>% 
  duplicate_tally() %>% 
  audit()
```