---
title: "Duplication analysis"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Duplication analysis}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(scrutiny)
```

You can use scrutiny to detect duplicate values in any dataset. Detecting duplicates can go a long way in assessing the reliability of published research.

This vignette walks you through scrutiny's tools for detecting, counting, and summarizing duplicates. It uses the `pigs4` dataset as a simple example:

```{r}
pigs4
```

## Frequency tabulation with `duplicate_count()`

A good first step is to create a frequency table. To do so, use `duplicate_count()`:

```{r}
pigs4 %>% 
  duplicate_count()
```

It returns a tibble (data frame) that lists each unique `value`. The tibble is ordered by the `frequency` of values in the input data frame, so the values that appear most often are at the top.

The `locations` are the names of all the columns in which a given value appears. They are counted by `locations_n`.

For example, `5.17` is the most frequent value in `pigs4`. It appears 3 times (`frequency`), namely in the `snout`, `tail`, and `wings` columns, so `locations_n` is also `3`. The next most frequent value is `4.22`, which appears twice. Both of these instances are in the `snout` column, so `locations_n` is `1`.

Run `audit()` after `duplicate_count()` to get summary statistics for the two numeric columns, `frequency` and `locations_n`:

```{r}
pigs4 %>% 
  duplicate_count() %>% 
  audit()
```

## Counting by column pair with `duplicate_count_colpair()`

Sometimes, a sequence of data may be repeated in multiple columns. `duplicate_count_colpair()` helps find such cases:

```{r}
pigs4 %>% 
  duplicate_count_colpair()
```

`x` and `y` represent all combinations of columns in `pigs4`. The `count` is the number of values that appear in both columns of a pair.
`total_x` and `total_y` are the numbers of non-missing values in the original columns listed under `x` and `y`. Similarly, `rate_x` is the rate of `x` values that also appear in `y`, and `rate_y` is the rate of `y` values that also appear in `x`. If there are no missing values, `total_x` is the same as `total_y`, and `rate_x` is the same as `rate_y`.

Here, `snout` and `tail` are the column pair with the most overlap: 2 out of 5 values are the same, a duplication rate of 0.4.

Again, you can call `audit()` for summary statistics:

```{r}
pigs4 %>% 
  duplicate_count_colpair() %>% 
  audit()
```

## Counting by observation with `duplicate_tally()`

Unlike the other two functions, `duplicate_tally()` largely preserves the structure of the original data frame. It only adds a column ending in `_n` next to each original column. The new columns count how often the values to their left appear in the data frame as a whole:

```{r}
pigs4 %>% 
  duplicate_tally()
```

In `snout`, for example, `4.22` appears twice, so its entries in `snout_n` are `2`. Likewise, `8.13` appears once in `snout` and once in `tail`, so both observations are marked `2` in the respective `_n` columns.

When you follow up `duplicate_tally()` with `audit()`, it shows summary statistics for each `_n` column. The last row summarizes all of these columns together:

```{r}
pigs4 %>% 
  duplicate_tally() %>% 
  audit()
```
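
To make the counting behind the `_n` columns more concrete, here is a minimal base R sketch. It is not how scrutiny implements `duplicate_tally()`; it only reproduces the idea of counting each value across the data frame as a whole. The `toy` data frame and its values are invented for illustration, and the values are stored as strings here to sidestep floating-point comparisons.

```r
# Toy data, invented for this illustration (not the actual `pigs4` dataset)
toy <- data.frame(
  snout = c("4.73", "8.13", "4.22", "4.22", "5.17"),
  tail  = c("6.88", "7.33", "5.17", "7.57", "8.13"),
  stringsAsFactors = FALSE
)

# Pool the values of all columns and count how often each one occurs overall
all_values <- unlist(toy, use.names = FALSE)
freq <- table(all_values)

# Look up the overall frequency of every cell, keeping the original shape
toy_n <- as.data.frame(lapply(toy, function(col) as.integer(freq[col])))
names(toy_n) <- paste0(names(toy), "_n")

toy_n
```

In this toy example, `8.13` gets a `2` in both `snout_n` and `tail_n` because it occurs once in each column, whereas `4.22` gets its `2` from two occurrences within `snout` alone.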