Title: | Deduplication Across Multiple Columns |
---|---|
Description: | Duplicated data can exist in different rows and columns, and the user may need to treat observations (rows) connected by duplicated data as one observation, e.g. companies can belong to one family (and thus be one company) by sharing some telephone numbers. This package finds connected rows based on data in chosen columns and collapses them into one row. |
Authors: | Grzegorz Smoliński [aut, cre] |
Maintainer: | Grzegorz Smoliński <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.0 |
Built: | 2025-02-18 04:10:59 UTC |
Source: | https://github.com/gsmolinski/dedupewider |
Collapse many rows connected by duplicated data (which can exist in different rows and columns) into one, based on data in chosen columns, optionally putting non-consistent data into newly created additional columns.
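To make the idea of "rows connected by duplicated data" concrete, here is a minimal base R sketch that gives the same label to rows that share a value, directly or through a chain of other rows. The helper `find_groups` is hypothetical and written only for illustration; it is not the package's implementation and it assumes values can be compared after conversion to character.

```r
# Hypothetical helper: label rows that are connected by shared values.
find_groups <- function(x, cols) {
  n <- nrow(x)
  group <- seq_len(n)  # every row starts in its own group
  # Long format: one (row, value) pair per cell of the chosen columns.
  long <- data.frame(
    row = rep(seq_len(n), times = length(cols)),
    value = unlist(lapply(cols, function(col) as.character(x[[col]])),
                   use.names = FALSE),
    stringsAsFactors = FALSE
  )
  long <- long[!is.na(long$value), ]  # missing values never connect rows
  repeat {
    old <- group
    # Smallest group label among all rows holding each value.
    min_by_value <- tapply(group[long$row], long$value, min)
    candidate <- as.vector(min_by_value[long$value])
    # Smallest label reachable from each row through any of its values.
    new_by_row <- tapply(candidate, long$row, min)
    idx <- as.integer(names(new_by_row))
    group[idx] <- pmin(group[idx], as.vector(new_by_row))
    if (identical(group, old)) break  # labels stopped changing
  }
  group
}

x <- data.frame(tel_1 = c(111, 222, 444, 555),
                tel_2 = c(222, 666, 666, 555))
groups <- find_groups(x, c("tel_1", "tel_2"))
groups  # rows 1 to 3 share one label, row 4 keeps its own
```

Row 3 shares no number with row 1 directly, but both are linked through row 2, which is why the labels are propagated until they stop changing.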
dedupe_wide( x, cols_dedupe, cols_expand = NULL, max_new_cols = NULL, enable_drop = TRUE )
x |
A data.frame without a column named '....idx' and without any column whose name ends with four dots and a number (e.g. 'column....2'). |
cols_dedupe |
A character vector of length min. 2 of columns' names in |
cols_expand |
A character vector of columns' names in |
max_new_cols |
A numeric vector length 1 or |
enable_drop |
A logical vector length 1: should given column be dropped if (after deduplication) contains only missing data ( |
Columns passed to cols_dedupe must be atomic.
Row names will always be removed. If you want to preserve row names, put them into a separate column. Note that if this column is not passed to the cols_expand argument, only one row name will be preserved for each group of duplicated rows (the row name closest to the top of the table).
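A short sketch of the row-name advice above. The column name `row_name` and the sample data are assumptions made for illustration; the call to `dedupe_wide` uses only the arguments shown in the usage line and is skipped when the package is not installed.

```r
x <- data.frame(tel_1 = c(111, 222),
                tel_2 = c(222, 333),
                row.names = c("first", "second"))

# Copy the row names into an ordinary column before deduplicating.
x$row_name <- rownames(x)

# Passing the new column to cols_expand is meant to keep every row name,
# not just the one closest to the top of the table.
if (requireNamespace("dedupewider", quietly = TRUE)) {
  dedupewider::dedupe_wide(x,
                           cols_dedupe = c("tel_1", "tel_2"),
                           cols_expand = "row_name")
}
```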
Although duplicated or unique treat missing data (NA) as duplicated data, this function does not (see the second example below).
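For comparison, this is how base R handles repeated NA values; the vector `phones` is made up for illustration.

```r
phones <- c(777, NA, NA)

# Base R flags the second NA as a duplicate of the first one.
dup_flags <- duplicated(phones)
dup_flags        # FALSE FALSE TRUE

# unique() keeps a single NA for the same reason.
unique_phones <- unique(phones)
unique_phones    # 777 NA
```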
The type of the columns passed to cols_dedupe will be coerced to the most general type.
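As an illustration of what "most general type" means, the base R snippet below combines vectors of different types. It shows base R's usual coercion hierarchy; that `dedupe_wide` follows exactly the same rules internally is an assumption.

```r
# Integer combined with double becomes double ("numeric").
mixed_numbers <- c(111L, 222.5)
class(mixed_numbers)   # "numeric"

# Number combined with character becomes character.
mixed_text <- c(111, "222")
class(mixed_text)      # "character"
```

In practice this suggests that mixing, say, a numeric phone column with a character one is likely to leave all deduplicated columns as character.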
If duplicated data is found: a data.frame with changed column names and optionally additional columns (in some cases fewer columns, depending on the enable_drop argument). Otherwise: the data.frame without changes (except that row names are removed).
Internally, the function is mainly based on data.table functions, so enabling parallel computation is possible. To do this, call setDTthreads before calling dedupe_wide.
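A minimal sketch of setting the thread count, assuming the data.table package is installed. The value 2 is an arbitrary choice for illustration; the best setting depends on the machine.

```r
library(data.table)

# Ask data.table to use up to 2 threads for its parallel routines.
setDTthreads(2)

# Check how many threads data.table will actually use.
threads_in_use <- getDTthreads()
threads_in_use

# Any later call to dedupe_wide() in this session runs under this setting.
```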
x <- data.frame(tel_1 = c(111, 222, 444, 555),
                tel_2 = c(222, 666, 666, 555),
                name = paste0("name", 1:4))
# rows 1, 2, 3 share the same phone numbers
dedupe_wide(x, cols_dedupe = c("tel_1", "tel_2"), cols_expand = "name")
# first three collapsed into one, for name4 kept only one phone number (555)
# 'name1', 'name2', 'name3' kept in new columns

y <- data.frame(tel_1 = c(777, 888, NA, NA),
                tel_2 = c(888, 777, NA, NA),
                name = paste0("name", 5:8))
# rows 3 and 4 have only missing data
dedupe_wide(y, cols_dedupe = c("tel_1", "tel_2"), cols_expand = "name")
# first two rows collapsed into one, nothing changes for the rest of the rows
Move NA across columns or rows
For chosen columns, move NA to right or left (i.e. across columns) or to top or bottom (i.e. across rows).
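The effect of moving NA "to the right" can be sketched for a single row in base R. The helper `move_na_right` is hypothetical and only illustrates the idea; it is not how `na_move` is implemented.

```r
# Hypothetical helper: keep non-missing values in order, push NA to the end.
move_na_right <- function(values) {
  c(values[!is.na(values)], values[is.na(values)])
}

one_row <- c(NA, 5, 6)
shifted <- move_na_right(one_row)
shifted  # 5 6 NA
```

Moving to the left, top or bottom follows the same pattern, with the NA block placed first or with the operation applied down a column instead of along a row.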
na_move(data, cols = names(data), direction = "right")
data |
A data.frame without column named "....idx". |
cols |
A character vector of columns' names in |
direction |
A character vector of length 1 indicating where to move |
A data.frame that preserves only those attributes that the attributes function returns for the object passed to the data parameter.
The type of the columns passed to cols will be coerced to the most general type, although sometimes, when a column ends up containing only NA, that column will be of type logical.
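The logical type mentioned above matches base R's default: a bare NA is logical until it is combined with a value of another type. The snippet only illustrates that base R behaviour.

```r
# A vector holding nothing but NA is logical.
only_na <- c(NA, NA, NA)
class(only_na)   # "logical"

# As soon as a number is present, NA is coerced to that type.
with_number <- c(NA, 4)
class(with_number)   # "numeric"
```

If a fixed type is needed after the move, converting the column explicitly (for example with as.numeric) is one option.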
Internally, the function is mainly based on data.table functions, so enabling parallel computation is possible. To do this, call setDTthreads before calling na_move.
data <- data.frame(col1 = c(1, 2, 3),
                   col2 = c(NA, NA, 4),
                   col3 = c(5, NA, NA),
                   col4 = c(6, 7, 8))
data
na_move(data, c("col2", "col3", "col4"), direction = "right")