cdm_reader_mapper.duplicate_check

cdm_reader_mapper.duplicate_check(data, method='SortedNeighbourhood', method_kwargs=None, compare_kwargs=None, table_name=None, ignore_columns=None, ignore_entries=None, offsets=None, reindex_by_null=True, null_label='null')

Run a duplicate check on a dataset using recordlinkage.

Returns a DupDetect object.

Parameters:

data (pandas.DataFrame) – Dataset for duplicate check.
method (str) – Duplicate check method for recordlinkage. Default: SortedNeighbourhood
method_kwargs (dict, optional) – Keyword arguments for recordlinkage duplicate check. Default: _method_kwargs
compare_kwargs (dict, optional) – Keyword arguments for recordlinkage.Compare object. Default: _compare_kwargs
table_name (str, optional) – Name of the CDM table to be selected from data.
ignore_columns (str or list, optional) – Name of data columns to be ignored for duplicate check.
ignore_entries (dict, optional) – Entries to be ignored in the duplicate check. Key: column name. Value: value to be ignored. E.g. ignore_entries={"station_speed": null}
offsets (dict, optional) – Change offsets for the recordlinkage Compare object. Key: column name. Value: new offset. E.g. offsets={"latitude": 0.1}
reindex_by_null (bool, optional) – If True, data is re-indexed in ascending order according to the number of nulls in each row.
null_label (str, optional) – Null label which is used if reindex_by_null is True.
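
The ordering described for reindex_by_null can be illustrated with a minimal plain-Python sketch. This is not the library's implementation: the helper names count_nulls and reindex_by_null_count are hypothetical, and rows are modelled as simple dicts rather than a pandas.DataFrame. It only shows what "ascending order according to the number of nulls in each row" means when nulls are marked with null_label.

```python
def count_nulls(row, null_label="null"):
    """Count how many values in a row equal the null label."""
    return sum(1 for value in row.values() if value == null_label)


def reindex_by_null_count(rows, null_label="null"):
    """Return rows sorted so that rows with the fewest nulls come first.

    Hypothetical helper for illustration only; sorted() is stable, so rows
    with the same number of nulls keep their original relative order.
    """
    return sorted(rows, key=lambda row: count_nulls(row, null_label))


# Three illustrative records with 2, 0 and 1 null entries respectively.
rows = [
    {"latitude": "null", "station_speed": "null"},
    {"latitude": 52.1, "station_speed": 3},
    {"latitude": "null", "station_speed": 5},
]

ordered = reindex_by_null_count(rows)
null_counts = [count_nulls(row) for row in ordered]
```

After sorting, the most complete record comes first and the record with the most null entries comes last.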

Return type:

DupDetect

Returns:

cdm_reader_mapper.DupDetect
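
A call using the documented signature might look like the sketch below. The wrapper names build_duplicate_check_kwargs and run_duplicate_check are hypothetical and exist only for this example. The keyword names and the offsets value are taken from the parameter list above. The package is imported inside the function, so the snippet can be defined without cdm_reader_mapper installed. It assumes data is a pandas.DataFrame in a layout the function accepts.

```python
def build_duplicate_check_kwargs(latitude_offset=0.1):
    """Collect keyword arguments for duplicate_check (hypothetical helper)."""
    return {
        "method": "SortedNeighbourhood",           # documented default
        "offsets": {"latitude": latitude_offset},  # example from the docs
        "reindex_by_null": True,                   # documented default
        "null_label": "null",                      # documented default
    }


def run_duplicate_check(data, **overrides):
    """Run the duplicate check on `data` and return the DupDetect object.

    Hypothetical wrapper: `overrides` replace the illustrative defaults above.
    """
    import cdm_reader_mapper  # imported lazily; requires the package

    kwargs = build_duplicate_check_kwargs()
    kwargs.update(overrides)
    return cdm_reader_mapper.duplicate_check(data, **kwargs)


kwargs = build_duplicate_check_kwargs()
```

With the package installed, run_duplicate_check(df) would return the DupDetect object described under "Returns". What to do with that object is outside the scope of this page.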
