cdm_reader_mapper.duplicate_check
- cdm_reader_mapper.duplicate_check(data, method='SortedNeighbourhood', method_kwargs=None, compare_kwargs=None, table_name=None, ignore_columns=None, ignore_entries=None, offsets=None, reindex_by_null=True, null_label='null')
Run a duplicate check on a dataset using recordlinkage.
Returns a DupDetect object.
- Parameters:
data (pandas.DataFrame) – Dataset for duplicate check.
method (str) – Duplicate check method for recordlinkage. Default: SortedNeighbourhood
method_kwargs (dict, optional) – Keyword arguments for the recordlinkage duplicate check method. Default: _method_kwargs
compare_kwargs (dict, optional) – Keyword arguments for the recordlinkage.Compare object. Default: _compare_kwargs
table_name (str, optional) – Name of the CDM table to be selected from data.
ignore_columns (str or list, optional) – Name(s) of data columns to be ignored in the duplicate check.
ignore_entries (dict, optional) – Entries to be ignored in the duplicate check. Key: column name. Value: value to be ignored. E.g. ignore_entries={"station_speed": null}
offsets (dict, optional) – Change offsets for the recordlinkage Compare object. Key: column name. Value: new offset. E.g. offsets={"latitude": 0.1}
reindex_by_null (bool, optional) – If True, data is re-indexed in ascending order according to the number of nulls in each row.
null_label (str, optional) – Null label that is used if reindex_by_null is True.
- Return type:
cdm_reader_mapper.DupDetect
- Returns:
A DupDetect object.
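The default method, SortedNeighbourhood, and the offsets parameter can be pictured with a small standard-library sketch. This is not the implementation used by cdm_reader_mapper or recordlinkage. It only illustrates the general idea under two assumptions: candidate pairs are records that lie close together once sorted by a key column, and an offset acts as a per-column tolerance when two records are compared. The helper names candidate_pairs, is_duplicate and find_duplicates, the reach argument, and the sample records are all hypothetical.

```python
def candidate_pairs(records, key, reach=1):
    """Pair each record with the next `reach` records in key-sorted order."""
    # Positions of the records after sorting by the chosen key column.
    order = sorted(range(len(records)), key=lambda i: records[i][key])
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + 1 + reach]:
            pairs.add((min(i, j), max(i, j)))
    return pairs


def is_duplicate(a, b, offsets):
    """Treat two records as duplicates if every column agrees within its offset."""
    return all(abs(a[col] - b[col]) <= tol for col, tol in offsets.items())


def find_duplicates(records, key, offsets, reach=1):
    """Compare only the candidate pairs instead of every possible pair."""
    return sorted(
        (i, j)
        for i, j in candidate_pairs(records, key, reach)
        if is_duplicate(records[i], records[j], offsets)
    )


# Hypothetical sample data: records 0 and 2 differ by 0.05 degrees latitude.
records = [
    {"latitude": 54.00, "longitude": 7.90},
    {"latitude": 10.20, "longitude": -30.00},
    {"latitude": 54.05, "longitude": 7.90},
    {"latitude": 10.90, "longitude": -30.00},
]

duplicates = find_duplicates(
    records,
    key="latitude",
    offsets={"latitude": 0.1, "longitude": 0.1},
)
print(duplicates)
```

Sorting first keeps the number of comparisons small, because only neighbouring records are compared. Widening an offset in this sketch makes more neighbouring pairs count as duplicates.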