cdm_reader_mapper.duplicate_check

cdm_reader_mapper.duplicate_check(data, method='SortedNeighbourhood', method_kwargs=None, compare_kwargs=None, table_name=None, ignore_columns=None, ignore_entries=None, offsets=None, reindex_by_null=True, null_label='null')

Run a duplicate check on a dataset using recordlinkage.

Returns a DupDetect object.

Parameters:
  • data (pandas.DataFrame) – Dataset for duplicate check.

  • method (str) – Duplicate check method for recordlinkage. Default: SortedNeighbourhood

  • method_kwargs (dict, optional) – Keyword arguments for recordlinkage duplicate check. Default: _method_kwargs

  • compare_kwargs (dict, optional) – Keyword arguments for recordlinkage.Compare object. Default: _compare_kwargs

  • table_name (str, optional) – Name of the CDM table to be selected from data.

  • ignore_columns (str or list, optional) – Name(s) of the data columns to ignore in the duplicate check.

  • ignore_entries (dict, optional) – Entries to ignore in the duplicate check. Key: column name. Value: value to be ignored. E.g. ignore_entries={"station_speed": null}

  • offsets (dict, optional) – Change offsets for the recordlinkage Compare object. Key: column name. Value: new offset. E.g. offsets={"latitude": 0.1}

  • reindex_by_null (bool, optional) – If True, data is re-indexed in ascending order of the number of nulls in each row.

  • null_label (str, optional) – Null label which is used if reindex_by_null is True.
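
The ordering described for reindex_by_null can be illustrated with a minimal, standard-library sketch. This is not the library's implementation: the helper order_by_null_count and the sample rows are hypothetical, and the sketch assumes that a null is any field equal to the null_label string.

```python
def order_by_null_count(rows, null_label="null"):
    """Return row positions sorted ascending by the number of null fields.

    Hypothetical helper for illustration only; it assumes a null is any
    field whose value equals ``null_label``.
    """
    def null_count(row):
        return sum(1 for value in row.values() if value == null_label)

    # sorted() is stable, so rows with equal null counts keep their order.
    return sorted(range(len(rows)), key=lambda i: null_count(rows[i]))


# Made-up sample rows using column names mentioned in the parameter list.
rows = [
    {"latitude": "null", "station_speed": "null"},  # 2 nulls
    {"latitude": 54.1, "station_speed": 3.0},       # 0 nulls
    {"latitude": 54.1, "station_speed": "null"},    # 1 null
]

order = order_by_null_count(rows)
print(order)
```

In this sketch the most complete rows come first, which is one plausible reason to order by null count before comparing records.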

Return type:

DupDetect

Returns:

cdm_reader_mapper.DupDetect
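
A call might be assembled as in the sketch below, which only uses keyword names from the signature above. The wrapper run_duplicate_check is hypothetical, the values are illustrative, and quoting "null" as a string in ignore_entries is an assumption based on the default null_label. The import sits inside the function so that the keyword arguments can be inspected without the package installed.

```python
# Keyword arguments named exactly as in the documented signature.
# The values are illustrative assumptions, not recommended settings.
check_kwargs = {
    "method": "SortedNeighbourhood",
    "offsets": {"latitude": 0.1},
    "ignore_entries": {"station_speed": "null"},  # assumes nulls are the string "null"
    "reindex_by_null": True,
    "null_label": "null",
}


def run_duplicate_check(data):
    """Hypothetical wrapper: run the check on a pandas.DataFrame.

    Per the documentation above, the call returns a DupDetect object.
    """
    from cdm_reader_mapper import duplicate_check  # requires cdm_reader_mapper

    return duplicate_check(data, **check_kwargs)
```

Keeping the options in one dictionary makes it easy to reuse the same settings across several datasets.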