cdm_reader_mapper.DupDetect.get_duplicates

DupDetect.get_duplicates(keep='first', limit='default', equal_musts=None, overwrite=True)

Identify duplicate matches based on the comparison matrix.

Parameters:
  • keep (str or int) – Which entry of a duplicate match to keep: ‘first’ or ‘last’ (as a string), or -1 or 0 (as an integer).

  • limit (str or float, optional) – Threshold on the total similarity score for treating a match as a duplicate.

  • equal_musts (str or list[str], optional) – Columns that must exactly match.

  • overwrite (bool) – Whether to recompute the matches if they have already been calculated.

Return type:

DataFrame

Returns:

pd.DataFrame – DataFrame containing matched duplicates.
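
The parameters above can be illustrated with a small, library-independent sketch. It is not the cdm_reader_mapper implementation. The function name find_duplicates, the dictionary-based comparison matrix, the use of the mean as the total score, the inclusive threshold test, and the column names are all assumptions made for illustration.

```python
from statistics import mean


def find_duplicates(comparison, limit=0.9, equal_musts=None, keep="first"):
    """Return (kept, dropped, total_score) tuples for pairs judged duplicates.

    comparison maps an index pair (i, j) to per-column similarity scores
    in [0, 1]. This layout is a hypothetical stand-in for a comparison matrix.
    """
    if isinstance(equal_musts, str):
        equal_musts = [equal_musts]
    equal_musts = equal_musts or []
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")

    matches = []
    for (i, j), scores in sorted(comparison.items()):
        # Columns listed in equal_musts must match exactly (score of 1.0).
        if any(scores[col] != 1.0 for col in equal_musts):
            continue
        # Assumption: the total score is the mean of the column scores,
        # and a pair qualifies when the total reaches the limit.
        total = mean(scores.values())
        if total < limit:
            continue
        first, last = min(i, j), max(i, j)
        kept, dropped = (first, last) if keep == "first" else (last, first)
        matches.append((kept, dropped, total))
    return matches


# Hypothetical similarity scores for three records compared pairwise.
comparison = {
    (0, 1): {"latitude": 1.0, "longitude": 1.0, "report_timestamp": 1.0},
    (0, 2): {"latitude": 1.0, "longitude": 0.5, "report_timestamp": 1.0},
    (1, 2): {"latitude": 1.0, "longitude": 1.0, "report_timestamp": 0.8},
}

strict = find_duplicates(comparison, limit=0.9, equal_musts="report_timestamp")
loose = find_duplicates(comparison, limit=0.9, keep="last")
```

In this sketch, pair (1, 2) clears the threshold on its own but is rejected once report_timestamp is required to match exactly, which is the kind of filtering that equal_musts describes. Switching keep from ‘first’ to ‘last’ only changes which record of each matched pair is retained.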