cdm_reader_mapper.duplicates package

Common Data Model (CDM) mapper package.

Submodules

cdm_reader_mapper.duplicates._duplicate_settings module

Settings for duplicate check.

class cdm_reader_mapper.duplicates._duplicate_settings.Compare(features=[], n_jobs=1, indexing_type='label', **kwargs)[source]

Bases: recordlinkage.base.BaseCompare

Class to compare record pairs efficiently.

Class to compare the attributes of candidate record pairs. The Compare class has methods like string, exact and numeric to initialise the comparing of the records. The compute method is used to start the actual comparing.

Example

Consider two historical datasets with census data to link. The datasets are named census_data_1980 and census_data_1990. The MultiIndex candidate_pairs contains the record pairs to compare. The record pairs are compared on the first name, last name, sex, date of birth, address, place, and income:

import recordlinkage

# initialise class
comp = recordlinkage.Compare()

# initialise similarity measurement algorithms
comp.string('first_name', 'name', method='jarowinkler')
comp.string('lastname', 'lastname', method='jarowinkler')
comp.exact('dateofbirth', 'dob')
comp.exact('sex', 'sex')
comp.string('address', 'address', method='levenshtein')
comp.exact('place', 'place')
comp.numeric('income', 'income')

# the method .compute() returns the DataFrame with the feature vectors.
comp.compute(candidate_pairs, census_data_1980, census_data_1990)
Parameters:
  • features (list) – List of compare algorithms.

  • n_jobs (integer, optional (default=1)) – The number of jobs to run in parallel for comparing of record pairs. If -1, then the number of jobs is set to the number of cores.

  • indexing_type (string, optional (default=‘label’)) – The indexing type. The MultiIndex is used to index the DataFrame(s). This can be done with pandas .loc or with .iloc. Use the value ‘label’ to make use of .loc and ‘position’ to make use of .iloc. The value ‘position’ is only available when the MultiIndex consists of integers. The value ‘position’ is much faster.

Variables:

features (list) – A list of algorithms to create features.

date(*args, **kwargs)[source]

Compare attributes of pairs with date algorithm.

Shortcut of recordlinkage.compare.Date:

from recordlinkage.compare import Date

indexer = recordlinkage.Compare()
indexer.add(Date())
date2(*args, **kwargs)

New method for rl.Compare object using Date2 object.

Parameters:
  • object (Compare) – Object to which the new method should be added.

  • *args (Any) – Positional arguments for Date2.

  • **kwargs (Any) – Keyword arguments for Date2.

Return type:

Compare

Returns:

Compare – Compare object with new method.

exact(*args, **kwargs)[source]

Compare attributes of pairs exactly.

Shortcut of recordlinkage.compare.Exact:

from recordlinkage.compare import Exact

indexer = recordlinkage.Compare()
indexer.add(Exact())
geo(*args, **kwargs)[source]

Compare attributes of pairs with geo algorithm.

Shortcut of recordlinkage.compare.Geographic:

from recordlinkage.compare import Geographic

indexer = recordlinkage.Compare()
indexer.add(Geographic())
numeric(*args, **kwargs)[source]

Compare attributes of pairs with numeric algorithm.

Shortcut of recordlinkage.compare.Numeric:

from recordlinkage.compare import Numeric

indexer = recordlinkage.Compare()
indexer.add(Numeric())
string(*args, **kwargs)[source]

Compare attributes of pairs with string algorithm.

Shortcut of recordlinkage.compare.String:

from recordlinkage.compare import String

indexer = recordlinkage.Compare()
indexer.add(String())

cdm_reader_mapper.duplicates.duplicates module

Common Data Model (CDM) pandas duplicate check.

class cdm_reader_mapper.duplicates.duplicates.Comparer(data, method, method_kwargs, compare_kwargs, pairs_df=None, convert_data=False)[source]

Bases: object

Wrapper around recordlinkage.Compare to compute pairwise comparisons on a DataFrame.

This class initializes a recordlinkage indexer and Compare object, optionally converting the data types before computing the comparisons.

Parameters:
  • data (pd.DataFrame) – The dataset to compare.

  • method (str) – The indexing method from recordlinkage.index, e.g., ‘SortedNeighbourhood’.

  • method_kwargs (dict) – Keyword arguments to pass to the indexing method.

  • compare_kwargs (dict) – Dictionary specifying columns and comparison methods for recordlinkage.Compare.

  • pairs_df (list[pd.DataFrame], optional) – Optional pre-split DataFrames to pass to the indexer. Defaults to [data].

  • convert_data (bool, default False) – Whether to convert data using compare_kwargs conversion dictionary.

class cdm_reader_mapper.duplicates.duplicates.DupDetect(data, compared, method, method_kwargs, compare_kwargs)[source]

Bases: object

Class to detect, flag, and remove duplicate entries in a DataFrame using a comparison matrix from recordlinkage.

Parameters:
  • data (pd.DataFrame) – Original dataset.

  • compared (pd.DataFrame) – Comparison matrix of the dataset.

  • method (str) – Duplicate detection method used for recordlinkage indexing.

  • method_kwargs (dict) – Keyword arguments for recordlinkage indexing method.

  • compare_kwargs (dict) – Keyword arguments used for recordlinkage.Compare.

flag_duplicates(keep='first', limit='default', equal_musts=None)[source]

Get result dataset with flagged duplicates.

Parameters:
  • keep (str or int, default: first) – Which entry should be kept in the result dataset.

  • limit (str, int or float, optional) – Threshold that the total score has to exceed for a pair to be declared a duplicate. Defaults to 0.991.

  • equal_musts (str or list, optional) – Column name(s) whose values must be exactly equal for a pair to be declared a duplicate. Default: all column names found in method_kwargs.

Return type:

DataFrame

Returns:

pd.DataFrame – Input DataFrame with flagged duplicates, including duplicate_status and quality_flag.

get_duplicates(keep='first', limit='default', equal_musts=None, overwrite=True)[source]

Identify duplicate matches based on the comparison matrix.

Parameters:
  • keep (str or int) – Which entry to keep: ‘first’, ‘last’, 0, or -1.

  • limit (str or float, optional, default: ‘default’) – Threshold that the total similarity score must exceed for a pair to be considered a duplicate.

  • equal_musts (str or list[str], optional) – Columns that must exactly match.

  • overwrite (bool, default: True) – Whether to recompute matches if already calculated.

Return type:

DataFrame

Returns:

pd.DataFrame – DataFrame containing matched duplicates.
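
How limit and equal_musts interact can be sketched in plain Python. This is a minimal illustration, not the package's implementation: it assumes the total score is the mean of the per-column scores, and the function name find_duplicate_pairs, the dictionary layout, and the column names are all hypothetical. Unlike the documented default, the sketch applies no equal_musts columns unless they are passed explicitly.

```python
def find_duplicate_pairs(compared, limit=0.991, equal_musts=None):
    """Return the index pairs that qualify as duplicates.

    ``compared`` maps (left_index, right_index) to a dict of per-column
    similarity scores in [0, 1]; it stands in for a comparison matrix.
    """
    if equal_musts is None:
        equal_musts = []
    elif isinstance(equal_musts, str):
        equal_musts = [equal_musts]

    matches = []
    for pair, scores in compared.items():
        # Assumption: the total score is the mean of all column scores.
        total = sum(scores.values()) / len(scores)
        # Columns listed in equal_musts must match exactly (score of 1.0).
        must_match = all(scores[column] == 1.0 for column in equal_musts)
        if total > limit and must_match:
            matches.append(pair)
    return matches


compared = {
    (0, 1): {"latitude": 1.0, "longitude": 1.0, "report_timestamp": 1.0},
    (0, 2): {"latitude": 1.0, "longitude": 0.5, "report_timestamp": 1.0},
    (1, 2): {"latitude": 0.995, "longitude": 1.0, "report_timestamp": 1.0},
}
loose = find_duplicate_pairs(compared)
strict = find_duplicate_pairs(compared, equal_musts="latitude")
```

In this toy data, the pair (1, 2) passes the score threshold but is rejected once latitude is required to match exactly.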

remove_duplicates(keep='first', limit='default', equal_musts=None)[source]

Remove duplicate entries from the dataset.

Parameters:
  • keep (str or int) – Which entry to keep (‘first’ or ‘last’).

  • limit (str or float, optional) – Minimum similarity score to declare duplicates.

  • equal_musts (str or list[str], optional) – Columns that must exactly match.

Return type:

DataFrame

Returns:

pd.DataFrame – Dataset without duplicates.

cdm_reader_mapper.duplicates.duplicates.add_duplicates(df, dups)[source]

Add duplicate information to the DataFrame based on the dups table.

Parameters:
  • df (pd.DataFrame) – DataFrame containing a ‘report_id’ column.

  • dups (pd.DataFrame) – DataFrame where the index corresponds to rows in df and the values are lists of duplicate indices or duplicate IDs.

Return type:

DataFrame

Returns:

pd.DataFrame – A new DataFrame with a ‘duplicates’ column containing duplicates as a sorted string list, e.g., “{ID1,ID2}”.

Notes

  • If a row has no duplicates, its ‘duplicates’ column is left unchanged.

  • Supports duplicates represented either by IDs (str) or by indices (int) of report_id.
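
The behaviour described in the notes above can be sketched without pandas. The following is an illustration under stated assumptions, not the package's code: rows are modelled as a list of dictionaries, dups as a plain dictionary keyed by row position, and the helper names are hypothetical.

```python
def format_duplicates(ids):
    """Join report IDs into a sorted, brace-wrapped string such as '{ID1,ID2}'."""
    return "{" + ",".join(sorted(ids)) + "}"


def add_duplicates_sketch(rows, dups):
    """Return copies of ``rows`` with 'duplicates' filled in where ``dups`` has an entry.

    ``dups`` maps a row position to a list of duplicates, given either as
    report IDs (str) or as row positions (int) that are looked up in ``rows``.
    """
    result = [dict(row) for row in rows]  # leave the input untouched
    for position, entries in dups.items():
        ids = [
            rows[entry]["report_id"] if isinstance(entry, int) else entry
            for entry in entries
        ]
        result[position]["duplicates"] = format_duplicates(ids)
    return result


rows = [
    {"report_id": "ID1", "duplicates": "null"},
    {"report_id": "ID2", "duplicates": "null"},
    {"report_id": "ID3", "duplicates": "null"},
]
updated = add_duplicates_sketch(rows, {0: [2, 1], 1: ["ID1"]})
```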

cdm_reader_mapper.duplicates.duplicates.add_history(df, indexes)[source]

Append duplicate information to the ‘history’ column of a DataFrame.

Parameters:
  • df (pd.DataFrame) – The DataFrame containing a ‘history’ column.

  • indexes (list[int] or pd.Index) – Row indexes where history should be updated.

Return type:

DataFrame

Returns:

pd.DataFrame – A new DataFrame with updated ‘history’ column for the selected rows.

Notes

  • If ‘history’ column does not exist, it will be created with empty strings.

  • Each message is prefixed with a UTC timestamp in “YYYY-MM-DD HH:MM:SS” format.
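
A minimal sketch of the timestamp prefix and the column creation described in the notes, using only the standard library. The message text, the separators between timestamp, message, and earlier history, and the function names are assumptions made for illustration; only the UTC “YYYY-MM-DD HH:MM:SS” format comes from the notes.

```python
from datetime import datetime, timezone


def history_stamp(message, now=None):
    """Prefix ``message`` with a UTC timestamp in 'YYYY-MM-DD HH:MM:SS' format."""
    if now is None:
        now = datetime.now(timezone.utc)
    return f"{now.strftime('%Y-%m-%d %H:%M:%S')}. {message}"


def add_history_sketch(rows, indexes, message="duplicate status checked", now=None):
    """Return copies of ``rows`` with the stamped message appended at ``indexes``."""
    stamp = history_stamp(message, now)
    result = [dict(row) for row in rows]
    for row in result:
        row.setdefault("history", "")  # create the column if it is missing
    for index in indexes:
        previous = result[index]["history"]
        result[index]["history"] = f"{previous}; {stamp}" if previous else stamp
    return result


fixed = datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
rows = [{"report_id": "ID1"}, {"report_id": "ID2", "history": "created"}]
updated = add_history_sketch(rows, [1], now=fixed)
```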

cdm_reader_mapper.duplicates.duplicates.add_report_quality(df, indexes_bad)[source]

Update the ‘report_quality’ column in a DataFrame for bad reports.

Parameters:
  • df (pd.DataFrame) – DataFrame containing at least a ‘report_quality’ column.

  • indexes_bad (iterable of int) – Row indices in the DataFrame to mark as bad quality (value=1).

Return type:

DataFrame

Returns:

pd.DataFrame – DataFrame with updated ‘report_quality’ column.
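
As a rough pure-Python analogue of this update (rows as a list of dictionaries instead of a DataFrame, with a hypothetical function name), the marking step could look like this:

```python
def add_report_quality_sketch(rows, indexes_bad):
    """Return copies of ``rows`` with 'report_quality' set to 1 at ``indexes_bad``."""
    result = [dict(row) for row in rows]  # copy so the input stays unchanged
    for index in indexes_bad:
        result[index]["report_quality"] = 1  # 1 marks a bad report
    return result


rows = [
    {"report_id": "ID1", "report_quality": 2},
    {"report_id": "ID2", "report_quality": 2},
]
updated = add_report_quality_sketch(rows, [1])
```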

cdm_reader_mapper.duplicates.duplicates.change_offsets(dic, dic_o)[source]

Update the ‘offset’ value in compare dictionary kwargs.

Parameters:
  • dic (dict) – Original compare dictionary.

  • dic_o (dict) – Dictionary mapping column names to new offsets.

Return type:

dict[Any, Any]

Returns:

dict – Updated compare dictionary with modified offsets.
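
The following sketch illustrates the offset update on a compare dictionary shaped like {"column": {"method": ..., "kwargs": {...}}}. It is an assumption-laden illustration rather than the package's code: the function name is hypothetical, and skipping columns that are absent from dic is a choice made for this sketch.

```python
from copy import deepcopy


def change_offsets_sketch(dic, dic_o):
    """Return a copy of ``dic`` with the 'offset' kwarg replaced per column."""
    updated = deepcopy(dic)  # do not mutate the caller's dictionary
    for column, offset in dic_o.items():
        if column not in updated:
            continue  # assumption: unknown columns are ignored
        updated[column].setdefault("kwargs", {})["offset"] = offset
    return updated


compare = {
    "latitude": {"method": "numeric", "kwargs": {"method": "gauss", "offset": 0.05}},
    "report_id": {"method": "exact"},
}
updated = change_offsets_sketch(compare, {"latitude": 0.1, "longitude": 0.2})
```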

cdm_reader_mapper.duplicates.duplicates.convert_series(df, conversion)[source]

Convert data types in a DataFrame.

Parameters:
  • df (pd.DataFrame) – Input DataFrame.

  • conversion (dict) – Conversion dictionary containing column names and new data types as key-value pairs.

Return type:

DataFrame

Returns:

pd.DataFrame – DataFrame with converted data types.

cdm_reader_mapper.duplicates.duplicates.duplicate_check(data, method='SortedNeighbourhood', method_kwargs=None, compare_kwargs=None, table_name=None, ignore_columns=None, ignore_entries=None, offsets=None, reindex_by_null=True, null_label='null')[source]

Run a duplicate check on a dataset using recordlinkage.

Returns a DupDetect object.

Parameters:
  • data (pandas.DataFrame) – Dataset for duplicate check.

  • method (str, default: SortedNeighbourhood) – Duplicate check method for recordlinkage.

  • method_kwargs (dict, optional) – Keyword arguments for recordlinkage duplicate check. Defaults to _method_kwargs.

  • compare_kwargs (dict, optional) – Keyword arguments for recordlinkage.Compare object. Defaults to _compare_kwargs.

  • table_name (str, optional) – Name of the CDM table to be selected from data.

  • ignore_columns (str or list, optional) – Name of data columns to be ignored for duplicate check.

  • ignore_entries (dict, optional) – Key: Column name. Value: value to be ignored. E.g. ignore_entries={“station_speed”: null}.

  • offsets (dict, optional) – Change offsets for recordlinkage Compare object. Key: Column name. Value: new offset. E.g. offsets={“latitude”: 0.1}.

  • reindex_by_null (bool, optional) – If True, data is re-indexed in ascending order according to the number of nulls in each row.

  • null_label (str, optional) – Null label which is used if reindex_by_null is True.

Return type:

DupDetect

Returns:

cdm_reader_mapper.DupDetect – A DupDetect instance.
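
To make the shapes of the dictionary arguments concrete, here is a hedged sketch of how a call might be assembled. The column names and values are hypothetical and are not the package defaults; the compare dictionary follows the layout documented for set_comparer, and the quoted "null" in ignore_entries assumes that nulls are stored as strings. The call itself is left as a comment because it needs the package and a CDM table loaded as a DataFrame.

```python
# Hypothetical keyword arguments for the recordlinkage indexing method.
method_kwargs = {
    "left_on": "report_timestamp",
    "window": 5,
    "block_on": ["primary_station_id"],
}

# Hypothetical comparison settings: one entry per column to compare.
compare_kwargs = {
    "primary_station_id": {"method": "exact"},
    "latitude": {"method": "numeric", "kwargs": {"method": "gauss", "offset": 0.05}},
    "report_timestamp": {"method": "date2"},
}

arguments = {
    "method": "SortedNeighbourhood",
    "method_kwargs": method_kwargs,
    "compare_kwargs": compare_kwargs,
    "ignore_entries": {"station_speed": "null"},
    "offsets": {"latitude": 0.1},
    "reindex_by_null": True,
    "null_label": "null",
}

# With the package installed and `data` holding a CDM table:
# from cdm_reader_mapper.duplicates.duplicates import duplicate_check
# detector = duplicate_check(data, **arguments)
# flagged = detector.flag_duplicates(keep="first")
```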

cdm_reader_mapper.duplicates.duplicates.reindex_nulls(df, null_label)[source]

Reindex a DataFrame in ascending order based on the number of ‘null’ strings in each row.

Parameters:
  • df (pd.DataFrame) – Input DataFrame. Cells with the string “null” are counted as nulls.

  • null_label (Any) – Label that represents a missing value.

Return type:

DataFrame

Returns:

pd.DataFrame – DataFrame reindexed so that rows with fewer ‘null’ values appear first. Original row order is preserved for rows with the same null count.
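
The ordering rule can be illustrated with a stable sort in plain Python. This is a sketch, not the package's implementation: rows are modelled as a list of dictionaries, and the function name and sample columns are hypothetical.

```python
def reindex_nulls_sketch(rows, null_label="null"):
    """Order rows by ascending count of ``null_label`` cells.

    ``sorted`` is stable, so rows with the same null count keep their
    original relative order.
    """
    def null_count(row):
        return sum(1 for value in row.values() if value == null_label)

    return sorted(rows, key=null_count)


rows = [
    {"report_id": "A", "latitude": "null", "longitude": "null"},  # 2 nulls
    {"report_id": "B", "latitude": "10.0", "longitude": "null"},  # 1 null
    {"report_id": "C", "latitude": "10.0", "longitude": "20.0"},  # 0 nulls
    {"report_id": "D", "latitude": "null", "longitude": "5.0"},   # 1 null
]
ordered = reindex_nulls_sketch(rows)
```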

cdm_reader_mapper.duplicates.duplicates.remove_ignores(dic, columns)[source]

Remove dictionary entries where keys or values match ignored columns.

Parameters:
  • dic (dict) – Original dictionary to filter.

  • columns (str or list[str]) – Column(s) to ignore.

Return type:

dict[Any, Any]

Returns:

dict – Filtered dictionary without the ignored columns.
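
A minimal sketch of this filtering, assuming an entry is dropped when its key or its whole value equals an ignored column name. The function name and the sample dictionary are hypothetical, and the real function may treat nested or list values differently.

```python
def remove_ignores_sketch(dic, columns):
    """Return a copy of ``dic`` without entries whose key or value is ignored."""
    if isinstance(columns, str):
        columns = [columns]
    return {
        key: value
        for key, value in dic.items()
        if key not in columns and value not in columns
    }


settings = {
    "latitude": {"method": "numeric"},
    "left_on": "report_timestamp",
    "window": 5,
}
filtered_many = remove_ignores_sketch(settings, ["latitude", "report_timestamp"])
filtered_one = remove_ignores_sketch(settings, "latitude")
```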

cdm_reader_mapper.duplicates.duplicates.set_comparer(compare_dict)[source]

Build a recordlinkage Compare object with optional conversion dictionary.

Parameters:

compare_dict (dict) – Dictionary of columns to compare, e.g. {“column_name”: {“method”: “exact” | “numeric” | “date2”, “kwargs”: {…}}}.

Return type:

Compare

Returns:

recordlinkage.Compare – Compare object with added comparison methods and a ‘conversion’ attribute.
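
One plausible way to read compare_dict is as a dispatch table: each column names a comparison method and optional keyword arguments. The sketch below shows that idea with a hypothetical recording class in place of a real Compare object, so it runs without recordlinkage. It assumes each column is compared against itself and does not model the ‘conversion’ attribute.

```python
class RecordingCompare:
    """Hypothetical stand-in for a Compare object; it only records requests."""

    def __init__(self):
        self.calls = []

    def _add(self, name, left, right, kwargs):
        self.calls.append((name, left, right, kwargs))

    def exact(self, left, right, **kwargs):
        self._add("exact", left, right, kwargs)

    def numeric(self, left, right, **kwargs):
        self._add("numeric", left, right, kwargs)

    def date2(self, left, right, **kwargs):
        self._add("date2", left, right, kwargs)


def set_comparer_sketch(compare_dict, comparer=None):
    """Register one comparison per column on ``comparer`` and return it."""
    if comparer is None:
        comparer = RecordingCompare()
    for column, spec in compare_dict.items():
        register = getattr(comparer, spec["method"])  # e.g. comparer.exact
        register(column, column, **spec.get("kwargs", {}))
    return comparer


comparer = set_comparer_sketch({
    "primary_station_id": {"method": "exact"},
    "latitude": {"method": "numeric", "kwargs": {"method": "gauss", "offset": 0.05}},
})
```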