cdm_reader_mapper.duplicates package¶
Common Data Model (CDM) mapper package.
Submodules¶
cdm_reader_mapper.duplicates._duplicate_settings module¶
Settings for duplicate check.
- class cdm_reader_mapper.duplicates._duplicate_settings.Compare(features=[], n_jobs=1, indexing_type='label', **kwargs)[source]¶
Bases: recordlinkage.base.BaseCompareClass
Class to compare record pairs efficiently.
Class to compare the attributes of candidate record pairs. The Compare class has methods like string, exact and numeric to initialise the comparing of the records. The compute method is used to start the actual comparing.
Example
Consider two historical datasets with census data to link. The datasets are named census_data_1980 and census_data_1990. The MultiIndex candidate_pairs contains the record pairs to compare. The record pairs are compared on the first name, last name, sex, date of birth, address, place, and income:
```python
import recordlinkage

# initialise class
comp = recordlinkage.Compare()

# initialise similarity measurement algorithms
comp.string('first_name', 'name', method='jarowinkler')
comp.string('lastname', 'lastname', method='jarowinkler')
comp.exact('dateofbirth', 'dob')
comp.exact('sex', 'sex')
comp.string('address', 'address', method='levenshtein')
comp.exact('place', 'place')
comp.numeric('income', 'income')

# the method .compute() returns the DataFrame with the feature vectors.
comp.compute(candidate_pairs, census_data_1980, census_data_1990)
```
- Parameters:
features (list) – List of compare algorithms.
n_jobs (integer, optional (default=1)) – The number of jobs to run in parallel for comparing of record pairs. If -1, then the number of jobs is set to the number of cores.
indexing_type (string, optional (default=‘label’)) – The indexing type. The MultiIndex is used to index the DataFrame(s). This can be done with pandas .loc or with .iloc. Use the value ‘label’ to make use of .loc and ‘position’ to make use of .iloc. The value ‘position’ is only available when the MultiIndex consists of integers. The value ‘position’ is much faster.
- Variables:
features (list) – A list of algorithms to create features.
- date(*args, **kwargs)[source]¶
Compare attributes of pairs with date algorithm.
Shortcut of recordlinkage.compare.Date:
```python
import recordlinkage
from recordlinkage.compare import Date

indexer = recordlinkage.Compare()
indexer.add(Date())  # constructor arguments (column labels) omitted here
```
- date2(*args, **kwargs)¶
New method for rl.Compare object using Date2 object.
- exact(*args, **kwargs)[source]¶
Compare attributes of pairs exactly.
Shortcut of recordlinkage.compare.Exact:
```python
import recordlinkage
from recordlinkage.compare import Exact

indexer = recordlinkage.Compare()
indexer.add(Exact())  # constructor arguments (column labels) omitted here
```
- geo(*args, **kwargs)[source]¶
Compare attributes of pairs with geo algorithm.
Shortcut of recordlinkage.compare.Geographic:
```python
import recordlinkage
from recordlinkage.compare import Geographic

indexer = recordlinkage.Compare()
indexer.add(Geographic())  # constructor arguments (column labels) omitted here
```
cdm_reader_mapper.duplicates.duplicates module¶
Common Data Model (CDM) pandas duplicate check.
- class cdm_reader_mapper.duplicates.duplicates.Comparer(data, method, method_kwargs, compare_kwargs, pairs_df=None, convert_data=False)[source]¶
Bases: object
Wrapper around recordlinkage.Compare to compute pairwise comparisons on a DataFrame.
This class initializes a recordlinkage indexer and Compare object, optionally converting the data types before computing the comparisons.
- Parameters:
data (pd.DataFrame) – The dataset to compare.
method (str) – The indexing method from recordlinkage.index, e.g., ‘SortedNeighbourhood’.
method_kwargs (dict) – Keyword arguments to pass to the indexing method.
compare_kwargs (dict) – Dictionary specifying columns and comparison methods for recordlinkage.Compare.
pairs_df (list[pd.DataFrame], optional) – Optional pre-split DataFrames to pass to the indexer. Defaults to [data].
convert_data (bool, default False) – Whether to convert data using the conversion dictionary in compare_kwargs.
- class cdm_reader_mapper.duplicates.duplicates.DupDetect(data, compared, method, method_kwargs, compare_kwargs)[source]¶
Bases: object
Class to detect, flag, and remove duplicate entries in a DataFrame using a comparison matrix from recordlinkage.
- Parameters:
data (pd.DataFrame) – Original dataset.
compared (pd.DataFrame) – Comparison matrix of the dataset.
method (str) – Duplicate detection method used for recordlinkage indexing.
method_kwargs (dict) – Keyword arguments for recordlinkage indexing method.
compare_kwargs (dict) – Keyword arguments used for recordlinkage.Compare.
- flag_duplicates(keep='first', limit='default', equal_musts=None)[source]¶
Get result dataset with flagged duplicates.
- Parameters:
keep (str or int, default: first) – Which entry should be kept in the result dataset.
limit (str, int or float, optional) – Total score that has to be exceeded for a pair to be declared a duplicate. Defaults to 0.991.
equal_musts (str or list, optional) – Column name(s) whose values must be exactly equal for a pair to be declared a duplicate. Default: All column names found in method_kwargs.
- Returns:
pd.DataFrame – Input DataFrame with flagged duplicates, including duplicate_status and quality_flag.
- get_duplicates(keep='first', limit='default', equal_musts=None, overwrite=True)[source]¶
Identify duplicate matches based on the comparison matrix.
- Parameters:
keep (str or int) – Which entry to keep: ‘first’, ‘last’, -1, or 0.
limit (str or float, optional, default: ‘default’) – Threshold of total similarity score to consider as duplicate.
equal_musts (str or list[str], optional) – Columns that must exactly match.
overwrite (bool, default: True) – Whether to recompute matches if already calculated.
- Returns:
pd.DataFrame – DataFrame containing matched duplicates.
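The interplay of limit and equal_musts can be illustrated with a minimal plain-Python sketch. It is not the library's implementation: it assumes that the total score is the mean of the per-column similarity scores, and the function name is_duplicate and the column name station_id are hypothetical.

```python
def is_duplicate(scores, left, right, limit=0.991, equal_musts=()):
    """Return True if a candidate pair counts as a duplicate in this sketch."""
    # Assumption: the total score is the mean of the per-column scores.
    total = sum(scores.values()) / len(scores)
    # Every column listed in equal_musts has to match exactly.
    musts_equal = all(left[col] == right[col] for col in equal_musts)
    # The total score has to exceed the limit.
    return total > limit and musts_equal


# Hypothetical pair of reports and their per-column similarity scores.
left = {"station_id": "SHIP1", "latitude": 50.00}
right = {"station_id": "SHIP1", "latitude": 50.05}
good_scores = {"station_id": 1.0, "latitude": 1.0}
poor_scores = {"station_id": 1.0, "latitude": 0.9}

print(is_duplicate(good_scores, left, right, equal_musts=["station_id"]))
print(is_duplicate(poor_scores, left, right, equal_musts=["station_id"]))
```

With the default limit, the first pair is accepted because its mean score of 1.0 exceeds 0.991, while the second is rejected because its mean score is 0.95.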
- cdm_reader_mapper.duplicates.duplicates.add_duplicates(df, dups)[source]¶
Add duplicate information to the DataFrame based on the dups table.
- Parameters:
df (pd.DataFrame) – DataFrame containing a ‘report_id’ column.
dups (pd.DataFrame) – DataFrame where the index corresponds to rows in df and the values are lists of duplicate indices or duplicate IDs.
- Returns:
pd.DataFrame – A new DataFrame with a ‘duplicates’ column containing duplicates as a sorted string list, e.g., “{ID1,ID2}”.
Notes
If a row has no duplicates, its ‘duplicates’ column is left unchanged.
Supports duplicates represented either by IDs (str) or by indices (int) of report_id.
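The string format described above can be sketched in plain Python as follows. The helper format_duplicates and the sample IDs are hypothetical, and the sketch assumes that integer entries are positions in the report_id column.

```python
def format_duplicates(dups, report_ids):
    """Format duplicates as a sorted, brace-wrapped string list."""
    # Assumption: integers are positions in report_ids, strings are IDs already.
    ids = [report_ids[d] if isinstance(d, int) else d for d in dups]
    return "{" + ",".join(sorted(ids)) + "}"


report_ids = ["ID0", "ID1", "ID2"]  # hypothetical report_id column
print(format_duplicates(["ID2", "ID1"], report_ids))  # duplicates given as IDs
print(format_duplicates([2, 1], report_ids))          # duplicates given as indices
```

Both calls yield the same string, “{ID1,ID2}”.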
- cdm_reader_mapper.duplicates.duplicates.add_history(df, indexes)[source]¶
Append duplicate information to the ‘history’ column of a DataFrame.
- Parameters:
df (pd.DataFrame) – The DataFrame containing a ‘history’ column.
indexes (list[int] or pd.Index) – Row indexes where history should be updated.
- Returns:
pd.DataFrame – A new DataFrame with updated ‘history’ column for the selected rows.
Notes
If ‘history’ column does not exist, it will be created with empty strings.
Each message is prefixed with a UTC timestamp in “YYYY-MM-DD HH:MM:SS” format.
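A minimal sketch of the timestamp prefix, using only the standard library. The message text, the separator between entries, and the helper name append_history are assumptions made for illustration; only the UTC “YYYY-MM-DD HH:MM:SS” format is taken from the notes above.

```python
from datetime import datetime, timezone


def append_history(history, message, now=None):
    """Append a UTC-timestamped message to an existing history string."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y-%m-%d %H:%M:%S")  # "YYYY-MM-DD HH:MM:SS"
    entry = f"{stamp}. {message}"
    # Assumption: entries are joined with "; "; an empty history starts fresh.
    return f"{history}; {entry}" if history else entry


fixed = datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
print(append_history("", "Duplicate check", now=fixed))
```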
- cdm_reader_mapper.duplicates.duplicates.add_report_quality(df, indexes_bad)[source]¶
Update the ‘report_quality’ column in a DataFrame for bad reports.
- cdm_reader_mapper.duplicates.duplicates.change_offsets(dic, dic_o)[source]¶
Update the ‘offset’ value in compare dictionary kwargs.
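The kind of update described here might look like the following sketch. The nested layout of the compare dictionary (a per-column entry with a "kwargs" sub-dictionary holding "offset") is an assumption, as is the helper name with_offsets; the actual structure used by the package may differ.

```python
import copy


def with_offsets(compare_kwargs, offsets):
    """Return a copy of compare_kwargs with new 'offset' values per column."""
    updated = copy.deepcopy(compare_kwargs)  # leave the input untouched
    for column, offset in offsets.items():
        if column in updated:  # columns without an entry are skipped
            updated[column].setdefault("kwargs", {})["offset"] = offset
    return updated


# Hypothetical compare dictionary with an assumed layout.
base = {"latitude": {"method": "numeric", "kwargs": {"method": "step", "offset": 0.11}}}
changed = with_offsets(base, {"latitude": 0.1})
print(changed["latitude"]["kwargs"]["offset"])
```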
- cdm_reader_mapper.duplicates.duplicates.convert_series(df, conversion)[source]¶
Convert data types in Dataframe.
- cdm_reader_mapper.duplicates.duplicates.duplicate_check(data, method='SortedNeighbourhood', method_kwargs=None, compare_kwargs=None, table_name=None, ignore_columns=None, ignore_entries=None, offsets=None, reindex_by_null=True, null_label='null')[source]¶
Run a duplicate check on a dataset using recordlinkage.
Returns a DupDetect object.
- Parameters:
data (pandas.DataFrame) – Dataset for duplicate check.
method (str, default: SortedNeighbourhood) – Duplicate check method for recordlinkage.
method_kwargs (dict, optional) – Keyword arguments for recordlinkage duplicate check. Defaults to _method_kwargs.
compare_kwargs (dict, optional) – Keyword arguments for recordlinkage.Compare object. Defaults to _compare_kwargs.
table_name (str, optional) – Name of the CDM table to be selected from data.
ignore_columns (str or list, optional) – Name of data columns to be ignored for duplicate check.
ignore_entries (dict, optional) – Key: Column name. Value: value to be ignored. E.g. ignore_entries={“station_speed”: “null”}.
offsets (dict, optional) – Change offsets for recordlinkage Compare object. Key: Column name. Value: new offset. E.g. offsets={“latitude”: 0.1}.
reindex_by_null (bool, optional) – If True, data is re-indexed in ascending order according to the number of nulls in each row.
null_label (str, optional) – Null label which is used if reindex_by_null is True.
- Returns:
cdm_reader_mapper.DupDetect – A DupDetect instance.
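A typical call sequence, based only on the signatures documented on this page, might look like the sketch below. The wrapper run_duplicate_check is hypothetical, and it assumes that data is a DataFrame containing the columns expected by the default method_kwargs and compare_kwargs.

```python
def run_duplicate_check(data):
    """Run a duplicate check on ``data`` and return flagged data and matches."""
    # Imported lazily so that this snippet loads without the package installed.
    from cdm_reader_mapper.duplicates.duplicates import duplicate_check

    detector = duplicate_check(
        data,
        method="SortedNeighbourhood",
        offsets={"latitude": 0.1},  # example value from the parameter description
    )
    matches = detector.get_duplicates(keep="first")   # matched duplicates
    flagged = detector.flag_duplicates(keep="first")  # input data with flags
    return flagged, matches
```

The returned DupDetect instance keeps the comparison matrix, so get_duplicates and flag_duplicates can be called with different keep, limit, or equal_musts values without repeating the comparison.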
- cdm_reader_mapper.duplicates.duplicates.reindex_nulls(df, null_label)[source]¶
Reindex a DataFrame in ascending order based on the number of ‘null’ strings in each row.
- Parameters:
df (pd.DataFrame) – Input DataFrame. Cells with the string “null” are counted as nulls.
null_label (Any) – Missing value representative.
- Returns:
pd.DataFrame – DataFrame reindexed so that rows with fewer ‘null’ values appear first. Original row order is preserved for rows with the same null count.
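The ordering rule can be sketched in plain Python with a stable sort. The sketch uses a list of dictionaries instead of a DataFrame and is not the library's implementation; the helper name order_by_null_count and the sample rows are hypothetical.

```python
def order_by_null_count(rows, null_label="null"):
    """Sort rows ascending by how many cells equal null_label."""

    def null_count(row):
        return sum(1 for value in row.values() if value == null_label)

    # sorted() is stable, so rows with equal null counts keep their order.
    return sorted(rows, key=null_count)


rows = [
    {"id": "A", "latitude": "null", "station_speed": "null"},
    {"id": "B", "latitude": 50.0, "station_speed": "null"},
    {"id": "C", "latitude": 51.0, "station_speed": 4.0},
    {"id": "D", "latitude": "null", "station_speed": 3.0},
]
print([row["id"] for row in order_by_null_count(rows)])
```

Rows B and D each contain one null and therefore stay in their original relative order, giving C, B, D, A.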
- cdm_reader_mapper.duplicates.duplicates.remove_ignores(dic, columns)[source]¶
Remove dictionary entries where keys or values match ignored columns.
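A minimal sketch of this filtering, assuming a flat dictionary whose keys or values may name a column. The helper name drop_ignored and the sample dictionary are illustrative only; the actual function may handle nested or list values differently.

```python
def drop_ignored(dic, columns):
    """Drop entries whose key or value matches one of the ignored columns."""
    if isinstance(columns, str):
        columns = [columns]
    return {
        key: value
        for key, value in dic.items()
        if key not in columns and value not in columns
    }


# Illustrative keyword arguments for an indexing method.
method_kwargs = {"left_on": "report_timestamp", "window": 5}
print(drop_ignored(method_kwargs, "report_timestamp"))
```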