cdm_reader_mapper.common package¶

Common Data Model (CDM) reader and mapper common pandas operators.

class cdm_reader_mapper.common.ParquetStreamReader(source)¶

Bases: object

A wrapper that mimics pandas.io.parsers.TextFileReader.

Parameters:

source (iterable or iterator or callable) – Data source yielding pandas.DataFrame or pandas.Series objects. If a callable is provided, it must return a fresh iterator each time it is called (useful for copying/resetting the stream).

Variables:

columns (list or pandas.Index) – Column labels inferred from the first chunk.
dtypes (dict or pandas.Series) – Data types inferred from the first chunk.
attrs (dict) – User-defined metadata associated with the stream.

Notes

The stream is consumed as it is iterated.
Use copy() to create an independent iterator.
Some operations (e.g., read()) load all data into memory and may not be suitable for large datasets.
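The notes above, together with the callable form of source, can be illustrated with a small stand-in. This is a minimal sketch and not the library class: ResettableStream and make_chunks are hypothetical names, plain lists stand in for DataFrame chunks, and the copy here restarts from the beginning of the factory, which may differ from how the real copy() behaves.

```python
class ResettableStream:
    """Hypothetical stand-in for a stream built from an iterable or a factory."""

    def __init__(self, source):
        # A callable source is treated as a factory that returns a fresh iterator.
        self._factory = source if callable(source) else None
        self._iterator = iter(source()) if callable(source) else iter(source)

    def __iter__(self):
        return self

    def __next__(self):
        # Iterating consumes the underlying iterator.
        return next(self._iterator)

    def copy(self):
        # Assumption: only factory-backed streams can be copied in this sketch.
        if self._factory is None:
            raise ValueError("cannot copy a stream built from a one-shot iterator")
        return ResettableStream(self._factory)


def make_chunks():
    # Each call returns a brand-new iterator over the same chunks.
    return iter([[1, 2], [3, 4], [5]])


stream = ResettableStream(make_chunks)
first = next(stream)       # consumes the first chunk of `stream`
fresh = stream.copy()      # independent iterator, restarted from the factory
remaining = list(stream)   # the rest of the original stream
replayed = list(fresh)     # the copy still sees every chunk
```

Passing a factory rather than a one-shot iterator is what makes resetting possible: the wrapper never has to buffer chunks it has already handed out.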

close()[source]¶

Close the stream and release resources.

Return type: None

copy()[source]¶

Create an independent copy of the stream.

Return type: ParquetStreamReader
Returns: ParquetStreamReader – A new stream reader with independent iteration state.
Raises: ValueError – If the stream has been closed.

property empty: bool¶

Check whether the stream has any remaining data.

Returns: bool – True if the stream is exhausted, False otherwise.

Notes

This property creates a temporary copy of the stream to check for remaining elements without consuming the original.

get_chunk()[source]¶

Return the next available chunk.

This is equivalent to calling next(reader) and is provided for API compatibility with pandas readers.

Return type: DataFrame | Series
Returns: pandas.DataFrame or pandas.Series – The next chunk of data.

prepend(chunk)[source]¶

Push a chunk back onto the front of the stream.

Useful for peeking at the first chunk without losing it.

Parameters: chunk (pandas.DataFrame or pandas.Series) – The chunk to prepend.
Return type: None
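The peek-and-restore pattern that prepend() enables can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's implementation: peek is a hypothetical helper, and dictionaries stand in for DataFrame chunks.

```python
import itertools


def peek(stream):
    """Return the first chunk and an iterator that still yields it first."""
    iterator = iter(stream)
    first = next(iterator)
    # Chaining the consumed chunk in front of the rest restores the stream.
    return first, itertools.chain([first], iterator)


chunks = iter([{"a": 1}, {"a": 2}, {"a": 3}])
first, restored = peek(chunks)
all_chunks = list(restored)
```

The restored iterator yields the inspected chunk again, so downstream consumers see the complete sequence.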

read()[source]¶

Read all remaining data into a single DataFrame.

This consumes the entire stream and concatenates all remaining chunks into one DataFrame.

Return type: DataFrame
Returns: pandas.DataFrame – Concatenated result of all remaining chunks. Returns an empty DataFrame if the stream is exhausted.

Warning

This operation loads all data into memory and may not be suitable for large datasets.

reset_index(drop=False)[source]¶

Reset the index across all chunks to a continuous range.

Parameters: drop (bool, default False) – If True, do not insert the old index as a column. If False, the new index is also inserted as a column named “index”.
Return type: ParquetStreamReader
Returns: ParquetStreamReader – A new stream reader with reindexed chunks.
Raises: ValueError – If the stream has been closed.

class cdm_reader_mapper.common.ProcessFunction(data, func, func_args=None, func_kwargs=None, **kwargs)[source]¶

Bases: object

Stores data and a callable function with optional arguments for processing.

Parameters:

data (pd.DataFrame, pd.Series, iterable of pd.DataFrame or iterable of pd.Series) – Input data to be processed.
func (Callable) – A callable that will be applied to data.
func_args (Any, list of Any or tuple of Any, optional) – Positional arguments to pass to func.
func_kwargs (dict, optional) – Keyword arguments to pass to func.
**kwargs (Any) – Additional metadata or configuration parameters stored with the instance.

cdm_reader_mapper.common.collect_json_files(idir, *args, base=None, name=None)[source]¶

Collect JSON files recursively based on directory and optional subdirectories.

Parameters:

idir (str) – Base directory to search.
*args (str) – Optional subdirectory names for recursive searching.
base (str, optional) – Base path to prepend to idir.
name (str, optional) – Base file name to search. If None, defaults to idir.

Return type:

list[Path]

Returns:

list of Path – List of matching JSON file paths.

cdm_reader_mapper.common.combine_dicts(list_of_files, base=None)[source]¶

Combine multiple JSON files or dictionaries into a single dictionary.

Supports nested ‘substitute’ references to recursively load additional JSON files.

Parameters:

list_of_files (str, Path, list) – JSON file(s) or dictionaries to combine.
base (str, optional) – Base path used when resolving substituted files.

Return type:

dict[str, Any]

Returns:

dict – Combined dictionary from all input files/dictionaries.

cdm_reader_mapper.common.count_by_cat(data, columns=None)[source]¶

Count unique values per column in a DataFrame or an Iterable of DataFrames.

Parameters:

data (pandas.DataFrame or Iterable[pd.DataFrame]) – Input dataset.
columns (str, list or tuple, optional) – Name(s) of the data column(s) to be selected. If None, all columns are used.

Return type:

dict[str | tuple[str, str], dict[Any, int]]

Returns:

dict[str | tuple[str, str], dict[Any, int]] – Dictionary where each key is a column name, and each value is a dictionary mapping unique values (including NaN as ‘nan’) to their counts.

Notes

Works with large files via ParquetStreamReader by iterating through chunks.
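Counting chunk by chunk, as the note describes, only needs one running tally per column. The following is a minimal sketch of that idea and not the library's code: count_by_category is a hypothetical name, and each chunk is modelled as a list of row dictionaries instead of a DataFrame.

```python
import math
from collections import Counter, defaultdict


def count_by_category(chunks, columns=None):
    """Accumulate per-column value counts across an iterable of chunks."""
    counts = defaultdict(Counter)
    for chunk in chunks:
        for row in chunk:
            for column, value in row.items():
                if columns is not None and column not in columns:
                    continue
                # Report missing values under the string key "nan".
                if isinstance(value, float) and math.isnan(value):
                    value = "nan"
                counts[column][value] += 1
    return {column: dict(counter) for column, counter in counts.items()}


chunks = [
    [{"platform": "ship", "flag": 1.0}, {"platform": "buoy", "flag": float("nan")}],
    [{"platform": "ship", "flag": 1.0}],
]
result = count_by_category(chunks)
only_platform = count_by_category(chunks, columns=["platform"])
```

Only the counters are kept between chunks, so memory use depends on the number of distinct values rather than on the number of rows.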

cdm_reader_mapper.common.get_filename(pattern, path='.', extension='pq', separator='_')[source]¶

Construct a filename from a sequence of string components.

Parameters:

pattern (Sequence[str]) – Sequence of string components to be joined with separator (e.g., [“sales”, “2024”, “Q1”]). Empty or falsy items are ignored.
path (str or Path-like, optional) – Directory in which the file should be placed. Default is current directory “.”.
extension (str, optional) – File extension, with or without a leading dot (e.g., “pq” or “.pq”). Default is “pq”.
separator (str, optional) – Separator to join the pattern components (default “_”).

Return type:

str

Returns:

str – Full file path including directory, filename, and extension. Returns an empty string if pattern is empty or contains no truthy elements.

Notes

Any empty or falsy parts of pattern will be removed.
The extension is normalized to always begin with a leading dot.

Examples

>>> get_filename(["data", "2025"])
'./data_2025.pq'

>>> get_filename(["report", ""], path="/tmp", extension=".txt")
'/tmp/report.txt'

>>> get_filename(["part1", "part2"], separator="_")
'./part1_part2.pq'

cdm_reader_mapper.common.get_length(data)[source]¶

Get the total number of rows in a pandas object.

Parameters: data (pandas.DataFrame or Iterable[pd.DataFrame]) – Input dataset.
Return type: int
Returns: int – Total number of rows.

Notes

Works with large files via ParquetStreamReader by using a specialized handler to count rows without loading the entire file into memory.

cdm_reader_mapper.common.is_valid_iterator(reader)[source]¶

Check if reader is a valid iterator.

Parameters: reader (Any) – Object to be checked for iterator compatibility.
Return type: bool
Returns: bool – True if reader is an instance of Iterator, otherwise False.
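The distinction between an iterable and an iterator matters for this check. The snippet below uses only the standard collections.abc classes to show it; whether the library tests against exactly these classes is an assumption.

```python
from collections.abc import Iterable, Iterator

values = [1, 2, 3]

# A list can be iterated over, but it is not itself an iterator.
list_is_iterable = isinstance(values, Iterable)
list_is_iterator = isinstance(values, Iterator)

# iter() and generator expressions both produce iterators.
iter_is_iterator = isinstance(iter(values), Iterator)
generator_is_iterator = isinstance((value for value in values), Iterator)
```

Under a check of this kind, a plain list of chunks would not count as a valid iterator, while iter(list_of_chunks) or a generator would.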

cdm_reader_mapper.common.load_file(name, github_url='https://github.com/glamod/cdm-testdata', branch='main', cache=True, cache_dir=PosixPath('/home/docs/.cache/cdm-testdata'), clear_cache=False, within_drs=True)[source]¶

Load a file from an online GitHub-like repository.

Parameters:

name (str or os.PathLike) – Name of the file containing the dataset.
github_url (str) – URL to GitHub repository where the data is stored.
branch (str, optional) – For GitHub-hosted files, the branch to download from.
cache (bool) – If True, then cache data locally for use on subsequent calls.
cache_dir (Path) – The directory in which to search for and write cached data.
clear_cache (bool) – If True, clear cache directory.
within_drs (bool) – If True, then download data within data reference syntax.

Return type:

Path

Returns:

Path – Destination path of the downloaded file.

cdm_reader_mapper.common.open_json_file(ifile, encoding='utf-8')[source]¶

Open a JSON file and return its contents as a dictionary.

Parameters:

ifile (str or Path) – Path to the JSON file.
encoding (str, default 'utf-8') – Encoding to use when reading the file.

Return type:

dict[Any, Any]

Returns:

dict – Contents of the JSON file.

cdm_reader_mapper.common.parquet_stream_from_iterable(iterable)[source]¶

Stream an iterable of DataFrame/Series to parquet and return a disk-backed ParquetStreamReader.

Memory usage remains constant.

Parameters:

iterable (Iterable of pd.DataFrame or pd.Series) – An iterable of pandas DataFrame or Series objects to be streamed to disk.

Return type:

ParquetStreamReader

Returns:

ParquetStreamReader – A disk-backed stream reader that lazily reads the provided iterable from Parquet files stored in a temporary directory.

Raises:

ValueError – If the input iterable is empty.
TypeError – If elements in the iterable are not pandas DataFrame or Series objects, or if mixed types are provided across chunks.
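The disk-backed idea, including the two documented error cases, can be sketched without pandas or Parquet. This is an illustration under loud assumptions: spool_to_disk is a hypothetical helper, pickle files replace Parquet files, and lists or dictionaries stand in for DataFrame and Series chunks.

```python
import pickle
import tempfile
from pathlib import Path


def spool_to_disk(iterable):
    """Write each chunk to its own file and return a lazy reader factory."""
    # The temporary directory is not cleaned up in this sketch.
    directory = Path(tempfile.mkdtemp(prefix="chunks_"))
    paths = []
    first_type = None
    for number, chunk in enumerate(iterable):
        # Stand-in type check: lists and dicts play the role of pandas objects.
        if not isinstance(chunk, (list, dict)):
            raise TypeError(f"unsupported chunk type: {type(chunk).__name__}")
        if first_type is None:
            first_type = type(chunk)
        elif type(chunk) is not first_type:
            raise TypeError("mixed chunk types are not allowed")
        path = directory / f"chunk_{number:05d}.pkl"
        with path.open("wb") as handle:
            pickle.dump(chunk, handle)
        paths.append(path)
    if not paths:
        raise ValueError("input iterable is empty")

    def read_chunks():
        # Only one chunk is loaded into memory at a time.
        for path in paths:
            with path.open("rb") as handle:
                yield pickle.load(handle)

    return read_chunks


reader = spool_to_disk(iter([[1, 2], [3]]))
chunks = list(reader())
```

Because the chunks live on disk, calling reader() again replays the data even though the original iterator has been exhausted.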

cdm_reader_mapper.common.process_disk_backed(reader, func, func_args=None, func_kwargs=None, requested_types=(<class 'pandas.DataFrame'>, <class 'pandas.Series'>), non_data_output='first', non_data_proc=None, non_data_proc_args=None, non_data_proc_kwargs=None, makecopy=True)[source]¶

Consume a stream of DataFrames, process them, and return a tuple of results.

DataFrames are cached to disk (Parquet) and returned as generators.

Parameters:

reader (Iterator of pd.DataFrame or pd.Series) – Input stream of DataFrame or Series objects to be processed in chunks.
func (Callable) – Function applied to each synchronized set of chunks from the stream. May return data objects (pd.DataFrame or pd.Series) and/or metadata.
func_args (tuple of Any, optional) – Additional positional arguments passed to func. Defaults to empty tuple.
func_kwargs (dict, optional) – Additional keyword arguments passed to func. Defaults to empty dict.
requested_types (type or tuple of type, default (pd.DataFrame, pd.Series)) – Types treated as data outputs from func. All other outputs are treated as metadata.
non_data_output ({"first", "acc"}, default "first") – Strategy for handling non-data outputs:
- “first”: only the first chunk’s metadata is kept.
- “acc”: metadata is accumulated across all chunks.
non_data_proc (Callable, optional) – Optional function applied to aggregated non-data outputs after processing.
non_data_proc_args (tuple of Any, optional) – Positional arguments for non_data_proc.
non_data_proc_kwargs (dict, optional) – Keyword arguments for non_data_proc.
makecopy (bool, default True) – If True, ensures independent copies of input streams are used internally.

Return type:

tuple[Any, ...] | None

Returns:

tuple of Any or None – A tuple containing:

One or more ParquetStreamReader objects for chunked data outputs (if any data was produced)
Processed non-data outputs (metadata), optionally transformed by non_data_proc

Raises:

ValueError – If non_data_proc is provided but not callable.
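The difference between the “first” and “acc” strategies can be shown with a small stand-in. This is a sketch and not the library function: process_chunks and double_and_count are hypothetical, lists stand in for DataFrames, and results are kept in memory instead of being cached to disk.

```python
def process_chunks(chunks, func, non_data_output="first"):
    """Apply func to each chunk and collect data and metadata separately."""
    if non_data_output not in ("first", "acc"):
        raise ValueError("non_data_output must be 'first' or 'acc'")
    data_outputs = []
    metadata = []
    for number, chunk in enumerate(chunks):
        data, meta = func(chunk)
        data_outputs.append(data)
        # "first" keeps metadata from chunk 0 only, "acc" keeps every chunk's.
        if non_data_output == "acc" or number == 0:
            metadata.append(meta)
    if non_data_output == "first":
        return data_outputs, (metadata[0] if metadata else None)
    return data_outputs, metadata


def double_and_count(chunk):
    # Returns one data output and one metadata output per chunk.
    return [value * 2 for value in chunk], len(chunk)


first = process_chunks([[1, 2], [3]], double_and_count, non_data_output="first")
accumulated = process_chunks([[1, 2], [3]], double_and_count, non_data_output="acc")
```

With “first” the metadata is the single value from the first chunk, while “acc” yields one entry per chunk that a follow-up function such as non_data_proc could then reduce.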

cdm_reader_mapper.common.process_function(data_only=False, postprocessing=None)[source]¶

Decorator to apply function to both pd.DataFrame and Iterable[pd.DataFrame].

Parameters:

data_only (bool, default False) – If True, only the primary data output is returned from the processed result.
postprocessing (dict, optional) – Optional configuration for a postprocessing step applied to each result. Expected keys:
- “func”: callable applied to each DataFrame/Series/stream output.
- “kwargs”: list or dict of argument names taken from the original call.

Return type:

Callable[..., Any]

Returns:

Callable – A decorator that wraps a function so it can operate on both in-memory pandas objects and disk-backed ParquetStreamReader streams.

Raises:

ValueError – If a provided postprocessing function is not callable.

cdm_reader_mapper.common.replace_columns(df_l, df_r, pivot_c=None, pivot_l=None, pivot_r=None, rep_c=None, rep_map=None)[source]¶

Replace columns in one DataFrame using row-matching from another.

This function works for both a pd.DataFrame and any Iterable of pandas DataFrames.

Parameters:

df_l (pandas.DataFrame or Iterable[pd.DataFrame]) – The left DataFrame whose columns will be replaced.
df_r (pandas.DataFrame or Iterable[pd.DataFrame]) – The right DataFrame providing replacement values.
pivot_c (str, optional) – A single pivot column present in both DataFrames. Overrides pivot_l and pivot_r.
pivot_l (str, optional) – Pivot column in df_l. Used only when pivot_c is not supplied.
pivot_r (str, optional) – Pivot column in df_r. Used only when pivot_c is not supplied.
rep_c (str or list of str, optional) – One or more column names to replace in df_l. Ignored if rep_map is supplied.
rep_map (dict, optional) – Mapping between left and right column names as {left_col: right_col}.

Returns:

pd.DataFrame or ParquetStreamReader – Updated data with replacements applied.

Raises:

TypeError – If df_l or df_r is not a pandas DataFrame.
ValueError –
- If one of pivot_l and pivot_r is not defined.
- If neither rep_map nor rep_c is defined.
- If replacement source columns are not found in df_r.

Notes

This function logs errors and returns None instead of raising exceptions.

cdm_reader_mapper.common.split_by_boolean(data, mask, boolean, reset_index=False, inverse=False, return_rejected=False)[source]¶

Split a DataFrame or an Iterable of DataFrames using a boolean mask via split_dataframe_by_boolean.

Parameters:

data (pd.DataFrame or iterable of pd.DataFrame) – DataFrame to be split.
mask (pd.DataFrame or iterable of pd.DataFrame) – Boolean mask with the same length as data.
boolean (bool) – Determines mask interpretation:
- If True select rows where all mask columns are True.
- If False select rows where any mask column is False.
reset_index (bool, optional) – If True reset the index of returned DataFrames.
inverse (bool, optional) – If True invert the selection performed by the underlying function.
return_rejected (bool, optional) – If True return rejected rows as the second output. If False the rejected output is empty but dtype-preserving.

Return type:

tuple[DataFrame | ParquetStreamReader, DataFrame | ParquetStreamReader, Index | MultiIndex, Index | MultiIndex]

Returns:

tuple of pd.DataFrame or ParquetStreamReader and pd.DataFrame or ParquetStreamReader and pd.Index or pd.MultiIndex and pd.Index or pd.MultiIndex – Selected rows (all mask columns True), rejected rows, original indexes of selection and original indexes of rejection.
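The mask interpretation described for boolean can be sketched on plain Python sequences. This is an illustration, not the library's code: split_rows_by_mask is a hypothetical helper, rows are arbitrary objects, each mask entry is a tuple of booleans (one per mask column), and rejected rows are always returned.

```python
def split_rows_by_mask(rows, mask, boolean=True):
    """Split rows into selected and rejected parts using a row-wise mask."""
    selected, rejected = [], []
    selected_index, rejected_index = [], []
    for position, (row, flags) in enumerate(zip(rows, mask)):
        # True: keep rows whose mask columns are all True.
        # False: keep rows where any mask column is False.
        keep = all(flags) if boolean else not all(flags)
        if keep:
            selected.append(row)
            selected_index.append(position)
        else:
            rejected.append(row)
            rejected_index.append(position)
    return selected, rejected, selected_index, rejected_index


rows = ["a", "b", "c"]
mask = [(True, True), (True, False), (False, False)]
all_true = split_rows_by_mask(rows, mask, boolean=True)
any_false = split_rows_by_mask(rows, mask, boolean=False)
```

The two calls partition the same rows in opposite ways, which mirrors the four-part return value documented above.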

cdm_reader_mapper.common.split_by_boolean_false(data, mask, reset_index=False, inverse=False, return_rejected=False)[source]¶

Split a DataFrame or an Iterable of DataFrames where the boolean mask is False.

Parameters:

data (pd.DataFrame or Iterable[pd.DataFrame]) – DataFrame to be split.
mask (pd.DataFrame or Iterable[pd.DataFrame]) – Boolean mask with the same length as data.
reset_index (bool, optional) – If True reset indices in returned DataFrames.
inverse (bool, optional) – If True invert the selection.
return_rejected (bool, optional) – If True return rejected rows as the second output. If False the rejected output is empty but dtype-preserving.

Return type:

tuple[DataFrame | ParquetStreamReader, DataFrame | ParquetStreamReader, Index | MultiIndex, Index | MultiIndex]

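Returns:

tuple – Selected rows, rejected rows, original indexes of the selection, and original indexes of the rejection.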

cdm_reader_mapper.common.split_by_boolean_true(data, mask, reset_index=False, inverse=False, return_rejected=False)[source]¶

Split a DataFrame or an Iterable of DataFrames where the boolean mask is True.

Parameters:

data (pd.DataFrame or iterable of pd.DataFrame) – DataFrame to be split.
mask (pd.DataFrame or iterable of pd.DataFrame) – Boolean mask with the same length as data.
reset_index (bool, optional) – If True reset indices in returned DataFrames.
inverse (bool, optional) – If True invert the selection.
return_rejected (bool, optional) – If True return rejected rows as the second output. If False the rejected output is empty but dtype-preserving.

Return type:

tuple[DataFrame | ParquetStreamReader, DataFrame | ParquetStreamReader, Index | MultiIndex, Index | MultiIndex]

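Returns:

tuple – Selected rows, rejected rows, original indexes of the selection, and original indexes of the rejection.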

cdm_reader_mapper.common.split_by_column_entries(data, selection, reset_index=False, inverse=False, return_rejected=False)[source]¶

Split a DataFrame or an Iterable of DataFrames based on matching values in a given column.

Parameters:

data (pd.DataFrame or iterable of pd.DataFrame) – DataFrame to be split.
selection (dict) – Mapping of a column name to an iterable of allowed values. Example: {“city”: [“London”, “Berlin”]}.
reset_index (bool, optional) – Whether to reset index in returned DataFrames.
inverse (bool, optional) – If True invert the selection.
return_rejected (bool, optional) – If True return rejected rows as the second output. If False the rejected output is empty but dtype-preserving.

Return type:

tuple[DataFrame | ParquetStreamReader, DataFrame | ParquetStreamReader, Index | MultiIndex, Index | MultiIndex]

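Returns:

tuple – Selected rows, rejected rows, original indexes of the selection, and original indexes of the rejection.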

cdm_reader_mapper.common.split_by_index(data, index, reset_index=False, inverse=False, return_rejected=False)[source]¶

Split a DataFrame or an Iterable of DataFrames by selecting specific index labels.

Parameters:

data (pd.DataFrame or iterable of pd.DataFrame) – DataFrame to be split.
index (label or sequence of labels) – Index values to select.
reset_index (bool, optional) – If True reset index in returned DataFrames.
inverse (bool, optional) – If True select rows not in index.
return_rejected (bool, optional) – If True return rejected rows as the second output. If False the rejected output is empty but dtype-preserving.

Return type:

tuple[DataFrame | ParquetStreamReader, DataFrame | ParquetStreamReader, Index | MultiIndex, Index | MultiIndex]

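Returns:

tuple – Selected rows, rejected rows, original indexes of the selection, and original indexes of the rejection.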

cdm_reader_mapper.common.standardize_object_columns(df)[source]¶

Convert string columns to object dtype and replace NaNs with None.

Parameters: df (pd.DataFrame) – The input DataFrame to be standardized.
Return type: DataFrame
Returns: pd.DataFrame – The same DataFrame instance after the dtype conversion and NaN handling.

Submodules¶

cdm_reader_mapper.common.getting_files module¶

Internal pandas local file operator.

cdm_reader_mapper.common.getting_files.get_path(path)[source]¶

Get path either from _files(path) or directly from file system.

Parameters: path (str | Path) – If it points to an existing file on disk, that file is returned. Otherwise the value is interpreted as a module name or as <module>:<subpath> (e.g. "mypkg:templates/index.html").
Return type: Path | None
Returns: Path | None – The resolved path or None if the resource cannot be found.
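The lookup order described for path can be sketched with importlib.resources. This is a hedged illustration and not the library's implementation: resolve_path is a hypothetical name, and it assumes the package is installed on the regular file system so that the resource can be converted to a Path.

```python
from importlib.resources import files
from pathlib import Path


def resolve_path(path):
    """Return an existing file, else resolve '<module>:<subpath>', else None."""
    candidate = Path(path)
    if candidate.is_file():
        return candidate
    # Split "module:subpath"; subpath is empty when no colon is present.
    module, _, subpath = str(path).partition(":")
    try:
        root = files(module)
    except (ModuleNotFoundError, TypeError, ValueError):
        return None
    resource = root.joinpath(subpath) if subpath else root
    resolved = Path(str(resource))
    return resolved if resolved.exists() else None


# The standard-library json package is used here purely as a demo target.
found = resolve_path("json:decoder.py")
missing = resolve_path("no_such_module_xyz:file.txt")
```

Checking the file system first keeps ordinary paths cheap, and the module lookup only runs when no such file exists.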

cdm_reader_mapper.common.getting_files.load_file(name, github_url='https://github.com/glamod/cdm-testdata', branch='main', cache=True, cache_dir=PosixPath('/home/docs/.cache/cdm-testdata'), clear_cache=False, within_drs=True)[source]¶

Load a file from an online GitHub-like repository.

Parameters:

name (str or os.PathLike) – Name of the file containing the dataset.
github_url (str) – URL to GitHub repository where the data is stored.
branch (str, optional) – For GitHub-hosted files, the branch to download from.
cache (bool) – If True, then cache data locally for use on subsequent calls.
cache_dir (Path) – The directory in which to search for and write cached data.
clear_cache (bool) – If True, clear cache directory.
within_drs (bool) – If True, then download data within data reference syntax.

Return type:

Path

Returns:

Path – Destination path of the downloaded file.

cdm_reader_mapper.common.inspect module¶

Common Data Model (CDM) pandas inspection operators.

Created on Wed Jul 3 09:48:18 2019

@author: iregon

cdm_reader_mapper.common.inspect.count_by_cat(data, columns=None)[source]¶

Count unique values per column in a DataFrame or an Iterable of DataFrames.

Parameters:

data (pandas.DataFrame or Iterable[pd.DataFrame]) – Input dataset.
columns (str, list or tuple, optional) – Name(s) of the data column(s) to be selected. If None, all columns are used.

Return type:

dict[str | tuple[str, str], dict[Any, int]]

Returns:

dict[str | tuple[str, str], dict[Any, int]] – Dictionary where each key is a column name, and each value is a dictionary mapping unique values (including NaN as ‘nan’) to their counts.

Notes

Works with large files via ParquetStreamReader by iterating through chunks.

cdm_reader_mapper.common.inspect.get_length(data)[source]¶

Get the total number of rows in a pandas object.

Parameters: data (pandas.DataFrame or Iterable[pd.DataFrame]) – Input dataset.
Return type: int
Returns: int – Total number of rows.

Notes

Works with large files via ParquetStreamReader by using a specialized handler to count rows without loading the entire file into memory.

cdm_reader_mapper.common.inspect.merge_sum_dicts(dicts)[source]¶

Recursively merge dictionaries, summing numeric values at the leaves.

Parameters: dicts (list of Mapping) – A list of dictionaries for recursive merging.
Return type: dict[str, Any]
Returns: dict – Recursively merged dictionary.
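A recursive merge that sums numeric leaves can be written in a few lines. The sketch below shows one way to do it and is not taken from the library: merge_sum is a hypothetical name, and it assumes that a key holds either nested dictionaries or numbers consistently across all inputs.

```python
def merge_sum(dicts):
    """Recursively merge dictionaries, adding up numeric leaf values."""
    merged = {}
    for mapping in dicts:
        for key, value in mapping.items():
            if isinstance(value, dict):
                # Merge nested dictionaries level by level.
                merged[key] = merge_sum([merged.get(key, {}), value])
            else:
                # Sum leaves; a missing key starts at zero.
                merged[key] = merged.get(key, 0) + value
    return merged


per_chunk_counts = [
    {"a": {"x": 1, "y": 2}, "n": 1},
    {"a": {"x": 3}, "n": 4},
]
total = merge_sum(per_chunk_counts)
```

This is the kind of reduction that turns per-chunk counts into totals for a whole stream.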

cdm_reader_mapper.common.io_files module¶

Utility function for reading and writing files.

cdm_reader_mapper.common.io_files.get_filename(pattern, path='.', extension='pq', separator='_')[source]¶

Construct a filename from a sequence of string components.

Parameters:

pattern (Sequence[str]) – Sequence of string components to be joined with separator (e.g., [“sales”, “2024”, “Q1”]). Empty or falsy items are ignored.
path (str or Path-like, optional) – Directory in which the file should be placed. Default is current directory “.”.
extension (str, optional) – File extension, with or without a leading dot (e.g., “pq” or “.pq”). Default is “pq”.
separator (str, optional) – Separator to join the pattern components (default “_”).

Return type:

str

Returns:

str – Full file path including directory, filename, and extension. Returns an empty string if pattern is empty or contains no truthy elements.

Notes

Any empty or falsy parts of pattern will be removed.
The extension is normalized to always begin with a leading dot.

Examples

>>> get_filename(["data", "2025"])
'./data_2025.pq'

>>> get_filename(["report", ""], path="/tmp", extension=".txt")
'/tmp/report.txt'

>>> get_filename(["part1", "part2"], separator="_")
'./part1_part2.pq'
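The rules stated in the parameter descriptions and notes (drop falsy parts, normalize the extension, join with the separator) can be sketched as follows. This is an assumption-laden stand-in and not the library function: build_filename is a hypothetical name, and posixpath is used so that the result does not depend on the operating system.

```python
import posixpath


def build_filename(pattern, path=".", extension="pq", separator="_"):
    """Join non-empty parts, add a dotted extension, and prefix the directory."""
    parts = [part for part in pattern if part]
    if not parts:
        # Nothing to join, so no filename can be built.
        return ""
    if not extension.startswith("."):
        extension = "." + extension
    return posixpath.join(path, separator.join(parts) + extension)


default_name = build_filename(["data", "2025"])
custom_name = build_filename(["report", ""], path="/tmp", extension=".txt")
empty_name = build_filename(["", ""])
```

Normalizing the extension up front means callers can pass either "pq" or ".pq" and get the same result.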

cdm_reader_mapper.common.iterators module¶

Utilities for handling pandas TextParser objects safely.

class cdm_reader_mapper.common.iterators.ProcessFunction(data, func, func_args=None, func_kwargs=None, **kwargs)[source]¶

Bases: object

Stores data and a callable function with optional arguments for processing.

Parameters:

data (pd.DataFrame, pd.Series, iterable of pd.DataFrame or iterable of pd.Series) – Input data to be processed.
func (Callable) – A callable that will be applied to data.
func_args (Any, list of Any or tuple of Any, optional) – Positional arguments to pass to func.
func_kwargs (dict, optional) – Keyword arguments to pass to func.
**kwargs (Any) – Additional metadata or configuration parameters stored with the instance.

cdm_reader_mapper.common.iterators.ensure_parquet_reader(obj)[source]¶

Ensure obj is a ParquetStreamReader.

Parameters: obj (Any) – Object that may represent a ParquetStreamReader, an iterator of pd.DataFrame or pd.Series objects, or a static value.
Return type: Any
Returns: Any – If obj is already a ParquetStreamReader, it is returned unchanged. If obj is an iterator, it is converted into a ParquetStreamReader. Otherwise, obj is returned as-is (treated as a static value).

cdm_reader_mapper.common.iterators.is_valid_iterator(reader)[source]¶

Check if reader is a valid iterator.

Parameters: reader (Any) – Object to be checked for iterator compatibility.
Return type: bool
Returns: bool – True if reader is an instance of Iterator, otherwise False.

cdm_reader_mapper.common.iterators.parquet_stream_from_iterable(iterable)[source]¶

Stream an iterable of DataFrame/Series to parquet and return a disk-backed ParquetStreamReader.

Memory usage remains constant.

Parameters:

iterable (Iterable of pd.DataFrame or pd.Series) – An iterable of pandas DataFrame or Series objects to be streamed to disk.

Return type:

ParquetStreamReader

Returns:

ParquetStreamReader – A disk-backed stream reader that lazily reads the provided iterable from Parquet files stored in a temporary directory.

Raises:

ValueError – If the input iterable is empty.
TypeError – If elements in the iterable are not pandas DataFrame or Series objects, or if mixed types are provided across chunks.

cdm_reader_mapper.common.iterators.process_disk_backed(reader, func, func_args=None, func_kwargs=None, requested_types=(<class 'pandas.DataFrame'>, <class 'pandas.Series'>), non_data_output='first', non_data_proc=None, non_data_proc_args=None, non_data_proc_kwargs=None, makecopy=True)[source]¶

Consume a stream of DataFrames, process them, and return a tuple of results.

DataFrames are cached to disk (Parquet) and returned as generators.

Parameters:

reader (Iterator of pd.DataFrame or pd.Series) – Input stream of DataFrame or Series objects to be processed in chunks.
func (Callable) – Function applied to each synchronized set of chunks from the stream. May return data objects (pd.DataFrame or pd.Series) and/or metadata.
func_args (tuple of Any, optional) – Additional positional arguments passed to func. Defaults to empty tuple.
func_kwargs (dict, optional) – Additional keyword arguments passed to func. Defaults to empty dict.
requested_types (type or tuple of type, default (pd.DataFrame, pd.Series)) – Types treated as data outputs from func. All other outputs are treated as metadata.
non_data_output ({"first", "acc"}, default "first") – Strategy for handling non-data outputs:
- “first”: only the first chunk’s metadata is kept.
- “acc”: metadata is accumulated across all chunks.
non_data_proc (Callable, optional) – Optional function applied to aggregated non-data outputs after processing.
non_data_proc_args (tuple of Any, optional) – Positional arguments for non_data_proc.
non_data_proc_kwargs (dict, optional) – Keyword arguments for non_data_proc.
makecopy (bool, default True) – If True, ensures independent copies of input streams are used internally.

Return type:

tuple[Any, ...] | None

Returns:

tuple of Any or None – A tuple containing:

One or more ParquetStreamReader objects for chunked data outputs (if any data was produced)
Processed non-data outputs (metadata), optionally transformed by non_data_proc

Raises:

ValueError – If non_data_proc is provided but not callable.

cdm_reader_mapper.common.iterators.process_function(data_only=False, postprocessing=None)[source]¶

Decorator to apply function to both pd.DataFrame and Iterable[pd.DataFrame].

Parameters:

data_only (bool, default False) – If True, only the primary data output is returned from the processed result.
postprocessing (dict, optional) – Optional configuration for a postprocessing step applied to each result. Expected keys:
- “func”: callable applied to each DataFrame/Series/stream output.
- “kwargs”: list or dict of argument names taken from the original call.

Return type:

Callable[..., Any]

Returns:

Callable – A decorator that wraps a function so it can operate on both in-memory pandas objects and disk-backed ParquetStreamReader streams.

Raises:

ValueError – If a provided postprocessing function is not callable.
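The core idea of such a decorator, dispatching on whether the input is a single object or a stream of chunks, can be sketched as below. This is a simplified illustration and not the library decorator: chunkwise and scale are hypothetical, the sketch takes no data_only or postprocessing options, and lists stand in for DataFrames.

```python
import functools
from collections.abc import Iterator


def chunkwise(func):
    """Let func accept either one object or an iterator of chunks."""

    @functools.wraps(func)
    def wrapper(data, *args, **kwargs):
        if isinstance(data, Iterator):
            # Streamed input: apply func lazily to every chunk.
            return (func(chunk, *args, **kwargs) for chunk in data)
        # In-memory input: apply func directly.
        return func(data, *args, **kwargs)

    return wrapper


@chunkwise
def scale(values, factor=1):
    return [value * factor for value in values]


single = scale([1, 2], factor=10)
streamed = list(scale(iter([[1, 2], [3]]), factor=10))
```

The wrapped function is written once for a single chunk, and the decorator takes care of applying it across a stream.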

cdm_reader_mapper.common.json_dict module¶

JSON dictionary manipulator.

cdm_reader_mapper.common.json_dict.collect_json_files(idir, *args, base=None, name=None)[source]¶

Collect JSON files recursively based on directory and optional subdirectories.

Parameters:

idir (str) – Base directory to search.
*args (str) – Optional subdirectory names for recursive searching.
base (str, optional) – Base path to prepend to idir.
name (str, optional) – Base file name to search. If None, defaults to idir.

Return type:

list[Path]

Returns:

list of Path – List of matching JSON file paths.

cdm_reader_mapper.common.json_dict.combine_dicts(list_of_files, base=None)[source]¶

Combine multiple JSON files or dictionaries into a single dictionary.

Supports nested ‘substitute’ references to recursively load additional JSON files.

Parameters:

list_of_files (str, Path, list) – JSON file(s) or dictionaries to combine.
base (str, optional) – Base path used when resolving substituted files.

Return type:

dict[str, Any]

Returns:

dict – Combined dictionary from all input files/dictionaries.

cdm_reader_mapper.common.json_dict.open_json_file(ifile, encoding='utf-8')[source]¶

Open a JSON file and return its contents as a dictionary.

Parameters:

ifile (str or Path) – Path to the JSON file.
encoding (str, default 'utf-8') – Encoding to use when reading the file.

Return type:

dict[Any, Any]

Returns:

dict – Contents of the JSON file.

cdm_reader_mapper.common.logging_hdlr module¶

Initialize logger.

Created on Wed Apr 3 08:45:03 2019

@author: iregon

cdm_reader_mapper.common.logging_hdlr.init_logger(module, level='INFO', fn=None)[source]¶

Initialize and configure a logger for a given module.

Parameters:

module (str) – Name of the module or logger.
level (str, default 'INFO') – Logging level as string (‘DEBUG’, ‘INFO’, ‘WARNING’, ‘ERROR’, ‘CRITICAL’).
fn (str or None, optional) – Optional filename to write logs to. If None, logs go to stdout.

Return type:

Logger

Returns:

logging.Logger – Configured logger instance.

Notes

This function calls logging.basicConfig to configure the root logger. Repeated calls to this function may not reconfigure logging unless reload(logging) is used.
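The behavior described in the note comes from logging.basicConfig itself, which does nothing when the root logger already has handlers. The snippet below demonstrates this with the standard library only. The force=True argument (available since Python 3.8) is a standard-library option for reconfiguring and is shown as an alternative; it is not something this function is documented to use.

```python
import logging

# Start from a known state: force=True replaces any existing root handlers.
logging.basicConfig(level=logging.INFO, force=True)

# A second plain call is ignored because the root logger has a handler.
logging.basicConfig(level=logging.DEBUG)
level_after_second_call = logging.getLogger().level

# force=True removes the existing handlers and applies the new settings.
logging.basicConfig(level=logging.DEBUG, force=True)
level_after_forced_call = logging.getLogger().level
```

After the second plain call the root level is still INFO, and only the forced call switches it to DEBUG.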

cdm_reader_mapper.common.object_types module¶

Utility function for reading and writing files.

cdm_reader_mapper.common.object_types.standardize_object_columns(df)[source]¶

Convert string columns to object dtype and replace NaNs with None.

Parameters: df (pd.DataFrame) – The input DataFrame to be standardized.
Return type: DataFrame
Returns: pd.DataFrame – The same DataFrame instance after the dtype conversion and NaN handling.

cdm_reader_mapper.common.replace module¶

Common Data Model (CDM) pandas replacement operators.

Created on Wed Jul 3 09:48:18 2019

Replace columns from the right DataFrame into the left DataFrame.

Replacement occurs on a pivot column, which may have the same name in both DataFrames (pivot_c) or different names (pivot_l and pivot_r).

One or multiple columns can be replaced, and multi-indexing is supported (tested only on the left, so far…).

Replacement arguments:

rep_c: string or list of column name(s) to replace; the names are the same in left and right.
rep_map: dictionary {col_l: col_r, …} used when the names differ.

@author: iregon

cdm_reader_mapper.common.replace.replace_columns(df_l, df_r, pivot_c=None, pivot_l=None, pivot_r=None, rep_c=None, rep_map=None)[source]¶

Replace columns in one DataFrame using row-matching from another.

This function works for both a pd.DataFrame and any Iterable of pandas DataFrames.

Parameters:

df_l (pandas.DataFrame or Iterable[pd.DataFrame]) – The left DataFrame whose columns will be replaced.
df_r (pandas.DataFrame or Iterable[pd.DataFrame]) – The right DataFrame providing replacement values.
pivot_c (str, optional) – A single pivot column present in both DataFrames. Overrides pivot_l and pivot_r.
pivot_l (str, optional) – Pivot column in df_l. Used only when pivot_c is not supplied.
pivot_r (str, optional) – Pivot column in df_r. Used only when pivot_c is not supplied.
rep_c (str or list of str, optional) – One or more column names to replace in df_l. Ignored if rep_map is supplied.
rep_map (dict, optional) – Mapping between left and right column names as {left_col: right_col}.

Returns:

pd.DataFrame or ParquetStreamReader – Updated data with replacements applied.

Raises:

TypeError – If df_l or df_r is not a pandas DataFrame.
ValueError –
- If one of pivot_l and pivot_r is not defined.
- If neither rep_map nor rep_c is defined.
- If replacement source columns are not found in df_r.

Notes

This function logs errors and returns None instead of raising exceptions.

cdm_reader_mapper.common.select module¶

Common Data Model (CDM) pandas selection operators.

Created on Wed Jul 3 09:48:18 2019

@author: iregon

cdm_reader_mapper.common.select.split_by_boolean(data, mask, boolean, reset_index=False, inverse=False, return_rejected=False)[source]¶

Split a DataFrame or an Iterable of DataFrames using a boolean mask via split_dataframe_by_boolean.

Parameters:

data (pd.DataFrame or iterable of pd.DataFrame) – DataFrame to be split.
mask (pd.DataFrame or iterable of pd.DataFrame) – Boolean mask with the same length as data.
boolean (bool) – Determines mask interpretation:
- If True select rows where all mask columns are True.
- If False select rows where any mask column is False.
reset_index (bool, optional) – If True reset the index of returned DataFrames.
inverse (bool, optional) – If True invert the selection performed by the underlying function.
return_rejected (bool, optional) – If True return rejected rows as the second output. If False the rejected output is empty but dtype-preserving.
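
The two interpretations of boolean can be sketched with plain pandas row reductions. This is an illustration under the parameter descriptions above, not the library's code; the sample data and variable names are made up.

```python
import pandas as pd

data = pd.DataFrame({"value": [10, 20, 30]})
mask = pd.DataFrame({"a": [True, True, False], "b": [True, False, False]})

# boolean=True: keep rows where every mask column is True.
all_true = mask.all(axis=1)
selected_true = data[all_true]

# boolean=False: keep rows where at least one mask column is False.
any_false = (~mask).any(axis=1)
selected_false = data[any_false]
```

With these inputs the two selections are complementary: the first row satisfies the all-True condition and the remaining two contain at least one False.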

Return type:

tuple[DataFrame | ParquetStreamReader, DataFrame | ParquetStreamReader, Index | MultiIndex, Index | MultiIndex]

Returns:

cdm_reader_mapper.common.select.split_by_boolean_false(data, mask, reset_index=False, inverse=False, return_rejected=False)[source]¶

Split a DataFrame or an Iterable of DataFrames where the boolean mask is False.

Parameters:

data (pd.DataFrame or Iterable[pd.DataFrame]) – DataFrame to be split.
mask (pd.DataFrame or Iterable[pd.DataFrame]) – Boolean mask with the same length as data.
reset_index (bool, optional) – If True reset indices in returned DataFrames.
inverse (bool, optional) – If True invert the selection.
return_rejected (bool, optional) – If True return rejected rows as the second output. If False the rejected output is empty but dtype-preserving.

Return type:

tuple[DataFrame | ParquetStreamReader, DataFrame | ParquetStreamReader, Index | MultiIndex, Index | MultiIndex]

Returns:

cdm_reader_mapper.common.select.split_by_boolean_true(data, mask, reset_index=False, inverse=False, return_rejected=False)[source]¶

Split a DataFrame or an Iterable of DataFrames where the boolean mask is True.

Parameters:

data (pd.DataFrame or iterable of pd.DataFrame) – DataFrame to be split.
mask (pd.DataFrame or iterable of pd.DataFrame) – Boolean mask with the same length as data.
reset_index (bool, optional) – If True reset indices in returned DataFrames.
inverse (bool, optional) – If True invert the selection.
return_rejected (bool, optional) – If True return rejected rows as the second output. If False the rejected output is empty but dtype-preserving.
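
For iterable input, the selection can be pictured as a chunk-by-chunk operation. The sketch below pairs data chunks with mask chunks in plain pandas and concatenates the result only for display; it is an assumption about the general approach, not the library's implementation, and the chunk contents are made up.

```python
import pandas as pd

data_chunks = [
    pd.DataFrame({"value": [1, 2]}, index=[0, 1]),
    pd.DataFrame({"value": [3, 4]}, index=[2, 3]),
]
mask_chunks = [
    pd.DataFrame({"ok": [True, False]}, index=[0, 1]),
    pd.DataFrame({"ok": [True, True]}, index=[2, 3]),
]

selected_chunks = []
for chunk, chunk_mask in zip(data_chunks, mask_chunks):
    # Keep the rows of this chunk where all mask columns are True.
    keep = chunk_mask.all(axis=1)
    selected_chunks.append(chunk[keep])

# Concatenated here only to inspect the result in one frame.
selected = pd.concat(selected_chunks)
```

According to the return type above, the function returns a ParquetStreamReader for streamed input rather than a concatenated DataFrame.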

Return type:

tuple[DataFrame | ParquetStreamReader, DataFrame | ParquetStreamReader, Index | MultiIndex, Index | MultiIndex]

Returns:

cdm_reader_mapper.common.select.split_by_column_entries(data, selection, reset_index=False, inverse=False, return_rejected=False)[source]¶

Split a DataFrame or an Iterable of DataFrames based on matching values in a given column.

Parameters:

data (pd.DataFrame or iterable of pd.DataFrame) – DataFrame to be split.
selection (dict) – Mapping of a column name to an iterable of allowed values. Example: {"city": ["London", "Berlin"]}.
reset_index (bool, optional) – Whether to reset index in returned DataFrames.
inverse (bool, optional) – If True invert the selection.
return_rejected (bool, optional) – If True return rejected rows as the second output. If False the rejected output is empty but dtype-preserving.
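
The selection mapping can be read as a membership test on one column. A minimal pandas sketch of that reading follows; it is not the library's code, and the temp column and sample values are made up for illustration.

```python
import pandas as pd

data = pd.DataFrame(
    {"city": ["London", "Paris", "Berlin"], "temp": [11.0, 14.5, 9.0]}
)
selection = {"city": ["London", "Berlin"]}

# Unpack the single column name and its allowed values.
column, allowed = next(iter(selection.items()))

# Rows whose column value is among the allowed values are selected.
in_selection = data[column].isin(allowed)
selected = data[in_selection]
rejected = data[~in_selection]
```

Negating the membership mask, as done for rejected, is also what inverse=True would amount to under this reading.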

Return type:

tuple[DataFrame | ParquetStreamReader, DataFrame | ParquetStreamReader, Index | MultiIndex, Index | MultiIndex]

Returns:

cdm_reader_mapper.common.select.split_by_index(data, index, reset_index=False, inverse=False, return_rejected=False)[source]¶

Split a DataFrame or an Iterable of DataFrames by selecting specific index labels.

Parameters:

data (pd.DataFrame or iterable of pd.DataFrame) – DataFrame to be split.
index (label or sequence of labels) – Index values to select.
reset_index (bool, optional) – If True reset index in returned DataFrames.
inverse (bool, optional) – If True select rows not in index.
return_rejected (bool, optional) – If True return rejected rows as the second output. If False the rejected output is empty but dtype-preserving.
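
Selecting by index labels can be sketched with a membership test on the index. This is a plain pandas illustration with made-up data, not the library's implementation.

```python
import pandas as pd

data = pd.DataFrame({"value": [1.5, 2.5, 3.5, 4.5]}, index=[10, 11, 12, 13])
index = [11, 13]

# True for rows whose index label is among the requested labels.
in_index = data.index.isin(index)

selected = data[in_index]
# The complement corresponds to the rows not in index.
rejected = data[~in_index]
```

Under this reading, inverse=True would swap the roles of the two frames, so that the rows not in index are the ones selected.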

Return type:

tuple[DataFrame | ParquetStreamReader, DataFrame | ParquetStreamReader, Index | MultiIndex, Index | MultiIndex]

Returns: