cdm_reader_mapper.common package¶
Common Data Model (CDM) reader and mapper common pandas operators.
- cdm_reader_mapper.common.count_by_cat(data, columns=None)[source]¶
Count unique values per column in a DataFrame or an Iterable of DataFrames.
- Parameters:
data (pandas.DataFrame or Iterable[pd.DataFrame]) – Input dataset.
columns (str, list or tuple, optional) – Name(s) of the data column(s) to be selected. If None, all columns are used.
- Returns:
Dict[str | tuple[str, str], int] – Dictionary where each key is a column name and each value is a dictionary mapping unique values (including NaN as ‘nan’) to their counts.
Notes
Works with large files via ParquetStreamReader by iterating through chunks.
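The chunk-wise counting described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the helper name count_values_per_column is hypothetical, and a plain list of DataFrames stands in for a ParquetStreamReader.

```python
from collections import Counter

import pandas as pd


def count_values_per_column(chunks, columns=None):
    """Accumulate value counts per column over an iterable of DataFrames."""
    if isinstance(columns, str):
        columns = [columns]
    counts = {}
    for chunk in chunks:
        selected = chunk.columns if columns is None else columns
        for col in selected:
            # dropna=False keeps missing values; they are stored under "nan".
            for value, n in chunk[col].value_counts(dropna=False).items():
                key = "nan" if pd.isna(value) else value
                counts.setdefault(col, Counter())[key] += int(n)
    return {col: dict(counter) for col, counter in counts.items()}


# Two chunks stand in for a stream of DataFrames (hypothetical data).
chunks = [
    pd.DataFrame({"kind": ["a", "b", None]}),
    pd.DataFrame({"kind": ["a", None, "a"]}),
]
result = count_values_per_column(chunks)
```

Only one chunk is held in memory at a time, so the approach also applies to inputs that do not fit into memory as a whole.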
- cdm_reader_mapper.common.get_filename(pattern, path='.', extension='pq', separator='_')[source]¶
Construct a filename from a sequence of string components.
- Parameters:
pattern (Sequence[str]) – Sequence of string components to be joined with separator (e.g., [“sales”, “2024”, “Q1”]). Empty or falsy items are ignored.
path (str or Path-like, optional) – Directory in which the file should be placed. Default is current directory “.”.
extension (str, optional) – File extension, with or without a leading dot (e.g., “pq” or “.pq”). Default is “pq”.
separator (str, optional) – Separator to join the pattern components (default “_”).
- Returns:
str – Full file path including directory, filename, and extension. Returns an empty string if pattern is empty or contains no truthy elements.
Notes
Any empty or falsy parts of pattern will be removed.
The extension is normalized to always begin with a leading dot.
Examples
>>> get_filename(["data", "2025"], extension="psv", separator="-")
'./data-2025.psv'
>>> get_filename(["report", ""], path="/tmp", extension=".txt")
'/tmp/report.txt'
>>> get_filename(["part1", "part2"], extension="psv", separator="_")
'./part1_part2.psv'
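The rules listed in the parameters and notes (drop falsy parts, normalize the extension, join with the separator) can be sketched in a few lines. The function build_filename below is a hypothetical stand-in written only from the documented behaviour. It uses posixpath so the result does not depend on the operating system, which is an assumption and not necessarily what the library does.

```python
import posixpath


def build_filename(pattern, path=".", extension="pq", separator="_"):
    """Hypothetical re-implementation of the documented rules."""
    parts = [str(part) for part in pattern if part]  # drop empty or falsy items
    if not parts:
        return ""
    # Normalize the extension so that it always starts with a dot.
    ext = extension if extension.startswith(".") else f".{extension}"
    return posixpath.join(str(path), separator.join(parts) + ext)


with_defaults = build_filename(["data", "2025"])
custom = build_filename(["report", ""], path="/tmp", extension=".txt")
empty = build_filename(["", None])
```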
- cdm_reader_mapper.common.get_length(data)[source]¶
Get the total number of rows in a pandas object.
- Parameters:
data (pandas.DataFrame or Iterable[pd.DataFrame]) – Input dataset.
- Returns:
int – Total number of rows.
Notes
Works with large files via ParquetStreamReader by using a specialized handler to count rows without loading the entire file into memory.
- cdm_reader_mapper.common.load_file(name, github_url='https://github.com/glamod/cdm-testdata', branch='main', cache=True, cache_dir=PosixPath('/home/docs/.cache/cdm-testdata'), clear_cache=False, within_drs=True)[source]¶
Load a file from an online GitHub-like repository.
- Parameters:
name (str or os.PathLike) – Name of the file containing the dataset.
github_url (str) – URL to the GitHub repository where the data is stored.
branch (str, optional) – For GitHub-hosted files, the branch to download from.
cache (bool) – If True, then cache data locally for use on subsequent calls.
cache_dir (Path) – The directory in which to search for and write cached data.
clear_cache (bool) – If True, clear cache directory.
within_drs (bool) – If True, then download data within data reference syntax.
- Returns:
Path – Destination path of the downloaded file.
- cdm_reader_mapper.common.replace_columns(df_l, df_r, pivot_c=None, pivot_l=None, pivot_r=None, rep_c=None, rep_map=None)[source]¶
Replace columns in one DataFrame using row-matching from another.
This function works for both a pd.DataFrame and any Iterable of pandas DataFrames.
- Parameters:
df_l (pandas.DataFrame or Iterable[pd.DataFrame]) – The left DataFrame whose columns will be replaced.
df_r (pandas.DataFrame or Iterable[pd.DataFrame]) – The right DataFrame providing replacement values.
pivot_c (str, optional) – A single pivot column present in both DataFrames. Overrides pivot_l and pivot_r.
pivot_l (str, optional) – Pivot column in df_l. Used only when pivot_c is not supplied.
pivot_r (str, optional) – Pivot column in df_r. Used only when pivot_c is not supplied.
rep_c (str or list of str, optional) – One or more column names to replace in df_l. Ignored if rep_map is supplied.
rep_map (dict, optional) – Mapping between left and right column names as {left_col: right_col}.
- Returns:
pd.DataFrame or ParquetStreamReader – Updated data with replacements applied.
- Raises:
TypeError – If df_l or df_r is not a pandas DataFrame.
- If one of pivot_l and pivot_r is not defined.
- If neither rep_map nor rep_c is defined.
- If replacement source columns are not found in df_r.
Notes
This function logs errors and returns None instead of raising exceptions.
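A pivot-based replacement of this kind can be sketched with plain pandas. The helper replace_by_pivot is hypothetical and covers only the rep_map case on in-memory DataFrames. Keeping the original value for rows without a match in the right DataFrame is an assumption of this sketch, not documented behaviour.

```python
import pandas as pd


def replace_by_pivot(df_l, df_r, pivot_l, pivot_r, rep_map):
    """Overwrite columns of df_l with values looked up in df_r via pivot columns."""
    out = df_l.copy()
    # One row per pivot value, indexed by the pivot, serves as a lookup table.
    lookup = df_r.drop_duplicates(subset=pivot_r).set_index(pivot_r)
    for col_l, col_r in rep_map.items():
        replacement = out[pivot_l].map(lookup[col_r])
        # Assumption: rows without a match in df_r keep their original value.
        out[col_l] = replacement.where(replacement.notna(), out[col_l])
    return out


left = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
right = pd.DataFrame({"key": [1, 3], "label": ["A", "C"]})
updated = replace_by_pivot(left, right, "id", "key", {"name": "label"})
```

Here pivot_l and pivot_r differ ("id" and "key"). When both DataFrames share the pivot name, the same name would be passed for both, which corresponds to pivot_c.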
- cdm_reader_mapper.common.split_by_boolean(data, mask, boolean, reset_index=False, inverse=False, return_rejected=False)[source]¶
Split a DataFrame or an Iterable of DataFrames using a boolean mask via split_dataframe_by_boolean.
- Parameters:
data (pd.DataFrame or iterable of pd.DataFrame) – DataFrame to be split.
mask (pd.DataFrame or iterable of pd.DataFrame) – Boolean mask with the same length as data.
boolean (bool) – Determines mask interpretation:
If True select rows where all mask columns are True.
If False select rows where any mask column is False.
reset_index (bool, optional) – If True reset the index of returned DataFrames.
inverse (bool, optional) – If True invert the selection performed by the underlying function.
return_rejected (bool, optional) – If True return rejected rows as the second output. If False the rejected output is empty but dtype-preserving.
- Return type:
tuple[DataFrame | ParquetStreamReader, DataFrame | ParquetStreamReader, Index | MultiIndex, Index | MultiIndex]
- Returns:
tuple of (pd.DataFrame or ParquetStreamReader, pd.DataFrame or ParquetStreamReader, pd.Index or pd.MultiIndex, pd.Index or pd.MultiIndex) – Selected rows, rejected rows, original indexes of selection and original indexes of rejection.
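The two interpretations of the boolean argument can be illustrated on an in-memory DataFrame. The helper split_with_mask is a hypothetical sketch that follows the parameter description above. It always returns the rejected rows and leaves out reset_index, inverse and the streaming case.

```python
import pandas as pd


def split_with_mask(df, mask, boolean=True):
    """Return (selected, rejected, selected_index, rejected_index)."""
    if boolean:
        keep = mask.all(axis=1)    # every mask column is True
    else:
        keep = ~mask.all(axis=1)   # at least one mask column is False
    selected, rejected = df[keep], df[~keep]
    return selected, rejected, selected.index, rejected.index


data = pd.DataFrame({"value": [10, 20, 30]})
mask = pd.DataFrame({"ok_a": [True, True, False], "ok_b": [True, False, False]})

selected, rejected, sel_idx, rej_idx = split_with_mask(data, mask, boolean=True)
failed, _, failed_idx, _ = split_with_mask(data, mask, boolean=False)
```

With boolean=True only the first row passes both mask columns. With boolean=False the remaining two rows are selected instead.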
- cdm_reader_mapper.common.split_by_boolean_false(data, mask, reset_index=False, inverse=False, return_rejected=False)[source]¶
Split a DataFrame or an Iterable of DataFrames where the boolean mask is False.
- Parameters:
data (pd.DataFrame or Iterable[pd.DataFrame]) – DataFrame to be split.
mask (pd.DataFrame or Iterable[pd.DataFrame]) – Boolean mask with the same length as data.
reset_index (bool, optional) – If True reset indices in returned DataFrames.
inverse (bool, optional) – If True invert the selection.
return_rejected (bool, optional) – If True return rejected rows as the second output. If False the rejected output is empty but dtype-preserving.
- Return type:
tuple[DataFrame | ParquetStreamReader, DataFrame | ParquetStreamReader, Index | MultiIndex, Index | MultiIndex]
- Returns:
tuple of (pd.DataFrame or ParquetStreamReader, pd.DataFrame or ParquetStreamReader, pd.Index or pd.MultiIndex, pd.Index or pd.MultiIndex) – Selected rows, rejected rows, original indexes of selection and original indexes of rejection.
- cdm_reader_mapper.common.split_by_boolean_true(data, mask, reset_index=False, inverse=False, return_rejected=False)[source]¶
Split a DataFrame or an Iterable of DataFrames where the boolean mask is True.
- Parameters:
data (pd.DataFrame or iterable of pd.DataFrame) – DataFrame to be split.
mask (pd.DataFrame or iterable of pd.DataFrame) – Boolean mask with the same length as data.
reset_index (bool, optional) – If True reset indices in returned DataFrames.
inverse (bool, optional) – If True invert the selection.
return_rejected (bool, optional) – If True return rejected rows as the second output. If False the rejected output is empty but dtype-preserving.
- Return type:
tuple[DataFrame | ParquetStreamReader, DataFrame | ParquetStreamReader, Index | MultiIndex, Index | MultiIndex]
- Returns:
tuple of (pd.DataFrame or ParquetStreamReader, pd.DataFrame or ParquetStreamReader, pd.Index or pd.MultiIndex, pd.Index or pd.MultiIndex) – Selected rows (all mask columns True), rejected rows, original indexes of selection and original indexes of rejection.
- cdm_reader_mapper.common.split_by_column_entries(data, selection, reset_index=False, inverse=False, return_rejected=False)[source]¶
Split a DataFrame or an Iterable of DataFrames based on matching values in a given column.
- Parameters:
data (pd.DataFrame or iterable of pd.DataFrame) – DataFrame to be split.
selection (dict) – Mapping of a column name to an iterable of allowed values. Example: {“city”: [“London”, “Berlin”]}.
reset_index (bool, optional) – If True reset index in returned DataFrames.
inverse (bool, optional) – If True invert the selection.
return_rejected (bool, optional) – If True return rejected rows as the second output. If False the rejected output is empty but dtype-preserving.
- Return type:
tuple[DataFrame | ParquetStreamReader, DataFrame | ParquetStreamReader, Index | MultiIndex, Index | MultiIndex]
- Returns:
tuple of (pd.DataFrame or ParquetStreamReader, pd.DataFrame or ParquetStreamReader, pd.Index or pd.MultiIndex, pd.Index or pd.MultiIndex) – Selected rows, rejected rows, original indexes of selection and original indexes of rejection.
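Selection by column entries amounts to an isin test on the named column. The sketch below uses the hypothetical helper split_by_values on an in-memory DataFrame. Combining several columns with a logical AND is an assumption, since the documentation only shows a single-column selection.

```python
import pandas as pd


def split_by_values(df, selection, inverse=False):
    """Return (selected, rejected, selected_index, rejected_index)."""
    keep = pd.Series(True, index=df.index)
    for column, allowed in selection.items():
        # Assumption: several columns are combined with a logical AND.
        keep &= df[column].isin(list(allowed))
    if inverse:
        keep = ~keep
    selected, rejected = df[keep], df[~keep]
    return selected, rejected, selected.index, rejected.index


stations = pd.DataFrame(
    {"city": ["London", "Paris", "Berlin"], "temp": [11, 14, 9]}
)
selection = {"city": ["London", "Berlin"]}
selected, rejected, sel_idx, rej_idx = split_by_values(stations, selection)
inverted, _, inv_idx, _ = split_by_values(stations, selection, inverse=True)
```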
- cdm_reader_mapper.common.split_by_index(data, index, reset_index=False, inverse=False, return_rejected=False)[source]¶
Split a DataFrame or an Iterable of DataFrames by selecting specific index labels.
- Parameters:
data (pd.DataFrame or iterable of pd.DataFrame) – DataFrame to be split.
index (label or sequence of labels) – Index values to select.
reset_index (bool, optional) – If True reset index in returned DataFrames.
inverse (bool, optional) – If True select rows not in index.
return_rejected (bool, optional) – If True return rejected rows as the second output. If False the rejected output is empty but dtype-preserving.
- Return type:
tuple[DataFrame | ParquetStreamReader, DataFrame | ParquetStreamReader, Index | MultiIndex, Index | MultiIndex]
- Returns:
tuple of (pd.DataFrame or ParquetStreamReader, pd.DataFrame or ParquetStreamReader, pd.Index or pd.MultiIndex, pd.Index or pd.MultiIndex) – Selected rows, rejected rows, original indexes of selection and original indexes of rejection.
Submodules¶
cdm_reader_mapper.common.getting_files module¶
Internal pandas local file operator.
- cdm_reader_mapper.common.getting_files.get_path(path)[source]¶
Get path either from _files(path) or directly from file system.
- Parameters:
path (str | Path) – If it points to an existing file on disk, that file is returned. Otherwise the value is interpreted as a module name or as <module>:<subpath> (e.g. "mypkg:templates/index.html").
- Returns:
Path | None – The resolved path or None if the resource cannot be found.
- cdm_reader_mapper.common.getting_files.load_file(name, github_url='https://github.com/glamod/cdm-testdata', branch='main', cache=True, cache_dir=PosixPath('/home/docs/.cache/cdm-testdata'), clear_cache=False, within_drs=True)[source]¶
Load a file from an online GitHub-like repository.
- Parameters:
name (str or os.PathLike) – Name of the file containing the dataset.
github_url (str) – URL to the GitHub repository where the data is stored.
branch (str, optional) – For GitHub-hosted files, the branch to download from.
cache (bool) – If True, then cache data locally for use on subsequent calls.
cache_dir (Path) – The directory in which to search for and write cached data.
clear_cache (bool) – If True, clear cache directory.
within_drs (bool) – If True, then download data within data reference syntax.
- Returns:
Path – Destination path of the downloaded file.
cdm_reader_mapper.common.inspect module¶
Common Data Model (CDM) pandas inspection operators.
Created on Wed Jul 3 09:48:18 2019
@author: iregon
- cdm_reader_mapper.common.inspect.count_by_cat(data, columns=None)[source]¶
Count unique values per column in a DataFrame or an Iterable of DataFrames.
- Parameters:
data (pandas.DataFrame or Iterable[pd.DataFrame]) – Input dataset.
columns (str, list or tuple, optional) – Name(s) of the data column(s) to be selected. If None, all columns are used.
- Returns:
Dict[str | tuple[str, str], int] – Dictionary where each key is a column name and each value is a dictionary mapping unique values (including NaN as ‘nan’) to their counts.
Notes
Works with large files via ParquetStreamReader by iterating through chunks.
- cdm_reader_mapper.common.inspect.get_length(data)[source]¶
Get the total number of rows in a pandas object.
- Parameters:
data (pandas.DataFrame or Iterable[pd.DataFrame]) – Input dataset.
- Returns:
int – Total number of rows.
Notes
Works with large files via ParquetStreamReader by using a specialized handler to count rows without loading the entire file into memory.
cdm_reader_mapper.common.io_files module¶
Utility function for reading and writing files.
- cdm_reader_mapper.common.io_files.get_filename(pattern, path='.', extension='pq', separator='_')[source]¶
Construct a filename from a sequence of string components.
- Parameters:
pattern (Sequence[str]) – Sequence of string components to be joined with separator (e.g., [“sales”, “2024”, “Q1”]). Empty or falsy items are ignored.
path (str or Path-like, optional) – Directory in which the file should be placed. Default is current directory “.”.
extension (str, optional) – File extension, with or without a leading dot (e.g., “pq” or “.pq”). Default is “pq”.
separator (str, optional) – Separator to join the pattern components (default “_”).
- Returns:
str – Full file path including directory, filename, and extension. Returns an empty string if pattern is empty or contains no truthy elements.
Notes
Any empty or falsy parts of pattern will be removed.
The extension is normalized to always begin with a leading dot.
Examples
>>> get_filename(["data", "2025"], extension="psv", separator="-")
'./data-2025.psv'
>>> get_filename(["report", ""], path="/tmp", extension=".txt")
'/tmp/report.txt'
>>> get_filename(["part1", "part2"], extension="psv", separator="_")
'./part1_part2.psv'
cdm_reader_mapper.common.iterators module¶
Utilities for handling pandas TextParser objects safely.
- class cdm_reader_mapper.common.iterators.ParquetStreamReader(source)[source]¶
Bases: object
A wrapper that mimics pandas.io.parsers.TextFileReader.
- Parameters:
source (iterable or iterator or callable) – Data source yielding pandas.DataFrame or pandas.Series objects. If a callable is provided, it must return a fresh iterator each time it is called (useful for copying/resetting the stream).
- Variables:
columns (list or pandas.Index) – Column labels inferred from the first chunk.
dtypes (dict or pandas.Series) – Data types inferred from the first chunk.
attrs (dict) – User-defined metadata associated with the stream.
Notes
The stream is consumed as it is iterated.
Use copy() to create an independent iterator.
Some operations (e.g., read()) load all data into memory and may not be suitable for large datasets.
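The consume-once behaviour, the push-back of a peeked chunk and the role of a callable source can be illustrated with a small stand-in class. ChunkStream below is hypothetical and is not the library class. Plain lists stand in for DataFrame chunks, and its copy() simply restarts from the callable source, which may differ from how ParquetStreamReader tracks its position.

```python
from collections import deque


class ChunkStream:
    """Minimal stand-in for a chunked reader (not the library class)."""

    def __init__(self, source):
        # A callable source returns a fresh iterator on every call,
        # which is what makes copy() possible in this sketch.
        self._factory = source if callable(source) else None
        self._iterator = iter(source() if callable(source) else source)
        self._pushed_back = deque()

    def __iter__(self):
        return self

    def __next__(self):
        if self._pushed_back:
            return self._pushed_back.popleft()
        return next(self._iterator)

    def prepend(self, chunk):
        """Push a chunk back onto the front of the stream."""
        self._pushed_back.appendleft(chunk)

    def copy(self):
        if self._factory is None:
            raise ValueError("copy() needs a callable source in this sketch")
        return ChunkStream(self._factory)


stream = ChunkStream(lambda: iter([[1, 2], [3, 4], [5]]))
first = next(stream)      # peek at the first chunk ...
stream.prepend(first)     # ... and push it back so nothing is lost
fresh = stream.copy()     # independent iteration state
consumed = list(stream)   # iterating consumes the stream
leftover = list(stream)   # nothing remains afterwards
replayed = list(fresh)    # the copy is unaffected
```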
- copy()[source]¶
Create an independent copy of the stream.
- Returns:
ParquetStreamReader – A new stream reader with independent iteration state.
- Raises:
ValueError – If the stream has been closed.
- property empty: bool¶
Check whether the stream has any remaining data.
- Returns:
bool – True if the stream is exhausted, False otherwise.
Notes
This property creates a temporary copy of the stream to check for remaining elements without consuming the original.
- get_chunk()[source]¶
Return the next available chunk.
This is equivalent to calling next(reader) and is provided for API compatibility with pandas readers.
- Returns:
pandas.DataFrame or pandas.Series – The next chunk of data.
- prepend(chunk)[source]¶
Push a chunk back onto the front of the stream.
Useful for peeking at the first chunk without losing it.
- Parameters:
chunk (pandas.DataFrame or pandas.Series) – The chunk to prepend.
- read()[source]¶
Read all remaining data into a single DataFrame.
This consumes the entire stream and concatenates all remaining chunks into one DataFrame.
- Returns:
pandas.DataFrame – Concatenated result of all remaining chunks. Returns an empty DataFrame if the stream is exhausted.
Warning
This operation loads all data into memory and may not be suitable for large datasets.
- reset_index(drop=False)[source]¶
Reset the index across all chunks to a continuous range.
- Parameters:
drop (bool, default False) – If True, do not insert the old index as a column. If False, the old index is inserted as a column named “index”.
- Returns:
ParquetStreamReader – A new stream reader with reindexed chunks.
- Raises:
ValueError – If the stream has been closed.
- class cdm_reader_mapper.common.iterators.ProcessFunction(data, func, func_args=None, func_kwargs=None, **kwargs)[source]¶
Bases: object
Stores data and a callable function with optional arguments for processing.
- Parameters:
data (pd.DataFrame, pd.Series, iterable of pd.DataFrame or iterable of pd.Series) – Input data to be processed.
func (Callable) – A callable that will be applied to data.
func_args (Any, list of Any or tuple of Any, optional) – Positional arguments to pass to func.
func_kwargs (dict, optional) – Keyword arguments to pass to func.
**kwargs (Any) – Additional metadata or configuration parameters stored with the instance.
- cdm_reader_mapper.common.iterators.ensure_parquet_reader(obj)[source]¶
Ensure obj is a ParquetStreamReader.
- Parameters:
obj (Any) – Object that may represent a ParquetStreamReader, an iterator of pd.DataFrame or pd.Series objects, or a static value.
- Returns:
Any – If obj is already a ParquetStreamReader, it is returned unchanged. If obj is an iterator, it is converted into a ParquetStreamReader. Otherwise, obj is returned as-is (treated as a static value).
- cdm_reader_mapper.common.iterators.is_valid_iterator(reader)[source]¶
Check if reader is a valid Iterable.
- cdm_reader_mapper.common.iterators.parquet_stream_from_iterable(iterable)[source]¶
Stream an iterable of DataFrame/Series to parquet and return a disk-backed ParquetStreamReader.
Memory usage remains constant.
- Parameters:
iterable (Iterable of pd.DataFrame or pd.Series) – An iterable of pandas DataFrame or Series objects to be streamed to disk.
- Returns:
ParquetStreamReader – A disk-backed stream reader that lazily reads the provided iterable from Parquet files stored in a temporary directory.
- Raises:
ValueError – If the input iterable is empty.
TypeError – If elements in the iterable are not pandas DataFrame or Series objects, or if mixed types are provided across chunks.
- cdm_reader_mapper.common.iterators.process_disk_backed(reader, func, func_args=None, func_kwargs=None, requested_types=(<class 'pandas.core.frame.DataFrame'>, <class 'pandas.core.series.Series'>), non_data_output='first', non_data_proc=None, non_data_proc_args=None, non_data_proc_kwargs=None, makecopy=True)[source]¶
Consume a stream of DataFrames, process them, and return a tuple of results.
DataFrames are cached to disk (Parquet) and returned as generators.
- Parameters:
reader (Iterator of pd.DataFrame or pd.Series) – Input stream of DataFrame or Series objects to be processed in chunks.
func (Callable) – Function applied to each synchronized set of chunks from the stream. May return data objects (pd.DataFrame or pd.Series) and/or metadata.
func_args (tuple of Any, optional) – Additional positional arguments passed to func. Defaults to empty tuple.
func_kwargs (dict, optional) – Additional keyword arguments passed to func. Defaults to empty dict.
requested_types (type or tuple of type, default (pd.DataFrame, pd.Series)) – Types treated as data outputs from func. All other outputs are treated as metadata.
non_data_output ({"first", "acc"}, default "first") – Strategy for handling non-data outputs:
- “first”: only the first chunk’s metadata is kept
- “acc”: accumulate metadata across all chunks
non_data_proc (Callable, optional) – Optional function applied to aggregated non-data outputs after processing.
non_data_proc_args (tuple of Any, optional) – Positional arguments for non_data_proc.
non_data_proc_kwargs (dict, optional) – Keyword arguments for non_data_proc.
makecopy (bool, default True) – If True, ensures independent copies of input streams are used internally.
- Returns:
tuple of Any or None – A tuple containing:
- One or more ParquetStreamReader objects for chunked data outputs (if any data was produced)
- Processed non-data outputs (metadata), optionally transformed by non_data_proc
- Raises:
ValueError – If non_data_proc is provided but not callable.
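The difference between the "first" and "acc" strategies for non-data outputs can be shown with a simplified in-memory loop. The function process_chunks is a hypothetical sketch: nothing is written to disk, lists stand in for DataFrame chunks, and func is assumed to return a (data, metadata) pair.

```python
def process_chunks(chunks, func, non_data_output="first"):
    """Apply func to every chunk; func returns (data, metadata)."""
    if non_data_output not in ("first", "acc"):
        raise ValueError(f"unknown strategy: {non_data_output!r}")
    data_out, meta_out = [], []
    for position, chunk in enumerate(chunks):
        data, meta = func(chunk)
        data_out.append(data)
        # "first" keeps metadata from the first chunk only, "acc" keeps all.
        if non_data_output == "acc" or position == 0:
            meta_out.append(meta)
    return data_out, meta_out


def double_and_count(chunk):
    """Example function: doubled values as data, chunk length as metadata."""
    return [2 * value for value in chunk], len(chunk)


chunks = [[1, 2], [3]]
data_first, meta_first = process_chunks(chunks, double_and_count, "first")
data_acc, meta_acc = process_chunks(chunks, double_and_count, "acc")
```

The data outputs are the same under both strategies. Only the amount of retained metadata differs.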
- cdm_reader_mapper.common.iterators.process_function(data_only=False, postprocessing=None)[source]¶
Decorator to apply function to both pd.DataFrame and Iterable[pd.DataFrame].
- Parameters:
data_only (bool, default False) – If True, only the primary data output is returned from the processed result.
postprocessing (dict, optional) – Optional configuration for a postprocessing step applied to each result. Expected keys:
- “func”: callable applied to each DataFrame/Series/stream output
- “kwargs”: list or dict of argument names taken from the original call
- Returns:
Callable – A decorator that wraps a function so it can operate on both in-memory pandas objects and disk-backed ParquetStreamReader streams.
- Raises:
ValueError – If a provided postprocessing function is not callable.
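The general idea of such a decorator, one function body serving both a single DataFrame and a stream of DataFrames, can be sketched as follows. The decorator chunk_aware is hypothetical and much simpler than process_function: it has no data_only or postprocessing handling and returns a lazy generator instead of a disk-backed reader.

```python
import functools

import pandas as pd


def chunk_aware(func):
    """Let func accept either one DataFrame or an iterable of DataFrames."""

    @functools.wraps(func)
    def wrapper(data, *args, **kwargs):
        if isinstance(data, pd.DataFrame):
            return func(data, *args, **kwargs)
        # Otherwise treat data as an iterable of DataFrames and stay lazy.
        return (func(chunk, *args, **kwargs) for chunk in data)

    return wrapper


@chunk_aware
def add_one(df, column):
    return df.assign(**{column: df[column] + 1})


single = add_one(pd.DataFrame({"x": [1, 2]}), "x")
streamed = list(add_one([pd.DataFrame({"x": [1]}), pd.DataFrame({"x": [2]})], "x"))
```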
cdm_reader_mapper.common.json_dict module¶
JSON dictionary manipulator.
- cdm_reader_mapper.common.json_dict.collect_json_files(idir, *args, base=None, name=None)[source]¶
Collect JSON files recursively based on directory and optional subdirectories.
- Parameters:
- Returns:
list of Path – List of matching JSON file paths.
- cdm_reader_mapper.common.json_dict.combine_dicts(list_of_files, base=None)[source]¶
Combine multiple JSON files or dictionaries into a single dictionary.
Supports nested ‘substitute’ references to recursively load additional JSON files.
cdm_reader_mapper.common.logging_hdlr module¶
Initialize logger.
Created on Wed Apr 3 08:45:03 2019
@author: iregon
- cdm_reader_mapper.common.logging_hdlr.init_logger(module, level='INFO', fn=None)[source]¶
Initialize and configure a logger for a given module.
- Parameters:
- Returns:
logging.Logger – Configured logger instance.
Notes
This function calls logging.basicConfig to configure the root logger. Repeated calls to this function may not reconfigure logging unless reload(logging) is used.
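The behaviour described in the note comes from the standard library itself: logging.basicConfig does nothing when the root logger already has handlers. The snippet below demonstrates this with the standard library only and does not call init_logger. The force=True argument (available since Python 3.8) is shown as an alternative way to reconfigure the root logger. Whether init_logger exposes it is not stated in this documentation.

```python
import logging

# First configuration; force=True clears any handlers that already exist,
# so the example starts from a known state.
logging.basicConfig(level=logging.INFO, force=True)

# A second plain call is a no-op because the root logger now has a handler.
logging.basicConfig(level=logging.DEBUG)
level_after_second_call = logging.getLogger().level

# Passing force=True removes the existing handlers and reconfigures.
logging.basicConfig(level=logging.DEBUG, force=True)
level_after_forced_call = logging.getLogger().level
```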
cdm_reader_mapper.common.replace module¶
Common Data Model (CDM) pandas replacement operators.
Created on Wed Jul 3 09:48:18 2019
Replace columns from the right DataFrame into the left DataFrame.
Replacement occurs on a pivot column, which might have the same name in both DataFrames (pivot_c) or different names (pivot_l and pivot_r).
Can replace one or multiple columns and supports multi-indexing (tested only on the left DataFrame, so far…)
- Replacement arguments:
rep_c: list or string of column name(s) to replace; they have the same name in left and right
rep_map: dictionary with {col_l: col_r…} if the names are not the same
@author: iregon
- cdm_reader_mapper.common.replace.replace_columns(df_l, df_r, pivot_c=None, pivot_l=None, pivot_r=None, rep_c=None, rep_map=None)[source]¶
Replace columns in one DataFrame using row-matching from another.
This function works for both a pd.DataFrame and any Iterable of pandas DataFrames.
- Parameters:
df_l (pandas.DataFrame or Iterable[pd.DataFrame]) – The left DataFrame whose columns will be replaced.
df_r (pandas.DataFrame or Iterable[pd.DataFrame]) – The right DataFrame providing replacement values.
pivot_c (str, optional) – A single pivot column present in both DataFrames. Overrides pivot_l and pivot_r.
pivot_l (str, optional) – Pivot column in df_l. Used only when pivot_c is not supplied.
pivot_r (str, optional) – Pivot column in df_r. Used only when pivot_c is not supplied.
rep_c (str or list of str, optional) – One or more column names to replace in df_l. Ignored if rep_map is supplied.
rep_map (dict, optional) – Mapping between left and right column names as {left_col: right_col}.
- Returns:
pd.DataFrame or ParquetStreamReader – Updated data with replacements applied.
- Raises:
TypeError – If df_l or df_r is not a pandas DataFrame.
- If one of pivot_l and pivot_r is not defined.
- If neither rep_map nor rep_c is defined.
- If replacement source columns are not found in df_r.
Notes
This function logs errors and returns None instead of raising exceptions.
cdm_reader_mapper.common.select module¶
Common Data Model (CDM) pandas selection operators.
Created on Wed Jul 3 09:48:18 2019
@author: iregon
- cdm_reader_mapper.common.select.split_by_boolean(data, mask, boolean, reset_index=False, inverse=False, return_rejected=False)[source]¶
Split a DataFrame or an Iterable of DataFrames using a boolean mask via split_dataframe_by_boolean.
- Parameters:
data (pd.DataFrame or iterable of pd.DataFrame) – DataFrame to be split.
mask (pd.DataFrame or iterable of pd.DataFrame) – Boolean mask with the same length as data.
boolean (bool) – Determines mask interpretation:
If True select rows where all mask columns are True.
If False select rows where any mask column is False.
reset_index (bool, optional) – If True reset the index of returned DataFrames.
inverse (bool, optional) – If True invert the selection performed by the underlying function.
return_rejected (bool, optional) – If True return rejected rows as the second output. If False the rejected output is empty but dtype-preserving.
- Return type:
tuple[DataFrame | ParquetStreamReader, DataFrame | ParquetStreamReader, Index | MultiIndex, Index | MultiIndex]
- Returns:
tuple of (pd.DataFrame or ParquetStreamReader, pd.DataFrame or ParquetStreamReader, pd.Index or pd.MultiIndex, pd.Index or pd.MultiIndex) – Selected rows, rejected rows, original indexes of selection and original indexes of rejection.
- cdm_reader_mapper.common.select.split_by_boolean_false(data, mask, reset_index=False, inverse=False, return_rejected=False)[source]¶
Split a DataFrame or an Iterable of DataFrames where the boolean mask is False.
- Parameters:
data (pd.DataFrame or Iterable[pd.DataFrame]) – DataFrame to be split.
mask (pd.DataFrame or Iterable[pd.DataFrame]) – Boolean mask with the same length as data.
reset_index (bool, optional) – If True reset indices in returned DataFrames.
inverse (bool, optional) – If True invert the selection.
return_rejected (bool, optional) – If True return rejected rows as the second output. If False the rejected output is empty but dtype-preserving.
- Return type:
tuple[DataFrame | ParquetStreamReader, DataFrame | ParquetStreamReader, Index | MultiIndex, Index | MultiIndex]
- Returns:
tuple of (pd.DataFrame or ParquetStreamReader, pd.DataFrame or ParquetStreamReader, pd.Index or pd.MultiIndex, pd.Index or pd.MultiIndex) – Selected rows, rejected rows, original indexes of selection and original indexes of rejection.
- cdm_reader_mapper.common.select.split_by_boolean_true(data, mask, reset_index=False, inverse=False, return_rejected=False)[source]¶
Split a DataFrame or an Iterable of DataFrames where the boolean mask is True.
- Parameters:
data (pd.DataFrame or iterable of pd.DataFrame) – DataFrame to be split.
mask (pd.DataFrame or iterable of pd.DataFrame) – Boolean mask with the same length as data.
reset_index (bool, optional) – If True reset indices in returned DataFrames.
inverse (bool, optional) – If True invert the selection.
return_rejected (bool, optional) – If True return rejected rows as the second output. If False the rejected output is empty but dtype-preserving.
- Return type:
tuple[DataFrame | ParquetStreamReader, DataFrame | ParquetStreamReader, Index | MultiIndex, Index | MultiIndex]
- Returns:
tuple of (pd.DataFrame or ParquetStreamReader, pd.DataFrame or ParquetStreamReader, pd.Index or pd.MultiIndex, pd.Index or pd.MultiIndex) – Selected rows (all mask columns True), rejected rows, original indexes of selection and original indexes of rejection.
- cdm_reader_mapper.common.select.split_by_column_entries(data, selection, reset_index=False, inverse=False, return_rejected=False)[source]¶
Split a DataFrame or an Iterable of DataFrames based on matching values in a given column.
- Parameters:
data (pd.DataFrame or iterable of pd.DataFrame) – DataFrame to be split.
selection (dict) – Mapping of a column name to an iterable of allowed values. Example: {“city”: [“London”, “Berlin”]}.
reset_index (bool, optional) – If True reset index in returned DataFrames.
inverse (bool, optional) – If True invert the selection.
return_rejected (bool, optional) – If True return rejected rows as the second output. If False the rejected output is empty but dtype-preserving.
- Return type:
tuple[DataFrame | ParquetStreamReader, DataFrame | ParquetStreamReader, Index | MultiIndex, Index | MultiIndex]
- Returns:
tuple of (pd.DataFrame or ParquetStreamReader, pd.DataFrame or ParquetStreamReader, pd.Index or pd.MultiIndex, pd.Index or pd.MultiIndex) – Selected rows, rejected rows, original indexes of selection and original indexes of rejection.
- cdm_reader_mapper.common.select.split_by_index(data, index, reset_index=False, inverse=False, return_rejected=False)[source]¶
Split a DataFrame or an Iterable of DataFrames by selecting specific index labels.
- Parameters:
data (pd.DataFrame or iterable of pd.DataFrame) – DataFrame to be split.
index (label or sequence of labels) – Index values to select.
reset_index (bool, optional) – If True reset index in returned DataFrames.
inverse (bool, optional) – If True select rows not in index.
return_rejected (bool, optional) – If True return rejected rows as the second output. If False the rejected output is empty but dtype-preserving.
- Return type:
tuple[DataFrame | ParquetStreamReader, DataFrame | ParquetStreamReader, Index | MultiIndex, Index | MultiIndex]
- Returns:
tuple of (pd.DataFrame or ParquetStreamReader, pd.DataFrame or ParquetStreamReader, pd.Index or pd.MultiIndex, pd.Index or pd.MultiIndex) – Selected rows, rejected rows, original indexes of selection and original indexes of rejection.