cdm_reader_mapper.core package¶

Common Data Model core package.

Submodules¶

cdm_reader_mapper.core._utilities module¶

Common Data Model (CDM) DataBundle class.

class cdm_reader_mapper.core._utilities.SubscriptableMethod(func)[source]¶

Bases: object

Allows both method calls and subscript access.

Parameters:: func (Any) – Underlying callable or subscriptable object.
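
The idea of supporting both call and subscript syntax can be sketched as follows. This is a minimal illustration, not the library's implementation: the class name SubscriptableSketch and the helper double are hypothetical, and forwarding the subscript key as a single positional argument is an assumption.

```python
class SubscriptableSketch:
    """Hypothetical stand-in: forwards call and subscript syntax to one function."""

    def __init__(self, func):
        self.func = func

    def __call__(self, *args, **kwargs):
        # obj(a, b) is forwarded as func(a, b)
        return self.func(*args, **kwargs)

    def __getitem__(self, key):
        # obj[key] is forwarded as func(key) (assumed behaviour)
        return self.func(key)


def double(x):
    return 2 * x


doubler = SubscriptableSketch(double)
called = doubler(4)    # call syntax
indexed = doubler[4]   # subscript syntax, same underlying function
```

Both access styles reach the same wrapped function, so callers do not need to know which style a given attribute expects.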

cdm_reader_mapper.core._utilities.combine_attribute_values(first_value, iterator, attr)[source]¶

Collect values of an attribute across all chunks and combine them.

Parameters:

first_value (Any) – The value from the first chunk (already read).
iterator (Iterator/ParquetStreamReader) – The stream positioned at the second chunk.
attr (str) – The attribute name to fetch from remaining chunks.

Return type:

Any

Returns:

Any – Combined attribute values of iterator.

cdm_reader_mapper.core._utilities.method(attr_func, *args, **kwargs)[source]¶

Handle both method calls and subscriptable attributes.

Parameters:

attr_func (Any) – A callable object (e.g., function or method) or a subscriptable object (e.g., list, tuple, dict, or array-like).
*args (Any) – Positional arguments passed to attr_func, or used as the index/key when attr_func is subscriptable.
**kwargs (Any) – Keyword arguments passed to attr_func. Ignored if attr_func is not callable.

Return type:

Any

Returns:

Any – The result of calling attr_func(*args, **kwargs) if it is callable, or the result of attr_func[args] if it is subscriptable.

Raises:

ValueError – If attr_func is neither callable nor subscriptable, or if indexing fails due to an invalid key or index.
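
The dispatch rule described above could look roughly like the sketch below. The function name call_or_index is hypothetical, and unwrapping a single positional argument into the key is an assumption made so that plain lists and dicts can be indexed; the library itself may index with the full argument tuple.

```python
def call_or_index(attr_func, *args, **kwargs):
    """Call attr_func if it is callable, otherwise index into it."""
    if callable(attr_func):
        return attr_func(*args, **kwargs)
    if not hasattr(attr_func, "__getitem__"):
        raise ValueError(f"{attr_func!r} is neither callable nor subscriptable.")
    # Assumption: one positional argument is used as the key itself.
    key = args[0] if len(args) == 1 else args
    try:
        return attr_func[key]
    except (KeyError, IndexError, TypeError) as err:
        raise ValueError(f"Invalid key or index: {key!r}") from err


result_call = call_or_index(max, 3, 7)         # callable branch
result_list = call_or_index([10, 20, 30], 1)   # subscript branch, list
result_dict = call_or_index({"a": 1}, "a")     # subscript branch, dict

try:
    call_or_index(42, 0)                       # int: neither callable nor subscriptable
except ValueError as err:
    error_message = str(err)
```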

cdm_reader_mapper.core._utilities.reader_method(data, attr, *args, process_kwargs=None, **kwargs)[source]¶

Handle operations on chunked data (ParquetStreamReader).

Uses process_disk_backed to stream processing without loading into RAM.

Parameters:

data (pd.DataFrame or ParquetStreamReader) – Input data to operate on.
attr (str) – Name of the attribute or method to apply.
*args (Any) – Positional arguments passed to the attribute or method.
process_kwargs (dict, optional) – Additional keyword arguments passed to the streaming processor.
**kwargs (Any) – Keyword arguments passed to the attribute or method. Supports inplace to update db instead of returning a result.

Return type:

ParquetStreamReader | None

Returns:

ParquetStreamReader or None – A new stream with the applied operation.
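
The chunk-wise idea behind this function can be illustrated with a plain generator. This sketch does not use process_disk_backed or ParquetStreamReader and is not disk-backed; the name apply_to_chunks is hypothetical, and plain strings stand in for DataFrame chunks.

```python
from typing import Any, Iterable, Iterator


def apply_to_chunks(chunks: Iterable[Any], attr: str, *args: Any, **kwargs: Any) -> Iterator[Any]:
    """Lazily apply the method named `attr` to each chunk, one chunk at a time."""
    for chunk in chunks:
        yield getattr(chunk, attr)(*args, **kwargs)


# Plain strings stand in for DataFrame chunks in this illustration.
chunks = iter(["north sea", "baltic sea"])
upper_chunks = list(apply_to_chunks(chunks, "upper"))
```

Because the generator only holds one chunk at a time, memory use stays bounded by the chunk size rather than the full dataset.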

cdm_reader_mapper.core.databundle module¶

Common Data Model (CDM) DataBundle class.

class cdm_reader_mapper.core.databundle.DataBundle(data=None, columns=None, dtypes=None, parse_dates=None, encoding=None, mask=None, imodel=None, mode='data')[source]¶

Bases: object

Container for tabular data and associated metadata.

This class wraps either an in-memory pd.DataFrame or a ParquetStreamReader for chunked, disk-backed processing. It provides a unified interface for accessing DataFrame-like attributes and methods, transparently handling streaming data where required.

Parameters:

data (pandas.DataFrame or Iterable[pandas.DataFrame] or ParquetStreamReader, optional) – Input data. If an iterable is provided, it is converted into a ParquetStreamReader for streaming.
columns (pandas.Index or pandas.MultiIndex or list, optional) – Column labels used when initializing empty data.
dtypes (pandas.Series or dict, optional) – Data types for columns.
parse_dates (list or bool, optional) – Instructions for parsing dates.
encoding (str, optional) – Encoding associated with the data.
mask (pandas.DataFrame or Iterable[pandas.DataFrame] or ParquetStreamReader, optional) – Boolean mask aligned with data. If not provided, an empty mask is created.
imodel (str, optional) – Name of the input data model.
mode ({"data", "tables"}, default "data") – Data representation mode.

Examples

Getting a DataBundle while reading data from disk.

>>> from cdm_reader_mapper import read_mdf
>>> db = read_mdf(source="file_on_disk", imodel="custom_model_name")

Constructing a DataBundle from already read MDF data.

>>> from cdm_reader_mapper import DataBundle
>>> read = read_mdf(source="file_on_disk", imodel="custom_model_name")
>>> data_ = read.data
>>> mask_ = read.mask
>>> db = DataBundle(data=data_, mask=mask_)

Constructing a DataBundle from already read CDM data.

>>> from cdm_reader_mapper import read_tables
>>> tables = read_tables("path_to_files").data
>>> db = DataBundle(data=tables, mode="tables")

add(addition, inplace=False)[source]¶

Add information to a DataBundle.

Parameters:

addition (dict) – Additional elements to add to the DataBundle.
inplace (bool, default: False) – If True add datasets to the DataBundle in place, else return a copy of the DataBundle with the added datasets.

Return type:

DataBundle | None

Returns:

DataBundle or None – DataBundle with added information or None if inplace=True.

Examples

>>> tables = read_tables("path_to_files")
>>> db = db.add({"data": tables})
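
The inplace convention used by add and by most other DataBundle methods can be sketched with a small stand-in class. BundleSketch is hypothetical and only mimics the return behaviour described above: a modified copy when inplace=False, None when inplace=True.

```python
import copy


class BundleSketch:
    """Hypothetical stand-in that only demonstrates the inplace convention."""

    def __init__(self, **datasets):
        self.datasets = dict(datasets)

    def add(self, addition, inplace=False):
        # Work on self when inplace=True, otherwise on a deep copy.
        target = self if inplace else copy.deepcopy(self)
        target.datasets.update(addition)
        return None if inplace else target


original = BundleSketch(data=[1, 2, 3])

extended = original.add({"mask": [True, True, False]})
unchanged = "mask" not in original.datasets   # the original is left untouched

returned = original.add({"mask": [True, True, False]}, inplace=True)
```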

property columns: Index | MultiIndex¶

Column labels of data.

Returns:: pd.Index or pd.MultiIndex – Column labels of the underlying MDF data.

copy()[source]¶

Make a deep copy of a DataBundle.

Return type:: DataBundle
Returns:: DataBundle – Copy of a DataBundle.

Examples

>>> db2 = db.copy()

correct_datetime(imodel=None, inplace=False, **kwargs)[source]¶

Correct datetime information in data.

Parameters:

imodel (str, optional) – Name of the MDF/CDM data model.
inplace (bool, default: False) – If True overwrite data in DataBundle else return a copy of DataBundle with datetime-corrected values in data.
**kwargs (Any) – Additional keyword-arguments for correcting datetime.

Return type:

DataBundle | None

Returns:

DataBundle or None – DataBundle with corrected datetime information or None if inplace=True.

See also

DataBundle.correct_pt: Correct platform type information in data.
DataBundle.validate_datetime: Validate datetime information in data.
DataBundle.validate_id: Validate station id information in data.

Notes

For more information see correct_datetime()

Examples

>>> df_dt = db.correct_datetime()

correct_pt(imodel=None, inplace=False, **kwargs)[source]¶

Correct platform type information in data.

Parameters:

imodel (str, optional) – Name of the MDF/CDM data model.
inplace (bool, default: False) – If True overwrite data in DataBundle else return a copy of DataBundle with platform-corrected values in data.
**kwargs (Any) – Additional keyword-arguments for correcting platform type.

Return type:

DataBundle | None

Returns:

DataBundle or None – DataBundle with corrected platform type information or None if inplace=True.

See also

DataBundle.correct_datetime: Correct datetime information in data.
DataBundle.validate_id: Validate station id information in data.
DataBundle.validate_datetime: Validate datetime information in data.

Notes

For more information see correct_pt()

Examples

>>> df_pt = db.correct_pt()

property data: DataFrame | ParquetStreamReader¶

Underlying MDF data.

Returns:: pd.DataFrame or ParquetStreamReader – Underlying MDF data.

property dtypes: Series | dict[str, Any] | None¶

Dictionary of data types on data.

Returns:: pd.Series or dict or None – Data types of underlying MDF data.

property encoding: str | None¶

A string representing the encoding to use in the data.

Returns:: str or None – String representing the encoding to use in the underlying MDF data.

See also

pd.to_csv(): Write data with encoding to CSV file.

property imodel: str | None¶

Name of the MDF/CDM input model.

Returns:: str or None – Name of the MDF/CDM input model if available.

map_model(imodel=None, inplace=False, **kwargs)[source]¶

Map data to the Common Data Model.

Parameters:

imodel (str, optional) – Name of the MDF/CDM data model.
inplace (bool, default: False) – If True overwrite data in DataBundle else return a copy of DataBundle with data as CDM tables.
**kwargs (Any) – Additional keyword-arguments for mapping to CDM.

Return type:

DataBundle | None

Returns:

DataBundle or None – DataBundle containing data mapped to the CDM or None if inplace=True.

Notes

For more information see map_model()

Examples

>>> cdm_tables = db.map_model()

property mask: DataFrame | ParquetStreamReader¶

MDF validation mask.

Returns:: pd.DataFrame or ParquetStreamReader – Validation mask of the underlying MDF data.

property mode: str¶

Data mode.

Returns:: str – Current data mode.
Raises:: TypeError – If mode of the underlying data is not a string.

property parse_dates: list[Any] | bool | None¶

Information of how to parse dates in data.

Returns:: list or bool or None – Information of how to parse dates in underlying MDF data.

See also

pd.read_csv(): Read CSV file using pandas.

replace_columns(df_corr, subset=None, inplace=False, **kwargs)[source]¶

Replace columns in data.

Parameters:

df_corr (pd.DataFrame) – Data containing the replacement columns.
subset (str or list of str, optional) – Select subset by columns. This option is useful for multi-indexed data.
inplace (bool, default: False) – If True overwrite data in DataBundle else return a copy of DataBundle with replaced column names in data.
**kwargs (Any) – Additional keyword-arguments for replacing columns.

Return type:

DataBundle | None

Returns:

DataBundle or None – DataBundle with replaced column names or None if inplace=True.

Notes

For more information see replace_columns()

Examples

>>> import pandas as pd
>>> df_corr = pd.read_csv("correction_file_on_disk")
>>> df_repl = db.replace_columns(df_corr)

select_where_all_false(inplace=False, do_mask=True, **kwargs)[source]¶

Select rows from data where all column entries in mask are False.

Parameters:

inplace (bool, default: False) – If True overwrite data in DataBundle else return a copy of DataBundle with invalid values only in data.
do_mask (bool, default: True) – If True also do selection on mask.
**kwargs (Any) – Additional keyword-arguments for splitting data where all entries are False.

Return type:

DataBundle | None

Returns:

DataBundle or None – DataBundle containing rows where all column entries in mask are False or None if inplace=True.

See also

DataBundle.select_where_all_true: Select rows from data where all entries in mask are True.
DataBundle.select_where_entry_isin: Select rows from data where column entries are in a specific value list.
DataBundle.select_where_index_isin: Select rows from data within specific index list.

Notes

For more information see split_by_boolean_false()

Examples

Select without overwriting the old data.

>>> db_selected = db.select_where_all_false()

Select with overwriting the old data.

>>> db.select_where_all_false(inplace=True)
>>> df_selected = db.data

select_where_all_true(inplace=False, do_mask=True, **kwargs)[source]¶

Select rows from data where all column entries in mask are True.

Parameters:

inplace (bool, default: False) – If True overwrite data in DataBundle else return a copy of DataBundle with valid values only in data.
do_mask (bool, default: True) – If True also do selection on mask.
**kwargs (Any) – Additional keyword-arguments for splitting data where all entries are True.

Return type:

DataBundle | None

Returns:

DataBundle or None – DataBundle containing rows where all column entries in mask are True or None if inplace=True.

See also

DataBundle.select_where_all_false: Select rows from data where all entries in mask are False.
DataBundle.select_where_entry_isin: Select rows from data where column entries are in a specific value list.
DataBundle.select_where_index_isin: Select rows from data within specific index list.

Notes

For more information see split_by_boolean_true()

Examples

Select without overwriting the old data.

>>> db_selected = db.select_where_all_true()

Select with overwriting the old data.

>>> db.select_where_all_true(inplace=True)
>>> df_selected = db.data

select_where_entry_isin(selection, inplace=False, do_mask=True, **kwargs)[source]¶

Select rows from data where column entries are in a specific value list.

Parameters:

selection (dict) – Keys: Column names in data. Values: Specific value list.
inplace (bool, default: False) – If True overwrite data in DataBundle else return a copy of DataBundle with selected rows only in data.
do_mask (bool, default: True) – If True also do selection on mask.
**kwargs (Any) – Additional keyword-arguments for splitting data where entries are within a specific value list.

Return type:

DataBundle | None

Returns:

DataBundle or None – DataBundle containing rows where column entries are in a specific value list or None if inplace=True.

See also

DataBundle.select_where_index_isin: Select rows from data within specific index list.
DataBundle.select_where_all_true: Select rows from data where all entries in mask are True.
DataBundle.select_where_all_false: Select rows from data where all entries in mask are False.

Notes

For more information see split_by_column_entries()

Examples

Select without overwriting the old data.

>>> db_selected = db.select_where_entry_isin(
...     selection={("c1", "B1"): [26, 41]},
... )

Select with overwriting the old data.

>>> db.select_where_entry_isin(selection={("c1", "B1"): [26, 41]}, inplace=True)
>>> df_selected = db.data

select_where_index_isin(index, inplace=False, do_mask=True, **kwargs)[source]¶

Select rows from data where indexes are within a specific index list.

Parameters:

index (list of int) – Specific index list.
inplace (bool, default: False) – If True overwrite data in DataBundle else return a copy of DataBundle with selected rows only in data.
do_mask (bool, default: True) – If True also do selection on mask.
**kwargs (Any) – Additional keyword-arguments for splitting data where indexes are within a specific index list.

Return type:

DataBundle | None

Returns:

DataBundle or None – DataBundle containing rows where indexes are within a specific index list or None if inplace=True.

See also

DataBundle.select_where_entry_isin: Select rows from data where column entries are in a specific value list.
DataBundle.select_where_all_true: Select rows from data where all entries in mask are True.
DataBundle.select_where_all_false: Select rows from data where all entries in mask are False.

Notes

For more information see split_by_index()

Examples

Select without overwriting the old data.

>>> db_selected = db.select_where_index_isin([0, 2, 4])

Select with overwriting the old data.

>>> db.select_where_index_isin(index=[0, 2, 4], inplace=True)
>>> df_selected = db.data

split_by_boolean_false(do_mask=True, **kwargs)[source]¶

Split data by rows where all column entries in mask are False.

Parameters:

do_mask (bool, default: True) – If True also do selection on mask.
**kwargs (Any) – Additional keyword-arguments for splitting data where mask is False.

Return type:

tuple[DataBundle, DataBundle]

Returns:

tuple – First DataBundle including rows where all column entries in mask are False. Second DataBundle including rows where all column entries in mask are True.

See also

DataBundle.split_by_boolean_true: Split data by rows where all entries in mask are True.
DataBundle.split_by_column_entries: Split data by rows where column entries are in a specific value list.
DataBundle.split_by_index: Split data by rows within specific index list.

Notes

For more information see split_by_boolean_false()

Examples

Split DataBundle.

>>> db_false, db_true = db.split_by_boolean_false()

split_by_boolean_true(do_mask=True, **kwargs)[source]¶

Split data by rows where all column entries in mask are True.

Parameters:

do_mask (bool, default: True) – If True also do selection on mask.
**kwargs (Any) – Additional keyword-arguments for splitting data where mask is True.

Return type:

tuple[DataBundle, DataBundle]

Returns:

tuple – First DataBundle including rows where all column entries in mask are True. Second DataBundle including rows where all column entries in mask are False.

See also

DataBundle.split_by_boolean_false: Split data by rows where all entries in mask are False.
DataBundle.split_by_column_entries: Split data by rows where column entries are in a specific value list.
DataBundle.split_by_index: Split data by rows within specific index list.

Notes

For more information see split_by_boolean_true()

Examples

Split DataBundle.

>>> db_true, db_false = db.split_by_boolean_true()
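
The mask-based split can be illustrated without pandas. In this sketch, lists of dicts stand in for the data and mask DataFrames, the name split_all_true is hypothetical, and placing every remaining row in the second result is an assumption.

```python
def split_all_true(rows, mask):
    """Split rows by whether every mask entry of the row is True."""
    selected, rejected = [], []
    for row, flags in zip(rows, mask):
        # A row counts as valid only if all of its mask entries are True.
        (selected if all(flags.values()) else rejected).append(row)
    return selected, rejected


rows = [
    {"id": "A", "sst": 14.2},
    {"id": "B", "sst": 99.9},
    {"id": "C", "sst": 15.1},
]
mask = [
    {"id": True, "sst": True},
    {"id": True, "sst": False},   # one failed check rejects the whole row
    {"id": True, "sst": True},
]

rows_true, rows_rest = split_all_true(rows, mask)
```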

split_by_column_entries(selection, do_mask=True, **kwargs)[source]¶

Split data by rows where column entries are in a specific value list.

Parameters:

selection (dict) – Keys: Column names in data. Values: Specific value list.
do_mask (bool, default: True) – If True also do selection on mask.
**kwargs (Any) – Additional keyword-arguments for splitting data by column entries.

Return type:

tuple[DataBundle, DataBundle]

Returns:

tuple – First DataBundle including rows where column entries are in a specific value list. Second DataBundle including rows where column entries are not in a specific value list.

See also

DataBundle.split_by_index: Split data by rows within specific index list.
DataBundle.split_by_boolean_true: Split data by rows where all entries in mask are True.
DataBundle.split_by_boolean_false: Split data by rows where all entries in mask are False.

Notes

For more information see split_by_column_entries()

Examples

Split DataBundle.

>>> db_isin, db_isnotin = db.split_by_column_entries(
...     selection={("c1", "B1"): [26, 41]},
... )
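
The selection dictionary maps column names to lists of accepted values. A pandas-free sketch of that rule is shown below. The name split_by_entries is hypothetical, lists of dicts stand in for the DataFrame, and combining several keys with a logical AND is an assumption.

```python
def split_by_entries(rows, selection):
    """Split rows by whether each selected column holds one of the listed values."""
    isin, isnotin = [], []
    for row in rows:
        # Assumption: a row matches only if every selected column matches.
        match = all(row[col] in values for col, values in selection.items())
        (isin if match else isnotin).append(row)
    return isin, isnotin


# Tuple keys mimic multi-indexed column labels such as ("c1", "B1").
rows = [
    {("c1", "B1"): 26},
    {("c1", "B1"): 30},
    {("c1", "B1"): 41},
]

rows_isin, rows_isnotin = split_by_entries(rows, {("c1", "B1"): [26, 41]})
```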

split_by_index(index, do_mask=True, **kwargs)[source]¶

Split data by rows within specific index list.

Parameters:

index (list of int) – Specific index list.
do_mask (bool, default: True) – If True also do selection on mask.
**kwargs (Any) – Additional keyword-arguments for splitting data by index.

Return type:

tuple[DataBundle, DataBundle]

Returns:

tuple – First DataBundle including rows within specific index list. Second DataBundle including rows outside specific index list.

See also

DataBundle.split_by_column_entries: Split data by rows where column entries are in a specific value list.
DataBundle.split_by_boolean_true: Split data by rows where all entries in mask are True.
DataBundle.split_by_boolean_false: Split data by rows where all entries in mask are False.

Notes

For more information see split_by_index()

Examples

Split DataBundle.

>>> db_isin, db_isnotin = db.split_by_index([0, 2, 4])

stack_h(other, datasets=('data', 'mask'), inplace=False, **kwargs)[source]¶

Stack multiple DataBundles horizontally.

Parameters:

other (DataBundle or Sequence of DataBundle) – Other DataBundle or sequence of DataBundles to stack horizontally.
datasets (str or Sequence of str, default: ("data", "mask")) – Datasets to be stacked.
inplace (bool, default: False) – If True overwrite datasets in DataBundle else return a copy of DataBundle with stacked datasets.
**kwargs (Any) – Additional keyword-arguments for stacking DataFrames horizontally.

Return type:

DataBundle | None

Returns:

DataBundle or None – Horizontally stacked DataBundle or None if inplace=True.

See also

DataBundle.stack_v: Stack multiple DataBundles vertically.

Notes

This only works with pd.DataFrames, not with iterables of pd.DataFrames.
The DataFrames in the DataBundles may have different data columns.

Examples

>>> db = db1.stack_h(db2, datasets=["data", "mask"])

stack_v(other, datasets=('data', 'mask'), inplace=False, **kwargs)[source]¶

Stack multiple DataBundles vertically.

Parameters:

other (DataBundle or Sequence of DataBundle) – Other DataBundle or sequence of DataBundles to stack vertically.
datasets (str or Sequence of str, default: ("data", "mask")) – Datasets to be stacked.
inplace (bool, default: False) – If True overwrite datasets in DataBundle else return a copy of DataBundle with stacked datasets.
**kwargs (Any) – Additional keyword-arguments for stacking DataFrames vertically.

Return type:

DataBundle | None

Returns:

DataBundle or None – Vertically stacked DataBundle or None if inplace=True.

See also

DataBundle.stack_h: Stack multiple DataBundles horizontally.

Notes

This only works with pd.DataFrames, not with iterables of pd.DataFrames.
The DataFrames in the DataBundles must have the same data columns.

Examples

>>> db = db1.stack_v(db2, datasets=["data", "mask"])
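
The difference between vertical and horizontal stacking can be sketched with lists of dicts standing in for DataFrames. Both function names are hypothetical, and the horizontal variant assumes that all inputs have the same number of rows in the same order.

```python
def stack_vertically(*tables):
    """Append the rows of all tables; the columns are expected to match."""
    return [row for table in tables for row in table]


def stack_horizontally(*tables):
    """Merge the columns of row-aligned tables; the columns may differ."""
    stacked = []
    for aligned_rows in zip(*tables):
        merged = {}
        for row in aligned_rows:
            merged.update(row)
        stacked.append(merged)
    return stacked


data_a = [{"id": "A", "sst": 14.2}]
data_b = [{"id": "B", "sst": 15.1}]
extra = [{"wind": 7.0}]

vertical = stack_vertically(data_a, data_b)     # two rows, same columns
horizontal = stack_horizontally(data_a, extra)  # one row, more columns
```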

unique(**kwargs)[source]¶

Get unique values of data.

Parameters:: **kwargs (Any) – Additional keyword-arguments for getting unique values.
Return type:: dict[str | tuple[str, str], dict[Any, int]]
Returns:: dict – Dictionary with unique values.

Notes

For more information see unique()

Examples

>>> db.unique(columns=("c1", "B1"))
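
The documented return type, a dictionary that maps each column to a dictionary of values and integers, suggests a per-column value count. A minimal sketch of such a structure, assuming the integers are occurrence counts, is given below. The name count_unique is hypothetical and lists of dicts stand in for the DataFrame.

```python
from collections import Counter


def count_unique(rows, columns):
    """Return {column: {value: number of occurrences}} for the given columns."""
    return {col: dict(Counter(row[col] for row in rows)) for col in columns}


rows = [
    {("c1", "B1"): 26},
    {("c1", "B1"): 41},
    {("c1", "B1"): 26},
]

counts = count_unique(rows, columns=[("c1", "B1")])
```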

validate_datetime(imodel=None, **kwargs)[source]¶

Validate datetime information in data.

Parameters:

imodel (str, optional) – Name of the MDF/CDM data model.
**kwargs (Any) – Additional keyword-arguments for validating datetime.

Return type:

DataFrame

Returns:

pd.DataFrame – DataFrame containing True and False values for each index in data. True: All datetime information in the data row is valid. False: At least one piece of datetime information in the data row is invalid.

See also

DataBundle.validate_id: Validate station id information in data.
DataBundle.correct_datetime: Correct datetime information in data.
DataBundle.correct_pt: Correct platform type information in data.

Notes

For more information see validate_datetime()

Examples

>>> val_dt = db.validate_datetime()

validate_id(imodel=None, **kwargs)[source]¶

Validate station id information in data.

Parameters:

imodel (str, optional) – Name of the MDF/CDM data model.
**kwargs (Any) – Additional keyword-arguments for validating station id.

Return type:

DataFrame

Returns:

pd.DataFrame – DataFrame containing True and False values for each index in data. True: All station ID information in the data row is valid. False: At least one piece of station ID information in the data row is invalid.

See also

DataBundle.validate_datetime: Validate datetime information in data.
DataBundle.correct_pt: Correct platform type information in data.
DataBundle.correct_datetime: Correct datetime information in data.

Notes

For more information see validate_id()

Examples

>>> val_id = db.validate_id()

write(dtypes=None, parse_dates=None, encoding=None, mode=None, **kwargs)[source]¶

Write data to disk.

Parameters:

dtypes (dict, optional) – Data types of data.
parse_dates (list or bool, optional) – Information on how to parse dates in data.
encoding (str, optional) – The encoding of the input file. Overrides the value in the imodel schema file.
mode ({data, tables}, optional) – Data mode.
**kwargs (Any) – Additional keyword-arguments for writing data to disk.

See also

write_data: Write MDF data and validation mask to disk.
write_tables: Write CDM tables to disk.
read: Read original marine-meteorological data as well as MDF data or CDM tables from disk.
read_data: Read MDF data and validation mask from disk.
read_mdf: Read original marine-meteorological data from disk.
read_tables: Read CDM tables from disk.

Return type:: None

Notes

If mode is “data” write data using write_data(). If mode is “tables” write data using write_tables().

Examples

>>> db.write()

cdm_reader_mapper.core.reader module¶

Common Data Model (CDM) DataBundle class.

cdm_reader_mapper.core.reader.read(source, mode='mdf', **kwargs)[source]¶

Read either original marine-meteorological data or MDF data or CDM tables from disk.

Parameters:

source (str) – Source of the input data.
mode (str, {mdf, data, tables}, default: mdf) –

Read data mode:
- “mdf” to read original marine-meteorological data from disk and convert them to MDF data
- “data” to read MDF data from disk
- “tables” to read CDM tables from disk. Map MDF data to CDM tables with DataBundle.map_model().
**kwargs (Any) – Additional keyword-arguments passed to reader function.

Return type:

DataBundle

Returns:

DataBundle – Containing read data as pd.DataFrame or Iterable of pd.DataFrames.

See also

read_mdf: Read original marine-meteorological data from disk.
read_data: Read MDF data and validation mask from disk.
read_tables: Read CDM tables from disk.
write: Write either MDF data or CDM tables to disk.
write_data: Write MDF data and validation mask to disk.
write_tables: Write CDM tables to disk.

Notes

kwargs are the keyword arguments for the specific mode reader.
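
Mode-based dispatch of this kind is commonly written as a lookup table. The sketch below is an illustration only: the stub readers replace the real read_mdf, read_data and read_tables, the names make_stub, READERS and read_sketch are hypothetical, example_option is a made-up keyword argument, and raising ValueError for an unknown mode is an assumption.

```python
def make_stub(label):
    """Build a stub reader that only records how it was called."""
    def reader(source, **kwargs):
        return {"reader": label, "source": source, "kwargs": kwargs}
    return reader


# Each mode maps to the reader that handles it (stubs in this sketch).
READERS = {
    "mdf": make_stub("read_mdf"),
    "data": make_stub("read_data"),
    "tables": make_stub("read_tables"),
}


def read_sketch(source, mode="mdf", **kwargs):
    if mode not in READERS:
        raise ValueError(f"Unknown mode {mode!r}; expected one of {sorted(READERS)}.")
    # All remaining keyword arguments go to the mode-specific reader.
    return READERS[mode](source, **kwargs)


default_result = read_sketch("file_on_disk")
tables_result = read_sketch("path_to_files", mode="tables", example_option=1)
```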

cdm_reader_mapper.core.writer module¶

Common Data Model (CDM) DataBundle class.

cdm_reader_mapper.core.writer.write(data, mode='data', **kwargs)[source]¶

Write either MDF data or CDM tables to disk.

Parameters:

data (pandas.DataFrame or Iterable[pd.DataFrame]) – Data to export.
mode (str, {data, tables}, default: data) –

Write data mode:
- “data” to write MDF data to disk
- “tables” to write CDM tables to disk. Map MDF data to CDM tables with DataBundle.map_model().
**kwargs (Any) – Additional keyword-arguments used to write data to disk.

See also

write_data: Write MDF data and validation mask to disk.
write_tables: Write CDM tables to disk.
read: Read either original marine-meteorological data or MDF data or CDM tables from disk.
read_mdf: Read original marine-meteorological data from disk.
read_data: Read MDF data and validation mask from disk.
read_tables: Read CDM tables from disk.

Return type:: None

Notes

kwargs are the keyword arguments for the specific mode writer.