cdm_reader_mapper.mdf_reader.utils package¶

Common Data Model (CDM) reader utilities.

Submodules¶

cdm_reader_mapper.mdf_reader.utils.convert_and_decode module¶

Internal pandas converting operators.

class cdm_reader_mapper.mdf_reader.utils.convert_and_decode.Converters(dtype)[source]¶

Bases: object

Registry-based converter for pandas Series.

Converts object-typed Series into numeric, datetime, or cleaned object representations based on the configured dtype.

Parameters: dtype (str) – Target output dtype identifier.

converter()[source]¶

Return the converter function registered for the configured dtype.

Return type: Callable[..., Series]
Returns: Callable – Converter function.
Raises: KeyError – If no converter is registered for the dtype.

object_to_datetime(data, datetime_format='%Y%m%d')[source]¶

Convert object Series to pandas datetime.

Invalid values are coerced to NaT.

Parameters:

data (pd.Series) – Object-typed Series.
datetime_format (str, default "%Y%m%d") – Datetime parsing format.

Return type:

Series

Returns:

pd.Series – Datetime Series.

object_to_numeric(data, scale=None, offset=None)[source]¶

Convert object Series to numeric using Decimal arithmetic.

Trailing (right-hand) spaces are treated as zeros.
An optional scale and offset may be applied.
Boolean values are preserved.
Invalid conversions return False.

Parameters:

data (pd.Series) – Object-typed Series.
scale (numeric, optional) – Scale factor.
offset (numeric, optional) – Offset value.

Return type:

Series

Returns:

pd.Series – Converted Series.

object_to_object(data, disable_white_strip=False)[source]¶

Clean object Series by stripping whitespace and nullifying empty strings.

Parameters:

data (pd.Series) – Object-typed Series.
disable_white_strip (bool or {"l", "r"}, default False) – Control whitespace stripping behavior.

Return type:

Series

Returns:

pd.Series – Cleaned Series.

class cdm_reader_mapper.mdf_reader.utils.convert_and_decode.Decoders(dtype, encoding='base36')[source]¶

Bases: object

Registry-based decoder dispatcher for column-wise decoding.

Currently supports Base36 decoding for numeric-like fields.

Parameters:

dtype (str) – Target data type name (e.g. numeric field type).
encoding (str, default "base36") – Encoding scheme to use.

base36(data)[source]¶

Decode a pandas Series from Base36 to stringified base-10 integers.

Boolean values are preserved. Invalid values raise ValueError via int(…, 36).

Parameters: data (pd.Series) – Input Series containing base36-encoded values.
Return type: Series
Returns: pd.Series – Decoded Series with stringified integers or booleans.
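
The decoding rule can be illustrated without pandas. The following is a minimal sketch, not the library's implementation: decode_base36 is a hypothetical helper that works on a plain list rather than a Series.

```python
def decode_base36(values):
    """Decode base36 strings to stringified base-10 integers."""
    decoded = []
    for value in values:
        if isinstance(value, bool):
            # Booleans pass through untouched.
            decoded.append(value)
        else:
            # int(..., 36) raises ValueError for characters outside 0-9 and a-z.
            decoded.append(str(int(str(value), 36)))
    return decoded


result = decode_base36(["A", "10", "ZZ", True])
print(result)
```

Here "ZZ" decodes to 35 * 36 + 35, and the trailing boolean is kept as-is.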

decoder()[source]¶

Return the decoder function for the configured dtype and encoding.

Return type: Callable[[Series], Series] | None
Returns: Callable or None – Decoder function accepting a pandas Series, or None if encoding is unsupported.
Raises: KeyError – If no decoder is registered for the given dtype.
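
The lookup behavior described above (None for an unsupported encoding, KeyError for an unregistered dtype) can be sketched as a nested dictionary registry. This is an illustration under assumptions, not the library's code: REGISTRY, get_decoder, and the dtype names "int" and "float" are hypothetical.

```python
def decode_base36(values):
    """Decode a list of base36 strings to stringified base-10 integers."""
    return [str(int(value, 36)) for value in values]


# Hypothetical registry layout: encoding -> dtype -> decoder function.
REGISTRY = {
    "base36": {"int": decode_base36, "float": decode_base36},
}


def get_decoder(dtype, encoding="base36"):
    by_dtype = REGISTRY.get(encoding)
    if by_dtype is None:
        # Unsupported encoding: signal "nothing to do" instead of failing.
        return None
    # An unregistered dtype raises KeyError.
    return by_dtype[dtype]


decoder = get_decoder("int")
print(decoder(["Z"]))
```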

cdm_reader_mapper.mdf_reader.utils.convert_and_decode.convert_and_decode(data, convert_flag=True, decode_flag=True, converter_dict=None, converter_kwargs=None, decoder_dict=None)[source]¶

Convert and decode data entries by using a pre-defined data model.

Overwrite attribute data with converted and/or decoded data.

Parameters:

data (pd.DataFrame) – Data to convert and decode.
convert_flag (bool, default True) – If True, apply converters to the columns defined in converter_dict.
decode_flag (bool, default True) – If True, apply decoders to the columns defined in decoder_dict.
converter_dict (dict[str, callable], optional) – Column-specific converter functions. If None, defaults to empty dict.
converter_kwargs (dict[str, dict], optional) – Keyword arguments for each converter function.
decoder_dict (dict[str, callable], optional) – Column-specific decoder functions. If None, defaults to empty dict.

Return type:

DataFrame

Returns:

pd.DataFrame – DataFrame with converted and decoded columns.

cdm_reader_mapper.mdf_reader.utils.convert_and_decode.max_decimal_places(*decimals)[source]¶

Return the maximum number of decimal places among Decimal values.

Parameters: *decimals (Decimal) – One or more Decimal values.
Return type: int
Returns: int – Maximum number of decimal places.
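
One way to count decimal places is through the exponent reported by Decimal.as_tuple(). The sketch below assumes finite Decimal inputs and treats non-negative exponents as zero decimal places; max_places is a hypothetical name, and the library may handle edge cases differently.

```python
from decimal import Decimal


def max_places(*decimals):
    """Return the largest number of decimal places among finite Decimals."""
    # The exponent is -n for a value with n decimal places.
    # Values such as Decimal("1E+2") have a positive exponent and count as 0.
    return max((max(0, -d.as_tuple().exponent) for d in decimals), default=0)


places = max_places(Decimal("1.5"), Decimal("0.125"), Decimal("3"))
print(places)
```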

cdm_reader_mapper.mdf_reader.utils.convert_and_decode.to_numeric(x, scale, offset)[source]¶

Convert a value to a scaled Decimal with offset applied.

Parameters:

x (Any) – Input value to convert.
scale (Decimal) – Scale factor.
offset (Decimal) – Offset value.

Return type:

Decimal | bool

Returns:

Decimal | bool – Converted Decimal value, boolean, or False if invalid.

Notes

Boolean values are returned unchanged
Empty or invalid values return False
Strings are stripped and spaces replaced with zeros
Result is quantized to the maximum decimal precision of input, scale, or offset
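
The notes above can be sketched as follows. This is a simplified illustration, not the library's implementation: to_numeric_sketch and places are hypothetical names, and the order of operations (value * scale + offset) is an assumption, since the text only states that a scale and an offset are applied.

```python
from decimal import Decimal, InvalidOperation


def places(d):
    """Number of decimal places of a finite Decimal."""
    return max(0, -d.as_tuple().exponent)


def to_numeric_sketch(x, scale, offset):
    if isinstance(x, bool):
        return x  # booleans are returned unchanged
    if isinstance(x, str):
        # Strip surrounding whitespace, then treat remaining spaces as zeros.
        x = x.strip().replace(" ", "0")
    if x is None or x == "":
        return False
    try:
        value = Decimal(str(x))
    except InvalidOperation:
        return False
    if not value.is_finite():
        return False
    # Quantize to the finest precision among input, scale, and offset.
    n = max(places(value), places(scale), places(offset))
    return (value * scale + offset).quantize(Decimal(1).scaleb(-n))


print(to_numeric_sketch("1 5", Decimal("0.1"), Decimal("0")))
```

In this sketch "1 5" is read as 105, scaled by 0.1, and quantized to one decimal place.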

cdm_reader_mapper.mdf_reader.utils.filereader module¶

Auxiliary functions and class for reading, converting, decoding and validating MDF files.

class cdm_reader_mapper.mdf_reader.utils.filereader.FileReader(imodel, ext_schema_path=None, ext_schema_file=None)[source]¶

Bases: object

Class to read marine-meteorological data.

Provides a high-level interface to read, parse, filter, convert, decode, and validate data from multiple sources (FWF, CSV, NetCDF).

Parameters:

imodel (str, optional) – Name of the data model (e.g., ‘ICOADS’).
ext_schema_path (str or Path, optional) – Directory of external MDF schema file.
ext_schema_file (str or Path, optional) – Path to external MDF schema file.

open_data(source, open_with='pandas', pd_kwargs=None, xr_kwargs=None, convert_kwargs=None, decode_kwargs=None, validate_kwargs=None, select_kwargs=None)[source]¶

Open and parse source data according to parser configuration.

Parameters:

source (str) – Path or pattern for input file(s).
open_with (str) – Parser backend: ‘pandas’ or ‘netcdf’.
pd_kwargs (dict, optional) – Additional keyword arguments for parsing pandas-readable data.
xr_kwargs (dict, optional) – Additional keyword arguments for parsing xarray-readable data.
convert_kwargs (dict, optional) – Additional keyword arguments for data conversion.
decode_kwargs (dict, optional) – Additional keyword arguments for data decoding.
validate_kwargs (dict, optional) – Additional keyword arguments for data validation.
select_kwargs (dict, optional) – Additional keyword arguments for selecting/filtering data.

Return type:

tuple[DataFrame, DataFrame, ParserConfig] | tuple[Iterable[DataFrame], Iterable[DataFrame], ParserConfig]

Returns:

tuple – (data, mask, config) or chunked equivalents if using Iterable[pd.DataFrame].

read(source, pd_kwargs=None, xr_kwargs=None, convert_kwargs=None, decode_kwargs=None, validate_kwargs=None, select_kwargs=None)[source]¶

Read and process data from the given source.

Parameters:

source (str) – Path to input file(s).
pd_kwargs (dict, optional) – Additional keyword arguments for parsing pandas-readable data.
xr_kwargs (dict, optional) – Additional keyword arguments for parsing xarray-readable data.
convert_kwargs (dict, optional) – Additional keyword arguments for data conversion.
decode_kwargs (dict, optional) – Additional keyword arguments for data decoding.
validate_kwargs (dict, optional) – Additional keyword arguments for data validation.
select_kwargs (dict, optional) – Additional keyword arguments for selecting/filtering data.

Return type:

DataBundle

Returns:

DataBundle – Container with processed data, mask, columns, dtypes, and metadata.

Notes

All kwargs are forwarded to open_data to customize the parsing, conversion, decoding, validation, and selection steps.

cdm_reader_mapper.mdf_reader.utils.parser module¶

Auxiliary functions and class for reading, converting, decoding and validating MDF files.

class cdm_reader_mapper.mdf_reader.utils.parser.OrderSpec[source]¶

Bases: TypedDict

Parsing specification for a single section.

Defines the header configuration, element layout, and parsing mode (fixed-width or delimited) for a section.

elements: dict[str, dict[str, Any]]¶

header: dict[str, Any]¶

is_delimited: bool¶

class cdm_reader_mapper.mdf_reader.utils.parser.ParserConfig(order_specs, disable_reads, dtypes, parse_dates, convert_decode, validation, encoding, columns=None)[source]¶

Bases: object

Configuration for dataset parsing.

Variables:

order_specs (dict) – Column ordering specifications.
disable_reads (list[str]) – Columns or sources to skip during parsing.
dtypes (dict) – Column data type mappings.
parse_dates (list[str]) – Columns to parse as datetimes.
convert_decode (dict) – Value conversion or decoding rules.
validation (dict) – Validation rules for parsed data.
encoding (str) – Text encoding used when reading input data.
columns (pd.Index or pd.MultiIndex or None, optional) – Explicit column index to apply. If None, inferred from input.

columns: Index | MultiIndex | None = None¶

convert_decode: dict[Any, Any]¶

disable_reads: list[str]¶

dtypes: dict[Any, Any]¶

encoding: str¶

order_specs: dict[str, OrderSpec]¶

parse_dates: list[str]¶

validation: dict[Any, Any]¶

cdm_reader_mapper.mdf_reader.utils.parser.build_parser_config(imodel=None, ext_schema_path=None, ext_schema_file=None)[source]¶

Build a ParserConfig from a normalized schema definition.

This function reads a schema definition and constructs a fully populated ParserConfig instance. The resulting configuration contains parsing order specifications, data types, converters, decoders, validation rules, and encoding information required to parse raw input records.

Parameters:

imodel (str or None, optional) – Internal model identifier used to locate the schema.
ext_schema_path (str or Path, optional) – Path to an external schema directory.
ext_schema_file (str or Path, optional) – Filename of an external schema definition.

Return type:

ParserConfig

Returns:

ParserConfig – Fully initialized parser configuration derived from the schema.

Notes

Section parsing order is derived from schema["header"]["parsing_order"].
Sections marked with disable_read=True are recorded in ParserConfig.disable_reads.
Elements marked as ignored or disabled are excluded from dtype, conversion, and validation setup.
Column indices may be strings or tuples depending on the number of sections in the schema.
Deprecated or aliased column types are normalized via _convert_dtype_to_default.
Converter and decoder functions are resolved dynamically based on column type and encoding.
Validation rules may include value ranges and code tables, as defined in the schema.

cdm_reader_mapper.mdf_reader.utils.parser.parse_netcdf(ds, order_specs, sections=None, excludes=None)[source]¶

Parse an xarray Dataset into a pandas DataFrame based on order specifications.

This function converts an xarray Dataset into a tabular pandas DataFrame according to parsing rules defined in order_specs. Data variables, dimensions, and global attributes are mapped to columns as specified, with ignored or missing elements handled automatically.

Parameters:

ds (xarray.Dataset) – Input Dataset containing data variables, dimensions, and attributes.
order_specs (dict[str, OrderSpec]) – Mapping of section names to parsing specifications. Each specification defines the header configuration, element layout, and parsing mode for a section.
sections (iterable of str or None) – Section names to include. If None, all sections are parsed.
excludes (iterable of str or None) – Section names to exclude from parsing.

Return type:

DataFrame

Returns:

pandas.DataFrame – DataFrame constructed from the Dataset according to the parsing specification. Columns are derived from element indices. Missing fields are filled with False, disabled sections with NaN, and empty strings are converted to True.

Notes

Variables, dimensions, and global attributes in ds are mapped to columns according to the element index.
Ignored elements (ignore=True) are skipped.
Disabled sections (disable_read=True) are added as columns filled with NaN.
Missing elements are added as columns filled with False.
Object-type columns are decoded from UTF-8, stripped, and empty strings replaced with True.

Examples

Example order_specs structure:

order_specs = {
    "global_attributes": {
        "header": {
            "disable_read": True,
        },
        "elements": {
            "title": {
                "index": ("global_attributes", "title"),
                "ignore": False,
                "column_type": "str",
                "missing_value": None,
            },
            "institution": {
                "index": ("global_attributes", "institution"),
                "ignore": False,
                "column_type": "str",
                "missing_value": None,
            },
        },
        "is_delimited": False,
    }
}

cdm_reader_mapper.mdf_reader.utils.parser.parse_pandas(df, order_specs, sections=None, excludes=None)[source]¶

Parse a pandas DataFrame containing raw record lines.

Each row of the input DataFrame is expected to contain a single fixed-width or delimiter-separated record, which is parsed according to the provided order specifications.

Parameters:

df (pandas.DataFrame) – Input DataFrame with exactly one column (column index 0), where each row contains a raw record string.
order_specs (dict[str, OrderSpec]) – Mapping of section names to parsing specifications. Each specification defines the header configuration, element layout, and parsing mode for a section.
sections (iterable of str or None) – Section names to include. If None, all sections are parsed.
excludes (iterable of str or None) – Section names to exclude from parsing.

Return type:

DataFrame

Returns:

pandas.DataFrame – DataFrame constructed from parsed records. Columns are derived from element indices and may be strings or tuples.

Notes

Ignored elements (ignore=True) are skipped.
Disabled sections (disable_read=True) are included as raw strings in the output.
Missing elements are filled with False.
Object-type columns are stripped, decoded from UTF-8 if necessary, and empty strings are replaced with True.
No type conversion is performed at this stage.

Examples

Example order_specs structure:

order_specs = {
    "core": {
        "header": {
            "sentinel": None,
            "length": 108,
        },
        "elements": {
            "YR": {
                "index": ("core", "YR"),
                "field_length": 4,
                "ignore": False,
                "column_type": "Int64",
                "missing_value": None,
            },
            "MO": {
                "index": ("core", "MO"),
                "field_length": 2,
                "ignore": False,
                "column_type": "Int64",
                "missing_value": None,
            },
        },
        "is_delimited": False,
    }
}
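
To show how such a specification could drive fixed-width parsing, the sketch below slices one record string by consecutive field_length values. It is a simplification under assumptions: split_fixed_width is a hypothetical helper, the "DY" element is added for illustration only, ignored fields are assumed to still consume their width, and sentinels, delimited sections, and missing elements are not handled.

```python
order_specs = {
    "core": {
        "header": {"sentinel": None, "length": 108},
        "elements": {
            "YR": {"index": ("core", "YR"), "field_length": 4, "ignore": False},
            "MO": {"index": ("core", "MO"), "field_length": 2, "ignore": False},
            # Hypothetical ignored element, added for illustration.
            "DY": {"index": ("core", "DY"), "field_length": 2, "ignore": True},
        },
        "is_delimited": False,
    }
}


def split_fixed_width(line, spec):
    """Cut one raw record into fields keyed by element index."""
    record, position = {}, 0
    for element in spec["elements"].values():
        end = position + element["field_length"]
        if not element["ignore"]:
            record[element["index"]] = line[position:end]
        position = end  # assumption: ignored fields still occupy their width
    return record


record = split_fixed_width("20150317", order_specs["core"])
print(record)
```

The values stay as raw strings, matching the note that no type conversion happens at this stage.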

cdm_reader_mapper.mdf_reader.utils.parser.update_pd_config(pd_kwargs, config)[source]¶

Update a ParserConfig instance using pandas keyword arguments.

Currently, only the encoding option is supported. If an encoding is provided in pd_kwargs, a new ParserConfig instance is returned with the updated encoding. Otherwise, the original configuration is returned unchanged.

Parameters:

pd_kwargs (dict[str, Any]) – Keyword arguments intended for pandas I/O functions.
config (ParserConfig) – Existing parser configuration.

Return type:

ParserConfig

Returns:

ParserConfig – Updated parser configuration if applicable, otherwise the original configuration.
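
The "return a new instance or the original" behavior can be sketched with dataclasses.replace. MiniConfig is a hypothetical single-field stand-in, and whether ParserConfig is itself a dataclass is an assumption here.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MiniConfig:
    """Hypothetical stand-in for ParserConfig with a single field."""

    encoding: str


def update_encoding(pd_kwargs, config):
    encoding = pd_kwargs.get("encoding")
    if encoding is None:
        # Nothing to update: hand back the very same object.
        return config
    # Build a modified copy and leave the original untouched.
    return replace(config, encoding=encoding)


config = MiniConfig(encoding="utf-8")
updated = update_encoding({"encoding": "latin-1"}, config)
unchanged = update_encoding({"sep": ";"}, config)
print(updated, unchanged is config)
```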

cdm_reader_mapper.mdf_reader.utils.parser.update_xr_config(ds, config)[source]¶

Update a ParserConfig instance using metadata from an xarray Dataset.

This function adjusts the parser configuration based on the contents of the provided Dataset. Elements not present in the Dataset are marked as ignored, and validation rules marked as "__from_file__" are populated from Dataset variable attributes when available.

Parameters:

ds (xarray.Dataset) – Input Dataset containing data variables, dimensions, and attributes.
config (ParserConfig) – Existing parser configuration.

Return type:

ParserConfig

Returns:

ParserConfig – Updated parser configuration with modified order specifications and validation rules derived from the Dataset.

cdm_reader_mapper.mdf_reader.utils.utilities module¶

Auxiliary functions and class for reading, converting, decoding and validating MDF files.

cdm_reader_mapper.mdf_reader.utils.utilities.as_list(x)[source]¶

Ensure the input is a list; keep None as None.

Parameters: x (str, iterable, or None) – Input value to convert. Strings become single-element lists. Other iterables are converted to a list preserving iteration order. If None is passed, None is returned.
Return type: list[Any] | None
Returns: list or None – Converted list or None if input was None.

Notes

Sets are inherently unordered; the resulting list may not have a predictable order.
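
A minimal sketch of the rules above, using the hypothetical name as_list_sketch. The string check has to come before the generic iterable case, because a string is itself iterable and would otherwise be split into characters.

```python
def as_list_sketch(x):
    if x is None:
        return None
    if isinstance(x, str):
        # Wrap the whole string instead of iterating over its characters.
        return [x]
    return list(x)


print(as_list_sketch("core"))
print(as_list_sketch(("core", "c1")))
print(as_list_sketch(None))
```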

cdm_reader_mapper.mdf_reader.utils.utilities.as_path(value, name)[source]¶

Ensure the input is a Path-like object.

Parameters:

value (str or os.PathLike) – The value to convert to a Path.
name (str) – Name of the parameter, used in error messages.

Return type:

Path

Returns:

pathlib.Path – Path object representing value.

Raises:

TypeError – If value is not a string or Path-like object.
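
The check can be sketched with os.PathLike and pathlib.Path. The helper name as_path_sketch and the wording of the error message are assumptions, not the library's.

```python
import os
from pathlib import Path


def as_path_sketch(value, name):
    if isinstance(value, (str, os.PathLike)):
        return Path(value)
    # The parameter name makes the error easier to trace to its call site.
    raise TypeError(
        f"{name} must be a string or path-like object, "
        f"got {type(value).__name__}"
    )


schema_dir = as_path_sketch("schemas/icoads", "ext_schema_path")
print(schema_dir)
```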

cdm_reader_mapper.mdf_reader.utils.utilities.convert_dtypes(dtypes)[source]¶

Convert datetime columns to object dtype and return columns to parse as dates.

Parameters:

dtypes (dict[str, str]) – Dictionary mapping column names to pandas dtypes.

Return type:

tuple[dict[str, str], list[str]]

Returns:

tuple –

Updated dtypes dictionary (datetime converted to object).
List of columns originally marked as datetime.
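
As a rough sketch of this split, assuming datetime columns are marked with the dtype string "datetime" (the marker the library actually uses is not stated here) and using the hypothetical name split_datetime_columns:

```python
def split_datetime_columns(dtypes):
    """Return (dtypes with datetime replaced by object, datetime columns)."""
    updated, parse_dates = {}, []
    for column, dtype in dtypes.items():
        if dtype == "datetime":  # assumed marker for datetime columns
            parse_dates.append(column)
            updated[column] = "object"
        else:
            updated[column] = dtype
    return updated, parse_dates


updated, parse_dates = split_datetime_columns(
    {"YR": "Int64", "report_time": "datetime", "ID": "object"}
)
print(updated, parse_dates)
```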

cdm_reader_mapper.mdf_reader.utils.utilities.convert_str_boolean(x)[source]¶

Convert the string values 'True' and 'False' to Python booleans.

Parameters: x (Any) – Input value.
Return type: Any
Returns: bool or original value – True if 'True', False if 'False', else original value.

cdm_reader_mapper.mdf_reader.utils.utilities.join(col)[source]¶

Join multi-level columns as a colon-separated string.

Parameters: col (any or iterable of any) – A column name, which may be a single value or a list/tuple of values.
Return type: str
Returns: str – Colon-separated string if input is iterable, or string of the single value.
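
A minimal sketch of this behavior, using the hypothetical name join_sketch and treating only lists and tuples as multi-level labels:

```python
def join_sketch(col):
    if isinstance(col, (list, tuple)):
        # Each level is stringified before joining.
        return ":".join(str(part) for part in col)
    return str(col)


print(join_sketch(("core", "YR")))
print(join_sketch("YR"))
```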

cdm_reader_mapper.mdf_reader.utils.utilities.read_csv(filepath, delimiter=',', col_subset=None, column_names=None, **kwargs)[source]¶

Safe CSV reader that handles missing files and column subsets.

Parameters:

filepath (str or Path or None) – Path to the CSV file.
delimiter (str, default ",") – Separator of CSV columns.
col_subset (list of str, optional) – Subset of columns to read from the CSV.
column_names (pd.Index or pd.MultiIndex, optional) – Column labels for re-indexing.
**kwargs (any) – Additional keyword arguments passed to pd.read_csv.

Return type:

tuple[DataFrame | Iterable[DataFrame], dict[str, Any]]

Returns:

tuple of pd.DataFrame and dict –

The CSV as a DataFrame. Empty if file does not exist.
dictionary containing data column labels and data types.

cdm_reader_mapper.mdf_reader.utils.utilities.read_feather(filepath, col_subset=None, column_names=None, **kwargs)[source]¶

Safe Feather reader that handles missing files and column subsets.

Parameters:

filepath (str or Path or None) – Path to the Feather file.
col_subset (list of str, optional) – Subset of columns to read from the Feather file.
column_names (pd.Index or pd.MultiIndex, optional) – Column labels for re-indexing.
**kwargs (Any) – Additional keyword arguments passed to pd.read_feather.

Return type:

tuple[DataFrame | Iterable[DataFrame], dict[str, Any]]

Returns:

tuple of pd.DataFrame and dict –

The Feather file as a DataFrame. Empty if file does not exist.
Dictionary containing data column labels and data types.

cdm_reader_mapper.mdf_reader.utils.utilities.read_parquet(filepath, col_subset=None, column_names=None, **kwargs)[source]¶

Safe Parquet reader that handles missing files and column subsets.

Parameters:

filepath (str or Path or None) – Path to the Parquet file.
col_subset (list of str, optional) – Subset of columns to read from the Parquet file.
column_names (pd.Index or pd.MultiIndex, optional) – Column labels for re-indexing.
**kwargs (Any) – Additional keyword arguments passed to pd.read_parquet.

Return type:

tuple[DataFrame | Iterable[DataFrame], dict[str, Any]]

Returns:

tuple of pd.DataFrame and dict –

The PARQUET as a DataFrame. Empty if file does not exist.
dictionary containing data column labels and data types.

cdm_reader_mapper.mdf_reader.utils.utilities.remove_boolean_values(data, dtypes)[source]¶

Remove boolean values from a DataFrame and adjust dtypes.

Parameters:

data (pd.DataFrame) – Input data.
dtypes (dict) – Dictionary mapping column names to desired dtypes.

Return type:

DataFrame

Returns:

pd.DataFrame – DataFrame with booleans removed and dtype adjusted.

cdm_reader_mapper.mdf_reader.utils.utilities.update_and_select(df, subset=None, column_names=None)[source]¶

Update string column labels and select subset from DataFrame.

Parameters:

df (pd.DataFrame) – DataFrame to be updated.
subset (str or list, optional) – Column names to be selected.
column_names (pd.Index or pd.MultiIndex, optional) – Column labels for re-indexing.

Return type:

tuple[DataFrame, dict[str, Any]]

Returns:

tuple[pd.DataFrame, dict] –

DataFrame with updated column labels, restricted to the selected subset.
Dictionary containing data column labels and data types.

cdm_reader_mapper.mdf_reader.utils.utilities.update_column_labels(columns)[source]¶

Convert string column labels to tuples if needed, producing a pandas Index or MultiIndex.

This function attempts to parse each column label:

If the label is a string representation of a tuple (e.g., "('A', 'B')"), it is converted to a tuple.
If the label is a string containing a colon (e.g., "A:B"), it is split into a tuple ("A", "B").
Otherwise, the label is left unchanged.

If all resulting labels are tuples, a pandas MultiIndex is returned. Otherwise, a regular pandas Index is returned.

Parameters: columns (iterable of str or tuple) – Column labels to convert.
Return type: Index | MultiIndex
Returns: pd.Index or pd.MultiIndex – Converted column labels as a pandas Index or MultiIndex.
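
The per-label parsing rules can be sketched with the standard library alone. parse_label is a hypothetical helper, and using ast.literal_eval to read tuple strings is an assumption about how such labels could be parsed safely, not a statement about the library's code.

```python
import ast


def parse_label(label):
    if isinstance(label, str):
        if label.startswith("(") and label.endswith(")"):
            try:
                parsed = ast.literal_eval(label)
            except (ValueError, SyntaxError):
                parsed = label
            if isinstance(parsed, tuple):
                return parsed
        if ":" in label:
            return tuple(label.split(":"))
    # Tuples and plain strings are returned unchanged.
    return label


labels = [parse_label(c) for c in ["('core', 'YR')", "core:MO", ("c1", "PT")]]
all_tuples = all(isinstance(label, tuple) for label in labels)
print(labels, all_tuples)
```

When all_tuples is true, the labels are suitable for pandas.MultiIndex.from_tuples; otherwise a flat pandas.Index is the natural fit.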

cdm_reader_mapper.mdf_reader.utils.utilities.update_column_names(dtypes, col_o, col_n)[source]¶

Rename a column in a dtypes dictionary if it exists.

Parameters:

dtypes (dict or str) – Dictionary mapping column names to data types, or a string.
col_o (str) – Original column name to rename.
col_n (str) – New column name.

Return type:

dict[str, Any] | str

Returns:

dict or str – Updated dictionary with column renamed, or string unchanged.
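
A minimal sketch of the rename, using the hypothetical name rename_key. It returns a copy rather than modifying the input; whether the library mutates the dictionary in place is not stated here. Note that the renamed key moves to the end of the dictionary in this version.

```python
def rename_key(dtypes, col_o, col_n):
    if isinstance(dtypes, dict) and col_o in dtypes:
        renamed = dict(dtypes)  # shallow copy, the input stays untouched
        renamed[col_n] = renamed.pop(col_o)
        return renamed
    # Strings and dictionaries without col_o pass through unchanged.
    return dtypes


original = {"YR": "Int64", "ID": "object"}
renamed = rename_key(original, "YR", "year")
print(renamed)
```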

cdm_reader_mapper.mdf_reader.utils.utilities.update_dtypes(dtypes, columns)[source]¶

Filter dtypes dictionary to only include columns present in ‘columns’.

Parameters:

dtypes (dict or pd.Series) – Dictionary mapping column names to their data types.
columns (iterable of str) – List of columns to keep.

Return type:

dict[str, Any] | Series

Returns:

dict or pd.Series – Filtered mapping containing only keys present in ‘columns’.

cdm_reader_mapper.mdf_reader.utils.utilities.validate_arg(arg_name, arg_value, arg_type)[source]¶

Validate that the input argument is of the expected type.

Parameters:

arg_name (str) – Name of the argument.
arg_value (Any) – Value of the argument.
arg_type (type) – Expected type of the argument.

Return type:

bool

Returns:

bool – True if arg_value is of type arg_type or None.

Raises:

ValueError – If arg_value is not of type arg_type and not None.
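
The contract above (None is always accepted, a type mismatch raises ValueError) can be sketched as follows. validate_arg_sketch is a hypothetical name and the error message wording is an assumption.

```python
def validate_arg_sketch(arg_name, arg_value, arg_type):
    if arg_value is not None and not isinstance(arg_value, arg_type):
        raise ValueError(
            f"Argument {arg_name} must be of type {arg_type.__name__}, "
            f"got {type(arg_value).__name__}"
        )
    return True


print(validate_arg_sketch("sections", ["core"], list))
print(validate_arg_sketch("sections", None, list))
```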

cdm_reader_mapper.mdf_reader.utils.validators module¶

Data validation module.

cdm_reader_mapper.mdf_reader.utils.validators.validate(data, imodel, ext_table_path, attributes, disables=None)[source]¶

Validate a pandas DataFrame according to a data model and code tables.

Each column is validated based on its column_type attribute. Supports:

Numeric types: checked against valid_min and valid_max
Keys: checked against a code table
Datetime and string: validated using simple validators
Explicit boolean literals ("True"/"False") override column validation

Parameters:

data (pd.DataFrame) – Input data to validate.
imodel (str) – Name of the internal data model, e.g., ‘icoads_r300_d704’.
ext_table_path (str, optional) – Path to external code tables for validation.
attributes (dict[str, dict]) – Dictionary of column attributes (e.g., type, valid ranges, codetable).
disables (list[str], optional) – Columns to skip during validation.

Return type:

DataFrame | None

Returns:

pd.DataFrame – Boolean mask of the same shape as data. True indicates a valid entry.

cdm_reader_mapper.mdf_reader.utils.validators.validate_codes(series, code_table, column_type)[source]¶

Validate that entries in a pandas Series exist in a provided code table.

Missing values are treated as valid.

Parameters:

series (pd.Series) – Series of object values to validate.
code_table (iterable) – Allowed codes for validation.
column_type (str) – Column type for dtype lookup (via properties.pandas_dtypes).

Return type:

Series

Returns:

pd.Series – Boolean Series indicating valid entries.

cdm_reader_mapper.mdf_reader.utils.validators.validate_datetime(series)[source]¶

Validate that entries in a pandas Series can be converted to datetime.

Missing values are treated as valid.

Parameters: series (pd.Series) – Series of object values to validate.
Return type: Series
Returns: pd.Series – Boolean Series indicating valid entries.

cdm_reader_mapper.mdf_reader.utils.validators.validate_numeric(series, valid_min, valid_max)[source]¶

Validate that entries in a pandas Series are numeric and within a range.

Boolean-like strings are converted to bools.
Invalid values are marked as False; missing values (NaN) are treated as valid.

Parameters:

series (pd.Series) – Series of object values to validate.
valid_min (float) – Minimum valid value.
valid_max (float) – Maximum valid value.

Return type:

Series

Returns:

pd.Series – Boolean Series indicating valid entries.

cdm_reader_mapper.mdf_reader.utils.validators.validate_str(series)[source]¶

Validate that entries in a pandas Series are strings.

Currently all values are treated as valid.

Parameters: series (pd.Series) – Series of object values to validate.
Return type: Series
Returns: pd.Series – Boolean Series with all True.