cdm_reader_mapper.mdf_reader.utils package
Common Data Model (CDM) reader utilities.
Submodules
cdm_reader_mapper.mdf_reader.utils.convert_and_decode module
Internal pandas converting operators.
- class cdm_reader_mapper.mdf_reader.utils.convert_and_decode.Converters(dtype)
Bases: object
Registry-based converter for pandas Series.
Converts object-typed Series into numeric, datetime, or cleaned object representations based on the configured dtype.
- Parameters:
dtype (str) – Target output dtype identifier.
- object_to_datetime(data, datetime_format='%Y%m%d')
Convert object Series to pandas datetime.
Invalid values are coerced to NaT.
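The coercion described for object_to_datetime can be illustrated per value with the standard library. This is a minimal sketch and not the library's implementation: parse_or_none is a hypothetical helper, and None stands in for pandas' NaT. In pandas, a comparable effect is usually obtained with pd.to_datetime(data, format=..., errors="coerce"); whether the method does exactly that is an assumption.

```python
from datetime import datetime


def parse_or_none(value, datetime_format="%Y%m%d"):
    """Parse one value with the given format; return None if it does not parse."""
    try:
        return datetime.strptime(str(value), datetime_format)
    except ValueError:
        return None  # stand-in for NaT in this sketch


values = ["20150314", "2015-03-14", "", None]
parsed = [parse_or_none(v) for v in values]
```

Only the first entry matches the default format; the other three are coerced to the missing-value stand-in instead of raising.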
- object_to_numeric(data, scale=None, offset=None)
Convert object Series to numeric using Decimal arithmetic.
Right-hand spaces are treated as zeros.
Optional scale and offset may be applied.
Boolean values are preserved.
Invalid conversions return False.
- Parameters:
data (pd.Series) – Object-typed Series.
scale (numeric, optional) – Scale factor.
offset (numeric, optional) – Offset value.
- Return type:
- Returns:
pd.Series – Converted Series.
- class cdm_reader_mapper.mdf_reader.utils.convert_and_decode.Decoders(dtype, encoding='base36')
Bases: object
Registry-based decoder dispatcher for column-wise decoding.
Currently supports Base36 decoding for numeric-like fields.
- Parameters:
- base36(data)
Decode a pandas Series from Base36 to stringified base-10 integers.
Boolean values are preserved. Invalid values raise ValueError via int(…, 36).
- Parameters:
data (pd.Series) – Input Series containing base36-encoded values.
- Return type:
- Returns:
pd.Series – Decoded Series with stringified integers or booleans.
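The per-value behaviour described for base36 can be sketched with the built-in int(…, 36). The helper name decode_base36 is hypothetical, and the sketch works on a plain list rather than a pandas Series.

```python
def decode_base36(value):
    """Decode one Base36 value to a base-10 string; keep booleans unchanged."""
    if isinstance(value, bool):
        return value
    # int() raises ValueError when the text contains non-Base36 characters.
    return str(int(str(value), 36))


decoded = [decode_base36(v) for v in ["Z", "10", "1Z", True]]
# decoded == ["35", "36", "71", True]
```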
- cdm_reader_mapper.mdf_reader.utils.convert_and_decode.convert_and_decode(data, convert_flag=True, decode_flag=True, converter_dict=None, converter_kwargs=None, decoder_dict=None)
Convert and decode data entries by using a pre-defined data model.
Overwrites data with the converted and/or decoded values.
- Parameters:
data (pd.DataFrame) – Data to convert and decode.
convert_flag (bool, default True) – If True, apply converters to the columns defined in converter_dict.
decode_flag (bool, default True) – If True, apply decoders to the columns defined in decoder_dict.
converter_dict (dict[str, callable], optional) – Column-specific converter functions. If None, defaults to empty dict.
converter_kwargs (dict[str, dict], optional) – Keyword arguments for each converter function.
decoder_dict (dict[str, callable], optional) – Column-specific decoder functions. If None, defaults to empty dict.
- Return type:
- Returns:
pd.DataFrame – DataFrame with converted and decoded columns.
- cdm_reader_mapper.mdf_reader.utils.convert_and_decode.max_decimal_places(*decimals)
Return the maximum number of decimal places among Decimal values.
- cdm_reader_mapper.mdf_reader.utils.convert_and_decode.to_numeric(x, scale, offset)
Convert a value to a scaled Decimal with offset applied.
- Parameters:
x (Any) – Input value to convert.
scale (Decimal) – Scale factor.
offset (Decimal) – Offset value.
- Return type:
- Returns:
Decimal | bool – Converted Decimal value, boolean, or False if invalid.
Notes
Boolean values are returned unchanged.
Empty or invalid values return False.
Strings are stripped and spaces replaced with zeros.
Result is quantized to the maximum decimal precision of input, scale, or offset.
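The rules in these notes can be sketched with the standard decimal module. This is an illustration under stated assumptions, not the library's code: the names decimal_places and to_numeric_sketch are hypothetical, and the order of operations (value * scale + offset) is assumed because the description does not spell it out.

```python
from decimal import Decimal, InvalidOperation


def decimal_places(*values):
    """Return the largest number of decimal places among finite Decimal values."""
    return max(max(-v.as_tuple().exponent, 0) for v in values)


def to_numeric_sketch(x, scale, offset):
    """Scale and offset one value following the documented rules (sketch only)."""
    if isinstance(x, bool):
        return x  # booleans are returned unchanged
    if isinstance(x, str):
        x = x.strip().replace(" ", "0")  # strip, then inner spaces become zeros
    if x is None or x == "":
        return False  # empty values are invalid
    try:
        value = Decimal(str(x))
    except InvalidOperation:
        return False  # text that is not a number is invalid
    if not value.is_finite():
        return False
    places = decimal_places(value, scale, offset)
    result = value * scale + offset  # assumed order of operations
    return result.quantize(Decimal(1).scaleb(-places))


scaled = to_numeric_sketch("12 4", Decimal("0.1"), Decimal("0"))   # Decimal("120.4")
shifted = to_numeric_sketch(" 25", Decimal("1"), Decimal("273.15"))  # Decimal("298.15")
```

Using Decimal rather than float keeps the number of decimal places exact, which is what makes the final quantize step meaningful.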
cdm_reader_mapper.mdf_reader.utils.filereader module
Auxiliary functions and class for reading, converting, decoding and validating MDF files.
- class cdm_reader_mapper.mdf_reader.utils.filereader.FileReader(imodel, ext_schema_path=None, ext_schema_file=None)
Bases: object
Class to read marine-meteorological data.
Provides a high-level interface to read, parse, filter, convert, decode, and validate data from multiple sources (FWF, CSV, NetCDF).
- Parameters:
- open_data(source, open_with='pandas', pd_kwargs=None, xr_kwargs=None, convert_kwargs=None, decode_kwargs=None, validate_kwargs=None, select_kwargs=None)
Open and parse source data according to parser configuration.
- Parameters:
source (str) – Path or pattern for input file(s).
open_with (str) – Parser backend: 'pandas' or 'netcdf'.
pd_kwargs (dict, optional) – Additional keyword arguments for parsing pandas-readable data.
xr_kwargs (dict, optional) – Additional keyword arguments for parsing xarray-readable data.
convert_kwargs (dict, optional) – Additional keyword arguments for data conversion.
decode_kwargs (dict, optional) – Additional keyword arguments for data decoding.
validate_kwargs (dict, optional) – Additional keyword arguments for data validation.
select_kwargs (dict, optional) – Additional keyword arguments for selecting/filtering data.
- Return type:
tuple[DataFrame, DataFrame, ParserConfig] | tuple[Iterable[DataFrame], Iterable[DataFrame], ParserConfig]
- Returns:
tuple – (data, mask, config), or the chunked equivalents (Iterable[pd.DataFrame]) when the data is read in chunks.
- read(source, pd_kwargs=None, xr_kwargs=None, convert_kwargs=None, decode_kwargs=None, validate_kwargs=None, select_kwargs=None)
Read and process data from the given source.
- Parameters:
source (str) – Path to input file(s).
pd_kwargs (dict, optional) – Additional keyword arguments for parsing pandas-readable data.
xr_kwargs (dict, optional) – Additional keyword arguments for parsing xarray-readable data.
convert_kwargs (dict, optional) – Additional keyword arguments for data conversion.
decode_kwargs (dict, optional) – Additional keyword arguments for data decoding.
validate_kwargs (dict, optional) – Additional keyword arguments for data validation.
select_kwargs (dict, optional) – Additional keyword arguments for selecting/filtering data.
- Return type:
- Returns:
DataBundle – Container with processed data, mask, columns, dtypes, and metadata.
Notes
All kwargs are forwarded to open_data to customize the parsing, conversion, decoding, validation, and selection steps.
cdm_reader_mapper.mdf_reader.utils.parser module
Auxiliary functions and class for reading, converting, decoding and validating MDF files.
- class cdm_reader_mapper.mdf_reader.utils.parser.OrderSpec
Bases: TypedDict
Parsing specification for a single section.
Defines the header configuration, element layout, and parsing mode (fixed-width or delimited) for a section.
- class cdm_reader_mapper.mdf_reader.utils.parser.ParserConfig(order_specs, disable_reads, dtypes, parse_dates, convert_decode, validation, encoding, columns=None)
Bases: object
Configuration for dataset parsing.
- Variables:
order_specs (dict) – Column ordering specifications.
disable_reads (list[str]) – Columns or sources to skip during parsing.
dtypes (dict) – Column data type mappings.
parse_dates (list[str]) – Columns to parse as datetimes.
convert_decode (dict) – Value conversion or decoding rules.
validation (dict) – Validation rules for parsed data.
encoding (str) – Text encoding used when reading input data.
columns (pd.Index or pd.MultiIndex or None, optional) – Explicit column index to apply. If None, inferred from input.
- cdm_reader_mapper.mdf_reader.utils.parser.build_parser_config(imodel=None, ext_schema_path=None, ext_schema_file=None)
Build a ParserConfig from a normalized schema definition.
This function reads a schema definition and constructs a fully populated ParserConfig instance. The resulting configuration contains parsing order specifications, data types, converters, decoders, validation rules, and encoding information required to parse raw input records.
- Parameters:
- Return type:
- Returns:
ParserConfig – Fully initialized parser configuration derived from the schema.
Notes
Section parsing order is derived from schema["header"]["parsing_order"].
Sections marked with disable_read=True are recorded in ParserConfig.disable_reads.
Elements marked as ignored or disabled are excluded from dtype, conversion, and validation setup.
Column indices may be strings or tuples depending on the number of sections in the schema.
Deprecated or aliased column types are normalized via _convert_dtype_to_default.
Converter and decoder functions are resolved dynamically based on column type and encoding.
Validation rules may include value ranges and code tables, as defined in the schema.
- cdm_reader_mapper.mdf_reader.utils.parser.parse_netcdf(ds, order_specs, sections=None, excludes=None)
Parse an xarray Dataset into a pandas DataFrame based on order specifications.
This function converts an xarray Dataset into a tabular pandas DataFrame according to parsing rules defined in order_specs. Data variables, dimensions, and global attributes are mapped to columns as specified, with ignored or missing elements handled automatically.
- Parameters:
ds (xarray.Dataset) – Input Dataset containing data variables, dimensions, and attributes.
order_specs (dict[str, OrderSpec]) – Mapping of section names to parsing specifications. Each specification defines the header configuration, element layout, and parsing mode for a section.
sections (iterable of str or None) – Section names to include. If None, all sections are parsed.
excludes (iterable of str or None) – Section names to exclude from parsing.
- Return type:
- Returns:
pandas.DataFrame – DataFrame constructed from the Dataset according to the parsing specification. Columns are derived from element indices. Missing fields are filled with False, disabled sections with NaN, and empty strings are converted to True.
Notes
Variables, dimensions, and global attributes in ds are mapped to columns according to the element index.
Ignored elements (ignore=True) are skipped.
Disabled sections (disable_read=True) are added as columns filled with NaN.
Missing elements are added as columns filled with False.
Object-type columns are decoded from UTF-8, stripped, and empty strings replaced with True.
Examples
Example order_specs structure:
    order_specs = {
        "global_attributes": {
            "header": {
                "disable_read": True,
            },
            "elements": {
                "title": {
                    "index": ("global_attributes", "title"),
                    "ignore": False,
                    "column_type": "str",
                    "missing_value": None,
                },
                "institution": {
                    "index": ("global_attributes", "institution"),
                    "ignore": False,
                    "column_type": "str",
                    "missing_value": None,
                },
            },
            "is_delimited": False,
        }
    }
- cdm_reader_mapper.mdf_reader.utils.parser.parse_pandas(df, order_specs, sections=None, excludes=None)
Parse a pandas DataFrame containing raw record lines.
Each row of the input DataFrame is expected to contain a single fixed-width or delimiter-separated record, which is parsed according to the provided order specifications.
- Parameters:
df (pandas.DataFrame) – Input DataFrame with exactly one column (column index 0), where each row contains a raw record string.
order_specs (dict[str, OrderSpec]) – Mapping of section names to parsing specifications. Each specification defines the header configuration, element layout, and parsing mode for a section.
sections (iterable of str or None) – Section names to include. If None, all sections are parsed.
excludes (iterable of str or None) – Section names to exclude from parsing.
- Return type:
- Returns:
pandas.DataFrame – DataFrame constructed from parsed records. Columns are derived from element indices and may be strings or tuples.
Notes
Ignored elements (ignore=True) are skipped.
Disabled sections (disable_read=True) are included as raw strings in the output.
Missing elements are filled with False.
Object-type columns are stripped, decoded from UTF-8 if necessary, and empty strings are replaced with True.
No type conversion is performed at this stage.
Examples
Example order_specs structure:
    order_specs = {
        "core": {
            "header": {
                "sentinel": None,
                "length": 108,
            },
            "elements": {
                "YR": {
                    "index": ("core", "YR"),
                    "field_length": 4,
                    "ignore": False,
                    "column_type": "Int64",
                    "missing_value": None,
                },
                "MO": {
                    "index": ("core", "MO"),
                    "field_length": 2,
                    "ignore": False,
                    "column_type": "Int64",
                    "missing_value": None,
                },
            },
            "is_delimited": False,
        }
    }
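How a fixed-width record might be cut according to field_length entries like those in this structure can be sketched as follows. The function parse_fixed_width is hypothetical and much simpler than the library's parser: it handles a single section, ignores header settings such as sentinel and length, and assumes that an element counts as missing when the record ends before its field starts.

```python
def parse_fixed_width(line, elements):
    """Slice one raw record into fields keyed by each element's index (sketch)."""
    parsed = {}
    position = 0
    for spec in elements.values():
        raw = line[position:position + spec["field_length"]]
        position += spec["field_length"]
        if spec.get("ignore"):
            continue  # ignored elements are skipped
        if raw == "":
            parsed[spec["index"]] = False  # assumed meaning of a missing element
        else:
            value = raw.strip()
            parsed[spec["index"]] = value if value else True  # blank field -> True
    return parsed


elements = {
    "YR": {"index": ("core", "YR"), "field_length": 4, "ignore": False},
    "MO": {"index": ("core", "MO"), "field_length": 2, "ignore": False},
}
full = parse_fixed_width("201503", elements)
blank_month = parse_fixed_width("2015  ", elements)
short_record = parse_fixed_width("2015", elements)
```

The values stay as strings, in line with the note that no type conversion happens at this stage.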
- cdm_reader_mapper.mdf_reader.utils.parser.update_pd_config(pd_kwargs, config)
Update a ParserConfig instance using pandas keyword arguments.
Currently, only the encoding option is supported. If an encoding is provided in pd_kwargs, a new ParserConfig instance is returned with the updated encoding. Otherwise, the original configuration is returned unchanged.
- Parameters:
pd_kwargs (dict[str, Any]) – Keyword arguments intended for pandas I/O functions.
config (ParserConfig) – Existing parser configuration.
- Return type:
- Returns:
ParserConfig – Updated parser configuration if applicable, otherwise the original configuration.
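The return-a-new-instance behaviour can be sketched with a frozen dataclass and dataclasses.replace. MiniConfig and update_encoding are hypothetical stand-ins; whether ParserConfig is itself a dataclass, and whether an explicit encoding=None counts as "not provided", are assumptions of this sketch.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MiniConfig:
    """Hypothetical, reduced stand-in for ParserConfig."""
    encoding: str
    parse_dates: tuple = ()


def update_encoding(pd_kwargs, config):
    """Return a copy with the new encoding, or the same object if none is given."""
    encoding = pd_kwargs.get("encoding")
    if encoding is None:
        return config  # nothing to update: hand back the original instance
    return replace(config, encoding=encoding)


original = MiniConfig(encoding="utf-8")
updated = update_encoding({"encoding": "latin-1"}, original)
untouched = update_encoding({"sep": ","}, original)
```

Because the original instance is never mutated, callers holding a reference to it keep seeing the old encoding.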
- cdm_reader_mapper.mdf_reader.utils.parser.update_xr_config(ds, config)
Update a ParserConfig instance using metadata from an xarray Dataset.
This function adjusts the parser configuration based on the contents of the provided Dataset. Elements not present in the Dataset are marked as ignored, and validation rules marked as "__from_file__" are populated from Dataset variable attributes when available.
- Parameters:
ds (xarray.Dataset) – Input Dataset containing data variables, dimensions, and attributes.
config (ParserConfig) – Existing parser configuration.
- Return type:
- Returns:
ParserConfig – Updated parser configuration with modified order specifications and validation rules derived from the Dataset.
cdm_reader_mapper.mdf_reader.utils.utilities module
Auxiliary functions and class for reading, converting, decoding and validating MDF files.
- cdm_reader_mapper.mdf_reader.utils.utilities.as_list(x)
Ensure the input is a list; keep None as None.
- Parameters:
x (str, iterable, or None) – Input value to convert. Strings become single-element lists. Other iterables are converted to a list preserving iteration order. If None is passed, None is returned.
- Return type:
- Returns:
Notes
Sets are inherently unordered; the resulting list may not have a predictable order.
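The three cases described for as_list can be sketched in a few lines. The name as_list_sketch is hypothetical; the real function may differ in details.

```python
def as_list_sketch(x):
    """Wrap strings, convert other iterables, and pass None through."""
    if x is None:
        return None
    if isinstance(x, str):
        return [x]  # a string is one item, not a sequence of characters
    return list(x)


single = as_list_sketch("core")         # ["core"]
several = as_list_sketch(("c1", "c5"))  # ["c1", "c5"]
nothing = as_list_sketch(None)          # None
```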
- cdm_reader_mapper.mdf_reader.utils.utilities.as_path(value, name)
Ensure the input is a Path-like object.
- Parameters:
value (str or os.PathLike) – The value to convert to a Path.
name (str) – Name of the parameter, used in error messages.
- Return type:
- Returns:
pathlib.Path – Path object representing value.
- Raises:
TypeError – If value is not a string or Path-like object.
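A minimal sketch of this check, using only pathlib and os.PathLike. The name as_path_sketch and the wording of the error message are illustrative assumptions.

```python
import os
from pathlib import Path


def as_path_sketch(value, name):
    """Return value as a Path, or raise TypeError naming the offending parameter."""
    if isinstance(value, (str, os.PathLike)):
        return Path(value)
    raise TypeError(
        f"{name} must be a string or path-like object, got {type(value).__name__}"
    )


schema_path = as_path_sketch("schemas/custom", "ext_schema_path")
```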
- cdm_reader_mapper.mdf_reader.utils.utilities.convert_dtypes(dtypes)
Convert datetime columns to object dtype and return columns to parse as dates.
- cdm_reader_mapper.mdf_reader.utils.utilities.convert_str_boolean(x)
Convert string boolean values 'True'/'False' to Python booleans.
- cdm_reader_mapper.mdf_reader.utils.utilities.join(col)
Join multi-level columns as a colon-separated string.
- cdm_reader_mapper.mdf_reader.utils.utilities.read_csv(filepath, delimiter=',', col_subset=None, column_names=None, **kwargs)
Safe CSV reader that handles missing files and column subsets.
- Parameters:
- Return type:
- Returns:
tuple of pd.DataFrame and dict –
The CSV as a DataFrame. Empty if file does not exist.
Dictionary containing data column labels and data types.
- cdm_reader_mapper.mdf_reader.utils.utilities.read_feather(filepath, col_subset=None, column_names=None, **kwargs)
Safe Feather reader that handles missing files and column subsets.
- cdm_reader_mapper.mdf_reader.utils.utilities.read_parquet(filepath, col_subset=None, column_names=None, **kwargs)
Safe Parquet reader that handles missing files and column subsets.
- Parameters:
- Return type:
- Returns:
tuple of pd.DataFrame and dict –
The Parquet file as a DataFrame. Empty if file does not exist.
Dictionary containing data column labels and data types.
- cdm_reader_mapper.mdf_reader.utils.utilities.remove_boolean_values(data, dtypes)
Remove boolean values from a DataFrame and adjust dtypes.
- cdm_reader_mapper.mdf_reader.utils.utilities.update_and_select(df, subset=None, column_names=None)
Update string column labels and select subset from DataFrame.
- Parameters:
- Return type:
- Returns:
tuple[pd.DataFrame, dict] –
The DataFrame with updated column labels, restricted to the selected subset.
Dictionary containing data column labels and data types.
- cdm_reader_mapper.mdf_reader.utils.utilities.update_column_labels(columns)
Convert string column labels to tuples if needed, producing a pandas Index or MultiIndex.
This function attempts to parse each column label:
- If the label is a string representation of a tuple (e.g., "('A','B')"), it will be converted to a tuple.
- If the label is a string containing a colon (e.g., "A:B"), it will be split into a tuple ("A", "B").
- Otherwise, the label is left unchanged.
If all resulting labels are tuples, a pandas MultiIndex is returned. Otherwise, a regular pandas Index is returned.
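The label-parsing rules can be sketched without pandas. Both helper names are hypothetical, ast.literal_eval is an assumed way of reading the tuple representation, and the sketch returns a plain list plus a flag where the real function builds a pandas Index or MultiIndex.

```python
import ast


def parse_label(label):
    """Turn "('A','B')" or "A:B" into a tuple; leave other labels unchanged."""
    if not isinstance(label, str):
        return label
    text = label.strip()
    if text.startswith("(") and text.endswith(")"):
        try:
            parsed = ast.literal_eval(text)
        except (ValueError, SyntaxError):
            parsed = None
        if isinstance(parsed, tuple):
            return parsed
    if ":" in text:
        return tuple(text.split(":"))
    return label


def parse_labels(columns):
    """Parse all labels and report whether every result is a tuple."""
    labels = [parse_label(column) for column in columns]
    all_tuples = bool(labels) and all(isinstance(lab, tuple) for lab in labels)
    return labels, all_tuples


multi = parse_labels(["('core', 'YR')", "core:MO"])
mixed = parse_labels(["YR", "core:MO"])
```

The flag corresponds to the decision between the two return types: only when every label is a tuple would a MultiIndex be appropriate.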
- cdm_reader_mapper.mdf_reader.utils.utilities.update_column_names(dtypes, col_o, col_n)
Rename a column in a dtypes dictionary if it exists.
- cdm_reader_mapper.mdf_reader.utils.utilities.update_dtypes(dtypes, columns)
Filter dtypes dictionary to only include columns present in 'columns'.
- cdm_reader_mapper.mdf_reader.utils.utilities.validate_arg(arg_name, arg_value, arg_type)
Validate that the input argument is of the expected type.
- Parameters:
- Return type:
- Returns:
bool – True if arg_value is of type arg_type or None.
- Raises:
ValueError – If arg_value is not of type arg_type and not None.
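The documented contract (True for a matching type or None, ValueError otherwise) can be sketched as below. The name validate_arg_sketch and the message text are illustrative assumptions.

```python
def validate_arg_sketch(arg_name, arg_value, arg_type):
    """Return True for None or a matching type; raise ValueError otherwise."""
    if arg_value is None or isinstance(arg_value, arg_type):
        return True
    raise ValueError(
        f"Argument {arg_name} must be of type {arg_type.__name__}, "
        f"got {type(arg_value).__name__}"
    )


ok_value = validate_arg_sketch("chunksize", 1000, int)
ok_none = validate_arg_sketch("chunksize", None, int)
```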
cdm_reader_mapper.mdf_reader.utils.validators module
Data validation module.
- cdm_reader_mapper.mdf_reader.utils.validators.validate(data, imodel, ext_table_path, attributes, disables=None)
Validate a pandas DataFrame according to a data model and code tables.
- Each column is validated based on its column_type attribute. Supports:
Numeric types: checked against valid_min and valid_max
Keys: checked against a code table
Datetime and string: validated using simple validators
Explicit boolean literals ("True"/"False") override column validation
- Parameters:
data (pd.DataFrame) – Input data to validate.
imodel (str) – Name of the internal data model, e.g., 'icoads_r300_d704'.
ext_table_path (str, optional) – Path to external code tables for validation.
attributes (dict[str, dict]) – Dictionary of column attributes (e.g., type, valid ranges, codetable).
disables (list[str], optional) – Columns to skip during validation.
- Return type:
- Returns:
pd.DataFrame – Boolean mask of the same shape as data. True indicates a valid entry.
- cdm_reader_mapper.mdf_reader.utils.validators.validate_codes(series, code_table, column_type)
Validate that entries in a pandas Series exist in a provided code table.
Missing values are treated as valid.
- cdm_reader_mapper.mdf_reader.utils.validators.validate_datetime(series)
Validate that entries in a pandas Series can be converted to datetime.
Missing values are treated as valid.
- Parameters:
series (pd.Series) – Series of object values to validate.
- Return type:
- Returns:
pd.Series – Boolean Series indicating valid entries.
- cdm_reader_mapper.mdf_reader.utils.validators.validate_numeric(series, valid_min, valid_max)
Validate that entries in a pandas Series are numeric and within a range.
Converts boolean-like strings to bools.
Invalid values are marked as False; missing values (NaN) are treated as valid.
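A per-value sketch of such a range check follows. Several points are assumptions rather than documented behaviour: the bounds are taken as inclusive, a boolean (including one converted from "True"/"False") is used directly as the verdict, and the helper name is_valid_numeric is hypothetical. The real function operates on a whole pandas Series.

```python
import math


def is_valid_numeric(value, valid_min, valid_max):
    """Return True if value is missing or a number within [valid_min, valid_max]."""
    if value is None:
        return True  # missing values pass validation
    if value in ("True", "False"):
        value = value == "True"  # boolean-like strings become booleans
    if isinstance(value, bool):
        return value  # assumed: the flag itself is the verdict
    try:
        number = float(value)
    except (TypeError, ValueError):
        return False  # not numeric
    if math.isnan(number):
        return True  # NaN counts as missing
    return valid_min <= number <= valid_max


months = ["12", "13", "abc", float("nan"), None, "False"]
mask = [is_valid_numeric(m, 1, 12) for m in months]
```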
- cdm_reader_mapper.mdf_reader.utils.validators.validate_str(series)
Validate that entries in a pandas Series are strings.
Currently all values are treated as valid.
- Parameters:
series (pd.Series) – Series of object values to validate.
- Return type:
- Returns:
pd.Series – Boolean Series with all True.