cdm_reader_mapper package

Common Data Model (CDM) reader and mapper package.

class cdm_reader_mapper.DataBundle(data=None, columns=None, dtypes=None, parse_dates=None, encoding=None, mask=None, imodel=None, mode='data')[source]

Bases: object

Container for tabular data and associated metadata.

This class wraps either an in-memory pd.DataFrame or a ParquetStreamReader for chunked, disk-backed processing. It provides a unified interface for accessing DataFrame-like attributes and methods, transparently handling streaming data where required.

Parameters:
  • data (pandas.DataFrame or Iterable[pandas.DataFrame] or ParquetStreamReader, optional) – Input data. If an iterable is provided, it is converted into a ParquetStreamReader for streaming.

  • columns (pandas.Index or pandas.MultiIndex or list, optional) – Column labels used when initializing empty data.

  • dtypes (pandas.Series or dict, optional) – Data types for columns.

  • parse_dates (list or bool, optional) – Instructions for parsing dates.

  • encoding (str, optional) – Encoding associated with the data.

  • mask (pandas.DataFrame or Iterable[pandas.DataFrame] or ParquetStreamReader, optional) – Boolean mask aligned with data. If not provided, an empty mask is created.

  • imodel (str, optional) – Name of the input data model.

  • mode ({"data", "tables"}, default "data") – Data representation mode.

Examples

Getting a DataBundle while reading data from disk.

>>> from cdm_reader_mapper import read_mdf
>>> db = read_mdf(source="file_on_disk", imodel="custom_model_name")

Constructing a DataBundle from already read MDF data.

>>> from cdm_reader_mapper import DataBundle
>>> read = read_mdf(source="file_on_disk", imodel="custom_model_name")
>>> data_ = read.data
>>> mask_ = read.mask
>>> db = DataBundle(data=data_, mask=mask_)

Constructing a DataBundle from already read CDM data.

>>> from cdm_reader_mapper import read_tables
>>> tables = read_tables("path_to_files").data
>>> db = DataBundle(data=tables, mode="tables")
add(addition, inplace=False)[source]

Adding information to a DataBundle.

Parameters:
Return type:

DataBundle | None

Returns:

DataBundle or None – DataBundle with added information or None if “inplace=True”.

Examples

>>> tables = read_tables("path_to_files")
>>> db = db.add({"data": tables})
property columns: pandas.core.indexes.base.Index | pandas.core.indexes.multi.MultiIndex

Column labels of data.

Returns:

pd.Index or pd.MultiIndex – Column labels of the underlying MDF data.

copy()[source]

Make a deep copy of a DataBundle.

Return type:

DataBundle

Returns:

DataBundle – Copy of a DataBundle.

Examples

>>> db2 = db.copy()
correct_datetime(imodel=None, inplace=False, **kwargs)[source]

Correct datetime information in data.

Parameters:
  • imodel (str, optional) – Name of the MDF/CDM data model.

  • inplace (bool, default: False) – If True overwrite data in DataBundle else return a copy of DataBundle with datetime-corrected values in data.

  • **kwargs (Any) – Additional keyword-arguments for correcting datetime.

Return type:

DataBundle | None

Returns:

DataBundle or None – DataBundle with corrected datetime information or None if “inplace=True”.

See also

DataBundle.correct_pt

Correct platform type information in data.

DataBundle.validate_datetime

Validate datetime information in data.

DataBundle.validate_id

Validate station id information in data.

Notes

For more information see correct_datetime()

Examples

>>> df_dt = db.correct_datetime()
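
Most methods of this class share the inplace convention described above: they return a modified copy by default and return None when inplace=True. The following plain-Python sketch illustrates that contract only. The Bundle class and its transform method are hypothetical stand-ins, not part of cdm_reader_mapper.

```python
import copy


class Bundle:
    """Hypothetical stand-in for a DataBundle-like container."""

    def __init__(self, data):
        self.data = list(data)

    def transform(self, func, inplace=False):
        """Apply func to every value, honouring the inplace convention."""
        # Work on self when inplace=True, otherwise on a deep copy.
        target = self if inplace else copy.deepcopy(self)
        target.data = [func(value) for value in target.data]
        # Mirror the documented contract: None when inplace=True.
        return None if inplace else target


original = Bundle([1, 2, 3])
doubled = original.transform(lambda v: v * 2)  # new object, original untouched
result = original.transform(lambda v: v + 1, inplace=True)  # modifies original
```

Under this convention, assigning the return value of a call made with inplace=True (for example db = db.correct_datetime(inplace=True)) would rebind db to None.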
correct_pt(imodel=None, inplace=False, **kwargs)[source]

Correct platform type information in data.

Parameters:
  • imodel (str, optional) – Name of the MDF/CDM data model.

  • inplace (bool, default: False) – If True overwrite data in DataBundle else return a copy of DataBundle with platform-corrected values in data.

  • **kwargs (Any) – Additional keyword-arguments for correcting platform type.

Return type:

DataBundle | None

Returns:

DataBundle or None – DataBundle with corrected platform type information or None if “inplace=True”.

See also

DataBundle.correct_datetime

Correct datetime information in data.

DataBundle.validate_id

Validate station id information in data.

DataBundle.validate_datetime

Validate datetime information in data.

Notes

For more information see correct_pt()

Examples

>>> df_pt = db.correct_pt()
property data: pandas.core.frame.DataFrame | ParquetStreamReader

Underlying MDF data.

Returns:

pd.DataFrame or ParquetStreamReader – Underlying MDF data.

property dtypes: pandas.core.series.Series | dict[str, Any] | None

Dictionary of data types on data.

Returns:

pd.Series or dict or None – Data types of underlying MDF data.

duplicate_check(inplace=False, **kwargs)[source]

Duplicate check in data.

Parameters:
  • inplace (bool, default: False) – If True overwrite data in DataBundle else return a copy of DataBundle containing the new DupDetect instance.

  • **kwargs (Any) – Additional keyword-arguments for duplicate check.

Return type:

DataBundle | None

Returns:

DataBundle or None – DataBundle containing new DupDetect class for further duplicate check methods or None if “inplace=True”.

See also

DataBundle.get_duplicates

Get duplicate matches in data.

DataBundle.flag_duplicates

Flag detected duplicates in data.

DataBundle.remove_duplicates

Remove detected duplicates in data.

Notes

Following columns have to be provided:

  • longitude

  • latitude

  • primary_station_id

  • report_timestamp

  • station_course

  • station_speed

This adds a new class DupDetect to DataBundle. This class is necessary for further duplicate check methods.

For more information see duplicate_check()

Examples

>>> db = db.duplicate_check()
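
The call order matters: get_duplicates, flag_duplicates and remove_duplicates raise a RuntimeError unless a duplicate check has been run first. The sketch below mimics that guard in plain Python. ToyBundle is a hypothetical class, and its exact-match comparison is only a placeholder for the recordlinkage-based comparison used by the package.

```python
class ToyBundle:
    """Hypothetical container that mimics the documented call order."""

    def __init__(self, rows):
        self.rows = rows
        self._matches = None  # filled by duplicate_check()

    def duplicate_check(self):
        # Placeholder comparison: rows are duplicates if all entries are equal.
        first_seen = {}
        matches = []
        for idx, row in enumerate(self.rows):
            key = tuple(sorted(row.items()))
            if key in first_seen:
                matches.append((first_seen[key], idx))
            else:
                first_seen[key] = idx
        self._matches = matches
        return self

    def get_duplicates(self):
        # Guard: refuse to answer before a duplicate check has been done.
        if self._matches is None:
            raise RuntimeError("Run duplicate_check() before get_duplicates().")
        return self._matches


bundle = ToyBundle(
    [
        {"primary_station_id": "A", "latitude": 1.0},
        {"primary_station_id": "B", "latitude": 2.0},
        {"primary_station_id": "A", "latitude": 1.0},
    ]
)
try:
    bundle.get_duplicates()
    raised = False
except RuntimeError:
    raised = True

pairs = bundle.duplicate_check().get_duplicates()
```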
property encoding: str | None

A string representing the encoding to use in the data.

Returns:

str or None – String representing the encoding to use in the underlying MDF data.

See also

pd.to_csv()

Write data with encoding to CSV file.

flag_duplicates(inplace=False, **kwargs)[source]

Flag detected duplicates in data.

Parameters:
  • inplace (bool, default: False) – If True overwrite data in DataBundle else return a copy of DataBundle with data containing flagged duplicates.

  • **kwargs (Any) – Additional keyword-arguments for flagging duplicates.

Return type:

DataBundle | None

Returns:

DataBundle or None – DataBundle containing duplicate flags in data or None if “inplace=True”.

Raises:

RuntimeError – Before flagging duplicates, a duplicate check has to be done, DataBundle.duplicate_check().

See also

DataBundle.remove_duplicates

Remove detected duplicates in data.

DataBundle.get_duplicates

Get duplicate matches in data.

DataBundle.duplicate_check

Duplicate check in data.

Notes

For more information see DupDetect.flag_duplicates()

Examples

Flag duplicates without overwriting data.

>>> flagged_tables = db.flag_duplicates()

Flag duplicates with overwriting data.

>>> db.flag_duplicates(inplace=True)
>>> flagged_tables = db.data
get_duplicates(**kwargs)[source]

Get duplicate matches in data.

Parameters:

**kwargs (Any) – Additional keyword-arguments used for getting duplicates.

Return type:

DataFrame

Returns:

pd.DataFrame – DataFrame containing duplicate matches.

Raises:

RuntimeError – Before getting duplicates, a duplicate check has to be done, DataBundle.duplicate_check().

See also

DataBundle.remove_duplicates

Remove detected duplicates in data.

DataBundle.flag_duplicates

Flag detected duplicates in data.

DataBundle.duplicate_check

Duplicate check in data.

Notes

For more information see DupDetect.get_duplicates()

Examples

>>> matches = db.get_duplicates()
property imodel: str | None

Name of the MDF/CDM input model.

Returns:

str or None – Name of the MDF/CDM input model if available.

map_model(imodel=None, inplace=False, **kwargs)[source]

Map data to the Common Data Model.

Parameters:
  • imodel (str, optional) – Name of the MDF/CDM data model.

  • inplace (bool, default: False) – If True overwrite data in DataBundle else return a copy of DataBundle with data as CDM tables.

  • **kwargs (Any) – Additional keyword-arguments for mapping to CDM.

Return type:

DataBundle | None

Returns:

DataBundle or None – DataBundle containing data mapped to the CDM or None if inplace=True.

Notes

For more information see map_model()

Examples

>>> cdm_tables = db.map_model()
property mask: pandas.core.frame.DataFrame | ParquetStreamReader

MDF validation mask.

Returns:

pd.DataFrame or ParquetStreamReader – Validation mask of the underlying MDF data.

property mode: str

Data mode.

Returns:

str – Current data mode.

Raises:

TypeError – If mode of the underlying data is not a string.

property parse_dates: list[Any] | bool | None

Information of how to parse dates in data.

Returns:

list or bool or None – Information of how to parse dates in underlying MDF data.

See also

pd.read_csv()

Read CSV file using pandas.

remove_duplicates(inplace=False, **kwargs)[source]

Remove detected duplicates in data.

Parameters:
  • inplace (bool, default: False) – If True overwrite data in DataBundle else return a copy of DataBundle with data containing no duplicates.

  • **kwargs (Any) – Additional keyword-arguments used to remove duplicates.

Return type:

DataBundle | None

Returns:

DataBundle or None – DataBundle without duplicated rows or None if “inplace=True”.

Raises:

RuntimeError – Before removing duplicates, a duplicate check has to be done, DataBundle.duplicate_check().

See also

DataBundle.flag_duplicates

Flag detected duplicates in data.

DataBundle.get_duplicates

Get duplicate matches in data.

DataBundle.duplicate_check

Duplicate check in data.

Notes

For more information see DupDetect.remove_duplicates()

Examples

Remove duplicates without overwriting data.

>>> removed_tables = db.remove_duplicates()

Remove duplicates with overwriting data.

>>> db.remove_duplicates(inplace=True)
>>> removed_tables = db.data
replace_columns(df_corr, subset=None, inplace=False, **kwargs)[source]

Replace columns in data.

Parameters:
  • df_corr (pd.DataFrame) – Data used to replace columns in data.

  • subset (str or list of str, optional) – Select subset by columns. This option is useful for multi-indexed data.

  • inplace (bool, default: False) – If True overwrite data in DataBundle else return a copy of DataBundle with replaced column names in data.

  • **kwargs (Any) – Additional keyword-arguments for replacing columns.

Return type:

DataBundle | None

Returns:

DataBundle or None – DataBundle with replaced column names or None if “inplace=True”.

Notes

For more information see replace_columns()

Examples

>>> import pandas as pd
>>> df_corr = pd.read_csv("correction_file_on_disk")
>>> df_repl = db.replace_columns(df_corr)
select_where_all_false(inplace=False, do_mask=True, **kwargs)[source]

Select rows from data where all column entries in mask are False.

Parameters:
  • inplace (bool, default: False) – If True overwrite data in DataBundle else return a copy of DataBundle with invalid values only in data.

  • do_mask (bool, default: True) – If True also do selection on mask.

  • **kwargs (Any) – Additional keyword-arguments for splitting data where all entries are False.

Return type:

DataBundle | None

Returns:

DataBundle or None – DataBundle containing rows where all column entries in mask are False or None if inplace=True.

See also

DataBundle.select_where_all_true

Select rows from data where all entries in mask are True.

DataBundle.select_where_entry_isin

Select rows from data where column entries are in a specific value list.

DataBundle.select_where_index_isin

Select rows from data within specific index list.

Notes

For more information see split_by_boolean_false()

Examples

Select without overwriting the old data.

>>> db_selected = db.select_where_all_false()

Select with overwriting the old data.

>>> db.select_where_all_false(inplace=True)
>>> df_selected = db.data
select_where_all_true(inplace=False, do_mask=True, **kwargs)[source]

Select rows from data where all column entries in mask are True.

Parameters:
  • inplace (bool, default: False) – If True overwrite data in DataBundle else return a copy of DataBundle with valid values only in data.

  • do_mask (bool, default: True) – If True also do selection on mask.

  • **kwargs (Any) – Additional keyword-arguments for splitting data where all entries are True.

Return type:

DataBundle | None

Returns:

DataBundle or None – DataBundle containing rows where all column entries in mask are True or None if inplace=True.

See also

DataBundle.select_where_all_false

Select rows from data where all entries in mask are False.

DataBundle.select_where_entry_isin

Select rows from data where column entries are in a specific value list.

DataBundle.select_where_index_isin

Select rows from data within specific index list.

Notes

For more information see split_by_boolean_true()

Examples

Select without overwriting the old data.

>>> db_selected = db.select_where_all_true()

Select with overwriting the old data.

>>> db.select_where_all_true(inplace=True)
>>> df_selected = db.data
select_where_entry_isin(selection, inplace=False, do_mask=True, **kwargs)[source]

Select rows from data where column entries are in a specific value list.

Parameters:
  • selection (dict) – Keys: Column names in data. Values: Specific value list.

  • inplace (bool, default: False) – If True overwrite data in DataBundle else return a copy of DataBundle with selected columns only in data.

  • do_mask (bool, default: True) – If True also do selection on mask.

  • **kwargs (Any) – Additional keyword-arguments for splitting data where entries are within a specific value list.

Return type:

DataBundle | None

Returns:

DataBundle or None – DataBundle containing rows where column entries are in a specific value list or None if inplace=True.

See also

DataBundle.select_where_index_isin

Select rows from data within specific index list.

DataBundle.select_where_all_true

Select rows from data where all entries in mask are True.

DataBundle.select_where_all_false

Select rows from data where all entries in mask are False.

Notes

For more information see split_by_column_entries()

Examples

Select without overwriting the old data.

>>> db_selected = db.select_where_entry_isin(
...     selection={("c1", "B1"): [26, 41]},
... )

Select with overwriting the old data.

>>> db.select_where_entry_isin(selection={("c1", "B1"): [26, 41]}, inplace=True)
>>> df_selected = db.data
select_where_index_isin(index, inplace=False, do_mask=True, **kwargs)[source]

Select rows from data where indexes are within a specific index list.

Parameters:
  • index (list of int) – Specific index list.

  • inplace (bool, default: False) – If True overwrite data in DataBundle else return a copy of DataBundle with selected rows only in data.

  • do_mask (bool, default: True) – If True also do selection on mask.

  • **kwargs (Any) – Additional keyword-arguments for splitting data where indexes are within a specific index list.

Return type:

DataBundle | None

Returns:

DataBundle or None – DataBundle containing rows where indexes are within a specific index list or None if inplace=True.

See also

DataBundle.select_where_entry_isin

Select rows from data where column entries are in a specific value list.

DataBundle.select_where_all_true

Select rows from data where all entries in mask are True.

DataBundle.select_where_all_false

Select rows from data where all entries in mask are False.

Notes

For more information see split_by_index()

Examples

Select without overwriting the old data.

>>> db_selected = db.select_where_index_isin([0, 2, 4])

Select with overwriting the old data.

>>> db.select_where_index_isin(index=[0, 2, 4], inplace=True)
>>> df_selected = db.data
split_by_boolean_false(do_mask=True, **kwargs)[source]

Split data by rows where all column entries in mask are False.

Parameters:
  • do_mask (bool, default: True) – If True also do selection on mask.

  • **kwargs (Any) – Additional keyword-arguments for splitting data where mask is False.

Return type:

tuple[DataBundle, DataBundle]

Returns:

tuple – First DataBundle including rows where all column entries in mask are False. Second DataBundle including rows where all column entries in mask are True.

See also

DataBundle.split_by_boolean_true

Split data by rows where all entries in mask are True.

DataBundle.split_by_column_entries

Split data by rows where column entries are in a specific value list.

DataBundle.split_by_index

Split data by rows within specific index list.

Notes

For more information see split_by_boolean_false()

Examples

Split DataBundle.

>>> db_false, db_true = db.split_by_boolean_false()
split_by_boolean_true(do_mask=True, **kwargs)[source]

Split data by rows where all column entries in mask are True.

Parameters:
  • do_mask (bool, default: True) – If True also do selection on mask.

  • **kwargs (Any) – Additional keyword-arguments for splitting data where mask is True.

Return type:

tuple[DataBundle, DataBundle]

Returns:

tuple – First DataBundle including rows where all column entries in mask are True. Second DataBundle including rows where all column entries in mask are False.

See also

DataBundle.split_by_boolean_false

Split data by rows where all entries in mask are False.

DataBundle.split_by_column_entries

Split data by rows where column entries are in a specific value list.

DataBundle.split_by_index

Split data by rows within specific index list.

Notes

For more information see split_by_boolean_true()

Examples

Split DataBundle.

>>> db_true, db_false = db.split_by_boolean_true()
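
A plain-Python sketch of the mask-based split may help to picture the result. Rows and mask are modelled as lists of dicts here, and split_by_all_true is a hypothetical helper, not the library implementation. The sketch assumes that the second part is simply the complement of the first; how the package treats rows with mixed True/False flags is not stated in this documentation.

```python
def split_by_all_true(rows, mask):
    """Split rows into (all mask entries True, remaining rows)."""
    selected, rejected = [], []
    for row, flags in zip(rows, mask):
        # A row is selected only if every mask entry of that row is True.
        (selected if all(flags.values()) else rejected).append(row)
    return selected, rejected


rows = [
    {"AT": 12.5, "SST": 14.0},
    {"AT": 99.9, "SST": 13.2},
    {"AT": 11.0, "SST": 12.1},
]
mask = [
    {"AT": True, "SST": True},
    {"AT": False, "SST": True},
    {"AT": True, "SST": True},
]
rows_true, rows_rest = split_by_all_true(rows, mask)
```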
split_by_column_entries(selection, do_mask=True, **kwargs)[source]

Split data by rows where column entries are in a specific value list.

Parameters:
  • selection (dict) – Keys: Column names in data. Values: Specific value list.

  • do_mask (bool, default: True) – If True also do selection on mask.

  • **kwargs (Any) – Additional keyword-arguments for splitting data by column entries.

Return type:

tuple[DataBundle, DataBundle]

Returns:

tuple – First DataBundle including rows where column entries are in a specific value list. Second DataBundle including rows where column entries are not in a specific value list.

See also

DataBundle.split_by_index

Split data by rows within specific index list.

DataBundle.split_by_boolean_true

Split data by rows where all entries in mask are True.

DataBundle.split_by_boolean_false

Split data by rows where all entries in mask are False.

Notes

For more information see split_by_column_entries()

Examples

Split DataBundle.

>>> db_isin, db_isnotin = db.split_by_column_entries(
...     selection={("c1", "B1"): [26, 41]},
... )
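
The selection dictionary maps column labels to lists of accepted values. The sketch below shows one way to read it, using plain Python instead of pandas. split_by_entries is a hypothetical helper, and combining several keys with a logical AND is an assumption that this documentation does not confirm.

```python
def split_by_entries(rows, selection):
    """Split rows by whether each selected column holds an accepted value."""
    isin, isnotin = [], []
    for row in rows:
        # Assumption: every key of the selection has to match (logical AND).
        match = all(
            row.get(column) in accepted for column, accepted in selection.items()
        )
        (isin if match else isnotin).append(row)
    return isin, isnotin


rows = [
    {("c1", "B1"): 26, ("c1", "B2"): "x"},
    {("c1", "B1"): 30, ("c1", "B2"): "y"},
    {("c1", "B1"): 41, ("c1", "B2"): "z"},
]
rows_isin, rows_isnotin = split_by_entries(rows, {("c1", "B1"): [26, 41]})
```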
split_by_index(index, do_mask=True, **kwargs)[source]

Split data by rows within specific index list.

Parameters:
  • index (list of int) – Specific index list.

  • do_mask (bool, default: True) – If True also do selection on mask.

  • **kwargs (Any) – Additional keyword-arguments for splitting data by index.

Return type:

tuple[DataBundle, DataBundle]

Returns:

tuple – First DataBundle including rows within specific index list. Second DataBundle including rows outside specific index list.

See also

DataBundle.split_by_column_entries

Split data by rows where column entries are in a specific value list.

DataBundle.split_by_boolean_true

Split data by rows where all entries in mask are True.

DataBundle.split_by_boolean_false

Split data by rows where all entries in mask are False.

Notes

For more information see split_by_index()

Examples

Split DataBundle.

>>> db_isin, db_isnotin = db.split_by_index([0, 2, 4])
stack_h(other, datasets=('data', 'mask'), inplace=False, **kwargs)[source]

Stack multiple DataBundles horizontally.

Parameters:
  • other (DataBundle or Sequence of DataBundle) – List of other DataBundle to stack horizontally.

  • datasets (str or Sequence of str, default: ("data", "mask")) – List of datasets to be stacked.

  • inplace (bool, default: False) – If True overwrite datasets in DataBundle else return a copy of DataBundle with stacked datasets.

  • **kwargs (Any) – Additional keyword-arguments for stacking DataFrames horizontally.

Return type:

DataBundle | None

Returns:

DataBundle or None – Horizontally stacked DataBundle or None if inplace=True.

See also

DataBundle.stack_v

Stack multiple DataBundles vertically.

Notes

  • This only works with pd.DataFrames, not with iterables of pd.DataFrames!

  • The DataFrames in the DataBundle may have different data columns!

Examples

>>> db = db1.stack_h(db2, datasets=["data", "mask"])
stack_v(other, datasets=('data', 'mask'), inplace=False, **kwargs)[source]

Stack multiple DataBundles vertically.

Parameters:
  • other (DataBundle or Sequence of DataBundle) – List of other DataBundle to stack vertically.

  • datasets (str or Sequence of str, default: ("data", "mask")) – List of datasets to be stacked.

  • inplace (bool, default: False) – If True overwrite datasets in DataBundle else return a copy of DataBundle with stacked datasets.

  • **kwargs (Any) – Additional keyword-arguments for stacking DataFrames vertically.

Return type:

DataBundle | None

Returns:

DataBundle or None – Vertically stacked DataBundle or None if “inplace=True”.

See also

DataBundle.stack_h

Stack multiple DataBundles horizontally.

Notes

  • This only works with pd.DataFrames, not with iterables of pd.DataFrames!

  • The DataFrames in the DataBundle have to have the same data columns!

Examples

>>> db = db1.stack_v(db2, datasets=["data", "mask"])
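
The difference between the two stacking directions can be sketched without pandas. In the following illustration tables are lists of dicts, and the functions toy_stack_v and toy_stack_h are hypothetical helpers. Raising a ValueError on mismatching columns and requiring equal row counts for horizontal stacking are assumptions of the sketch, not documented behaviour.

```python
def toy_stack_v(first, second):
    """Append the rows of second below first; columns have to match."""
    if first and second and set(first[0]) != set(second[0]):
        raise ValueError("Vertical stacking needs identical columns.")
    return first + second


def toy_stack_h(first, second):
    """Join rows side by side; the inputs may have different columns."""
    if len(first) != len(second):
        raise ValueError("Horizontal stacking needs equal row counts.")
    return [{**left, **right} for left, right in zip(first, second)]


data_a = [{"id": 1, "AT": 12.5}, {"id": 2, "AT": 11.0}]
data_b = [{"id": 3, "AT": 9.8}]
extra = [{"SST": 14.0}, {"SST": 12.1}]

stacked_v = toy_stack_v(data_a, data_b)  # three rows, same columns
stacked_h = toy_stack_h(data_a, extra)  # two rows, columns id, AT and SST
```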
unique(**kwargs)[source]

Get unique values of data.

Parameters:

**kwargs (Any) – Additional keyword-arguments for getting unique values.

Return type:

dict[str | tuple[str, str], dict[Any, int]]

Returns:

dict – Dictionary with unique values.

Notes

For more information see unique()

Examples

>>> db.unique(columns=("c1", "B1"))
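
The return type above suggests a mapping from column label to a dictionary of value counts. Assuming that reading, the structure can be sketched in plain Python as follows. unique_counts is a hypothetical helper that takes a list of column labels, which may differ from how the columns keyword of the package is interpreted.

```python
from collections import Counter


def unique_counts(rows, columns=None):
    """Count how often each value occurs, separately for each column."""
    if columns is None:
        # Keep the order in which columns first appear.
        columns = list(dict.fromkeys(column for row in rows for column in row))
    return {
        column: dict(Counter(row[column] for row in rows if column in row))
        for column in columns
    }


rows = [
    {("c1", "B1"): 26, ("c1", "B2"): "N"},
    {("c1", "B1"): 26, ("c1", "B2"): "S"},
    {("c1", "B1"): 41, ("c1", "B2"): "N"},
]
counts = unique_counts(rows, columns=[("c1", "B1")])
all_counts = unique_counts(rows)
```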
validate_datetime(imodel=None, **kwargs)[source]

Validate datetime information in data.

Parameters:
  • imodel (str, optional) – Name of the MDF/CDM data model.

  • **kwargs (Any) – Additional keyword-arguments for validating datetime.

Return type:

DataFrame

Returns:

pd.DataFrame – DataFrame containing True and False values for each index in data. True: All datetime information in the data row is valid. False: At least one piece of datetime information in the data row is invalid.

See also

DataBundle.validate_id

Validate station id information in data.

DataBundle.correct_datetime

Correct datetime information in data.

DataBundle.correct_pt

Correct platform type information in data.

Notes

For more information see validate_datetime()

Examples

>>> val_dt = db.validate_datetime()
validate_id(imodel=None, **kwargs)[source]

Validate station id information in data.

Parameters:
  • imodel (str, optional) – Name of the MDF/CDM data model.

  • **kwargs (Any) – Additional keyword-arguments for validating station id.

Return type:

DataFrame

Returns:

pd.DataFrame – DataFrame containing True and False values for each index in data. True: All station ID information in the data row is valid. False: At least one piece of station ID information in the data row is invalid.

See also

DataBundle.validate_datetime

Validate datetime information in data.

DataBundle.correct_pt

Correct platform type information in data.

DataBundle.correct_datetime

Correct datetime information in data.

Notes

For more information see validate_id()

Examples

>>> val_id = db.validate_id()
write(dtypes=None, parse_dates=None, encoding=None, mode=None, **kwargs)[source]

Write data to disk.

Parameters:
  • dtypes (dict, optional) – Data types of data.

  • parse_dates (list or bool, optional) – Information on how to parse dates in data.

  • encoding (str, optional) – The encoding of the input file. Overrides the value in the imodel schema file.

  • mode ({data, tables}, optional) – Data mode.

  • **kwargs (Any) – Additional keyword-arguments for writing data to disk.

See also

write_data

Write MDF data and validation mask to disk.

write_tables

Write CDM tables to disk.

read

Read original marine-meteorological data as well as MDF data or CDM tables from disk.

read_data

Read MDF data and validation mask from disk.

read_mdf

Read original marine-meteorological data from disk.

read_tables

Read CDM tables from disk.

Return type:

None

Notes

If mode is “data” write data using write_data(). If mode is “tables” write data using write_tables().

Examples

>>> db.write()
class cdm_reader_mapper.DupDetect(data, compared, method, method_kwargs, compare_kwargs)[source]

Bases: object

Class to detect, flag, and remove duplicate entries in a DataFrame using a comparison matrix from recordlinkage.

Parameters:
  • data (pd.DataFrame) – Original dataset.

  • compared (pd.DataFrame) – Comparison matrix of the dataset.

  • method (str) – Duplicate detection method used for recordlinkage indexing.

  • method_kwargs (dict) – Keyword arguments for recordlinkage indexing method.

  • compare_kwargs (dict) – Keyword arguments used for recordlinkage.Compare.

flag_duplicates(keep='first', limit='default', equal_musts=None)[source]

Get result dataset with flagged duplicates.

Parameters:
  • keep (str or int, default: first) – Which entry should be kept in result dataset.

  • limit (str, int or float, optional) – Limit of total score that has to be exceeded to be declared as a duplicate. Defaults to 0.991.

  • equal_musts (str or list, optional) – Hashable of column name(s) that must totally be equal to be declared as a duplicate. Default: All column names found in method_kwargs.

Return type:

DataFrame

Returns:

pd.DataFrame – Input DataFrame with flagged duplicates, including duplicate_status and quality_flag.

get_duplicates(keep='first', limit='default', equal_musts=None, overwrite=True)[source]

Identify duplicate matches based on the comparison matrix.

Parameters:
  • keep (str or int) – Which entry to keep: ‘first’, ‘last’, or -1, 0.

  • limit (str or float, optional, default: default) – Threshold of total similarity score to consider as duplicate.

  • equal_musts (str or list[str], optional) – Columns that must exactly match.

  • overwrite (bool, default: True) – Whether to recompute matches if already calculated.

Return type:

DataFrame

Returns:

pd.DataFrame – DataFrame containing matched duplicates.

remove_duplicates(keep='first', limit='default', equal_musts=None)[source]

Remove duplicate entries from the dataset.

Parameters:
  • keep (str or int) – Which entry to keep (‘first’ or ‘last’).

  • limit (str or float, optional) – Minimum similarity score to declare duplicates.

  • equal_musts (str or list[str], optional) – Columns that must exactly match.

Return type:

DataFrame

Returns:

pd.DataFrame – Dataset without duplicates.
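
The interplay of limit and equal_musts can be pictured with a small plain-Python sketch. The function is_duplicate is hypothetical, and taking the total score as the mean of per-column similarity scores is an assumption made for illustration only. The way the package derives the total score from the recordlinkage comparison matrix is not described here.

```python
def is_duplicate(scores, row_a, row_b, limit=0.991, equal_musts=()):
    """Decide whether two rows count as duplicates of each other."""
    # Assumption: total score is the mean of the per-column similarities.
    total = sum(scores.values()) / len(scores)
    # Columns listed in equal_musts have to match exactly.
    must_match = all(row_a[column] == row_b[column] for column in equal_musts)
    return total > limit and must_match


row_a = {"primary_station_id": "SHIP1", "latitude": 54.10}
row_b = {"primary_station_id": "SHIP1", "latitude": 54.10}
row_c = {"primary_station_id": "SHIP2", "latitude": 54.10}

high_scores = {"latitude": 1.0, "longitude": 1.0, "report_timestamp": 0.98}
low_scores = {"latitude": 1.0, "longitude": 1.0, "report_timestamp": 0.95}

same_station = is_duplicate(
    high_scores, row_a, row_b, equal_musts=["primary_station_id"]
)
other_station = is_duplicate(
    high_scores, row_a, row_c, equal_musts=["primary_station_id"]
)
below_limit = is_duplicate(low_scores, row_a, row_b)
```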

cdm_reader_mapper.correct_datetime(data, imodel, log_level='INFO', base=None)[source]

Apply ICOADS deck specific datetime corrections.

Parameters:
  • data (pandas.DataFrame or Iterable[pd.DataFrame]) – Input dataset.

  • imodel (str) – Name of internally available data model, e.g. icoads_r300_d704.

  • log_level (str, default: INFO) – Level of logging information to save.

  • base (str, optional) – Base path for datetime correction metadata. If None use internal correction path.

Return type:

DataFrame | Iterable[DataFrame]

Returns:

pandas.DataFrame or Iterable[pd.DataFrame] – A pandas.DataFrame or Iterable[pd.DataFrame] with the adjusted data.

Raises:
  • ValueError – If _correct_dt raises an error during correction.

  • TypeError – If data is not a pd.DataFrame or an Iterable[pd.DataFrame]. If data is a pd.Series.

cdm_reader_mapper.correct_pt(data, imodel, log_level='INFO', base=None)[source]

Apply ICOADS deck specific platform ID corrections.

Parameters:
  • data (pandas.DataFrame or Iterable[pd.DataFrame]) – Input dataset.

  • imodel (str) – Name of internally available data model, e.g. icoads_r300_d704.

  • log_level (str, default: INFO) – Level of logging information to save.

  • base (str, optional) – Base path for platform type correction metadata. If None use internal correction path.

Return type:

DataFrame | Iterable[DataFrame]

Returns:

pandas.DataFrame or Iterable[pd.DataFrame] – A pandas.DataFrame or Iterable[pd.DataFrame] with the adjusted data.

Raises:
  • ValueError – If _correct_pt raises an error during correction. If platform column is not defined in properties file.

  • TypeError – If data is not a pd.DataFrame or an Iterable[pd.DataFrame]. If data is a pd.Series.

cdm_reader_mapper.duplicate_check(data, method='SortedNeighbourhood', method_kwargs=None, compare_kwargs=None, table_name=None, ignore_columns=None, ignore_entries=None, offsets=None, reindex_by_null=True, null_label='null')[source]

Run a duplicate check on a dataset using recordlinkage.

Returns a DupDetect object.

Parameters:
  • data (pandas.DataFrame) – Dataset for duplicate check.

  • method (str, default: SortedNeighbourhood) – Duplicate check method for recordlinkage.

  • method_kwargs (dict, optional) – Keyword arguments for recordlinkage duplicate check. Defaults to _method_kwargs.

  • compare_kwargs (dict, optional) – Keyword arguments for recordlinkage.Compare object. Defaults to _compare_kwargs.

  • table_name (str, optional) – Name of the CDM table to be selected from data.

  • ignore_columns (str or list, optional) – Name of data columns to be ignored for duplicate check.

  • ignore_entries (dict, optional) – Key: Column name. Value: value to be ignored. E.g. ignore_entries={“station_speed”: null}.

  • offsets (dict, optional) – Change offsets for recordlinkage Compare object. Key: Column name. Value: new offset. E.g. offsets={“latitude”: 0.1}.

  • reindex_by_null (bool, optional) – If True data is re-indexed in ascending order according to the number of nulls in each row.

  • null_label (str, optional) – Null label which is used if reindex_by_null is True.

Return type:

DupDetect

Returns:

cdm_reader_mapper.DupDetect – A DupDetect instance.
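
The reindex_by_null option orders rows by how many null entries they contain. A minimal plain-Python sketch of that ordering follows. sort_by_null_count is a hypothetical helper operating on a list of dicts, whereas the package works on a pandas.DataFrame.

```python
def sort_by_null_count(rows, null_label="null"):
    """Order rows by ascending number of null entries (stable sort)."""
    return sorted(
        rows,
        key=lambda row: sum(value == null_label for value in row.values()),
    )


rows = [
    {"station_course": "null", "station_speed": "null", "latitude": 54.1},
    {"station_course": 180, "station_speed": 12, "latitude": 54.1},
    {"station_course": 180, "station_speed": "null", "latitude": 54.1},
]
ordered = sort_by_null_count(rows)
null_counts = [sum(v == "null" for v in row.values()) for row in ordered]
```

One plausible effect of such an ordering is that the most complete report of a duplicate group comes first, which matters when keep='first' is used later on.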

cdm_reader_mapper.map_model(data, imodel, cdm_subset=None, codes_subset=None, cdm_complete=True, drop_missing_obs=True, drop_duplicates=True, log_level='INFO')[source]

Map a pandas DataFrame to the CDM header and observational tables.

Parameters:
  • data (pandas.DataFrame or Iterable[pd.DataFrame]) – Input data to map.

  • imodel (str) – Name of a specific mapping from a generic data model to the CDM, such as the mapping of a SID-DCK from IMMA1’s core and attachments, e.g. icoads_r300_d704.

  • cdm_subset (str or list, optional) – Subset of CDM model tables to map. Defaults to the full set of CDM tables defined for the imodel.

  • codes_subset (str or list, optional) – Subset of code mapping tables to map. Defaults to the full set of code mapping tables defined for the imodel.

  • cdm_complete (bool, default: True) – If True map entire CDM tables list.

  • drop_missing_obs (bool, default: True) – If True drop observations without a valid observation value, e.g. no air_temperature value.

  • drop_duplicates (bool, default: True) – If True drop duplicated rows.

  • log_level (str, default: INFO) – Level of logging information to save.

Return type:

DataFrame | ParquetStreamReader

Returns:

cdm_tables (pandas.DataFrame) – DataFrame with MultiIndex columns (cdm_table, column_name).

Raises:
  • ValueError – If imodel is not defined. If first split entry (‘_’) of imodel is not defined. If mapping does not return a DataFrame.

  • TypeError – If type of imodel is not supported. If anything during mapping fails.

cdm_reader_mapper.read(source, mode='mdf', **kwargs)[source]

Read either original marine-meteorological data or MDF data or CDM tables from disk.

Parameters:
  • source (str) – Source of the input data.

  • mode (str, {mdf, data, tables}, default: mdf) –

    Read data mode:

    • “mdf” to read original marine-meteorological data from disk and convert them to MDF data

    • “data” to read MDF data from disk

    • “tables” to read CDM tables from disk. Map MDF data to CDM tables with DataBundle.map_model().

  • **kwargs (Any) – Additional keyword-arguments passed to reader function.

Return type:

DataBundle

Returns:

DataBundle – Containing read data as pd.DataFrame or Iterable of pd.DataFrames.

See also

read_mdf

Read original marine-meteorological data from disk.

read_data

Read MDF data and validation mask from disk.

read_tables

Read CDM tables from disk.

write

Write either MDF data or CDM tables to disk.

write_data

Write MDF data and validation mask to disk.

write_tables

Write CDM tables to disk.

Notes

kwargs are the keyword arguments for the specific mode reader.
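
Examples

The mode argument selects one of three readers and the remaining keyword arguments are forwarded to it. The sketch below illustrates only that dispatch pattern. The reader stand-ins and the name read_sketch are hypothetical, and raising ValueError for an unknown mode is an assumption rather than documented behaviour.

```python
def _make_reader(name):
    """Build a hypothetical stand-in reader that only echoes its arguments."""

    def reader(source, **kwargs):
        return {"reader": name, "source": source, "kwargs": kwargs}

    return reader


# One stand-in per documented mode.
READERS = {
    "mdf": _make_reader("read_mdf"),
    "data": _make_reader("read_data"),
    "tables": _make_reader("read_tables"),
}


def read_sketch(source, mode="mdf", **kwargs):
    """Pick the reader for ``mode`` and forward all keyword arguments to it."""
    if mode not in READERS:
        # Assumption: the real function may report an unknown mode differently.
        raise ValueError(f"mode must be one of {sorted(READERS)}, got {mode!r}")
    return READERS[mode](source, **kwargs)


result = read_sketch("path_to_files", mode="tables", cdm_subset=["header"])
```

Because the keyword arguments are passed through unchanged, only arguments accepted by the selected reader are meaningful for a given mode.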

cdm_reader_mapper.read_data(data_file, mask_file=None, info_file=None, data_format='parquet', imodel=None, col_subset=None, encoding=None, delimiter=None, **kwargs)[source]

Read MDF data that already conform to a pre-defined data model.

Parameters:
  • data_file (str) – The data file (including path) to be read.

  • mask_file (str, optional) – The validation file (including path) to be read.

  • info_file (str, optional) – The information file (including path) to be read.

  • data_format ({"csv", "parquet", "feather"}, default: "parquet") – Format of input data file(s).

  • imodel (str, optional) – Name of internally available input data model, e.g. icoads_r300_d704.

  • col_subset (str, tuple or list, optional) – Specify the section or sections of the file to read.

    • For multiple sections of the tables: e.g. col_subset = [columns0,…,columnsN]

    • For a single section: e.g. a list-type object col_subset = [columns]

    Column labels can be either strings or tuples.

  • encoding (str, optional) – The encoding of the input file. Overrides the value in the imodel schema file.

  • delimiter (str, optional) – The delimiter used in the input file. Overrides the value in the imodel schema file.

  • **kwargs (Any) – Keyword arguments that will be passed to the read function.

Return type:

DataBundle

Returns:

cdm_reader_mapper.DataBundle – DataBundle containing MDF data.

See also

read

Read either original marine-meteorological data or MDF data or CDM tables from disk.

read_mdf

Read original marine-meteorological data from disk.

read_tables

Read CDM tables from disk.

write

Write either MDF data or CDM tables to disk.

write_data

Write MDF data and validation mask to disk.

write_tables

Write CDM tables to disk.

cdm_reader_mapper.read_mdf(source, imodel=None, ext_schema_path=None, ext_schema_file=None, ext_table_path=None, year_init=None, year_end=None, encoding=None, chunksize=None, skiprows=None, convert_flag=True, converter_dict=None, converter_kwargs=None, decode_flag=True, decoder_dict=None, validate_flag=True, sections=None, excludes=None, pd_kwargs=None, xr_kwargs=None)[source]

Read data files compliant with a user-specific data model.

Reads a data file to a pandas DataFrame using a pre-defined data model. The data read are validated against this data model, producing a boolean mask on output.

The data model needs to be input to the module as a named model (included in the module) or as the path to a valid data model.

Parameters:
  • source (str) – The file (including path) to be read.

  • imodel (str) – Name of internally available input data model, e.g. icoads_r300_d704.

  • ext_schema_path (str or Path-like, optional) – The path to the external input data model schema file. The schema file must have the same name as the directory. One of imodel, ext_schema_path or ext_schema_file must be set.

  • ext_schema_file (str or Path-like, optional) – The external input data model schema file. One of imodel, ext_schema_path or ext_schema_file must be set.

  • ext_table_path (str or Path-like, optional) – The path to the external table file. The table file must have the same name as the directory.

  • year_init (str or int, optional) – Left border of time axis.

  • year_end (str or int, optional) – Right border of time axis.

  • encoding (str, optional) – The encoding of the input file. Overrides the value in the imodel schema file.

  • chunksize (int, optional) – Number of reports per chunk.

  • skiprows (int, optional) – Number of initial rows to skip from file.

  • convert_flag (bool, default: True) – If True convert entries by using a pre-defined data model.

  • converter_dict (dict of {Hashable: func}, optional) – Functions for converting values in specific columns. If None use information from a pre-defined data model.

  • converter_kwargs (dict of {Hashable: kwargs}, optional) – Key-word arguments for converting values in specific columns. If None use information from a pre-defined data model.

  • decode_flag (bool, default: True) – If True decode entries by using a pre-defined data model.

  • decoder_dict (dict of {Hashable: func}, optional) – Functions for decoding values in specific columns. If None use information from a pre-defined data model.

  • validate_flag (bool, default: True) – Validate data entries by using a pre-defined data model.

  • sections (list, optional) – List with subset of data model sections to output. If None read pre-defined data model sections.

  • excludes (str or list of str, optional) – MDF Sections to exclude.

  • pd_kwargs (dict, optional) – Additional pandas arguments.

  • xr_kwargs (dict, optional) – Additional xarray arguments.

Return type:

DataBundle

Returns:

cdm_reader_mapper.DataBundle – DataBundle containing MDF data.

See also

read

Read either original marine-meteorological data or MDF data or CDM tables from disk.

read_data

Read MDF data and validation mask from disk.

read_tables

Read CDM tables from disk.

write

Write either MDF data or CDM tables to disk.

write_data

Write MDF data and validation mask to disk.

write_tables

Write CDM tables to disk.
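
Examples

Validation is described as producing a boolean mask aligned with the data. The sketch below shows what such a mask looks like, using plain pandas and hypothetical valid ranges. It is not the validation logic of read_mdf, which takes its rules from the data model, and the column names and limits are assumptions.

```python
import pandas as pd

# Illustrative data; column names and values are made up.
data = pd.DataFrame({"AT": [12.4, 99.9, -3.0], "SST": [15.1, 14.8, 60.0]})

# Hypothetical valid ranges standing in for rules defined by a data model.
valid_range = {"AT": (-90.0, 60.0), "SST": (-5.0, 45.0)}

# One boolean column per data column: True where the value lies in its range
# (Series.between is inclusive on both ends by default).
mask = pd.DataFrame(
    {col: data[col].between(low, high) for col, (low, high) in valid_range.items()}
)
```

The mask has the same shape, index and column labels as the data, so the two can be passed together to functions that expect an aligned pair.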

cdm_reader_mapper.read_tables(source, data_format='parquet', prefix=None, suffix=None, extension=None, separator='-', cdm_subset=None, col_subset=None, delimiter='|', na_values=None, null_label='null', from_str=None, to_str=None, imodel=None, **kwargs)[source]

Read CDM-table-like files from file system to a pandas.DataFrame.

Parameters:
  • source (str) – The file (including path) or the path to the file(s) to be read.

  • data_format ({"csv", "parquet", "feather"}, default: "parquet") – Format of input data file(s).

  • prefix (str, optional) – Prefix of file name structure: <prefix>-<table>-*<suffix>.<extension>. Can be used if source is a valid directory path.

  • suffix (str, optional) – Suffix of file name structure: <prefix>-<table>-*<suffix>.<extension>. Can be used if source is a valid directory path.

  • extension (str, optional) – Extension of file name structure: <prefix>-<table>-*<suffix>.<extension>. Can be used if source is a valid directory path.

  • separator (str, default: -) – Separator to join the file name pattern components.

  • cdm_subset (str or list, optional) – Specifies a subset of tables or a single table.

    • For multiple subsets of tables: This function returns a pandas.DataFrame that is multi-index at the columns, with (table-name, field) as column names. Tables are merged via the report_id field.

    • For a single table: This function returns a pandas.DataFrame with a simple indexing for the columns.

    Required if source is a valid file name.

  • col_subset (str, list or dict, optional) – Specify the section or sections of the file to read.

    • For multiple sections of the tables: e.g. col_subset = {table0:[columns0],…tableN:[columnsN]}

    • For a single section: e.g. a list-type object col_subset = [columns]. This assumes that all column names conform to the CDM field names.

  • delimiter (str, default: |) – Character or regex pattern to treat as the delimiter while reading with pandas.read_csv.

  • na_values (hashable, iterable of hashable or dict of {Hashable: Iterable}, optional) – Additional strings to recognize as Na/NaN while reading input file with pandas.read_csv. For more details see: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

  • null_label (str, default: null) – String used to label invalid values in the data.

  • from_str (bool, optional) – If True convert original string data to imodel-specific data types.

  • to_str (bool, optional) – If True convert original imodel-specific data types to strings.

  • imodel (str, optional) – Name of data model, e.g. icoads. Must be set if either from_str or to_str is set.

  • **kwargs (Any) – Additional keyword arguments passed to the data reader.

Return type:

DataBundle

Returns:

cdm_reader_mapper.DataBundle – DataBundle instance containing successfully read CDM table(s).

See also

read

Read either original marine-meteorological data or MDF data or CDM tables from disk.

read_data

Read MDF data and validation mask from disk.

read_mdf

Read original marine-meteorological data from disk.

write

Write either MDF data or CDM tables to disk.

write_tables

Write CDM tables to disk.

write_data

Write MDF data and validation mask to disk.
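
Examples

The file name structure <prefix>-<table>-*<suffix>.<extension> can be read as a glob pattern. The sketch below builds such a pattern and matches it against candidate names. The helper table_pattern is hypothetical, and the handling of a missing prefix or suffix is an assumption, not taken from the library.

```python
from fnmatch import fnmatchcase


def table_pattern(table, extension, prefix=None, suffix=None, separator="-"):
    """Build a glob pattern following <prefix>-<table>-*<suffix>.<extension>."""
    # Assumption: a missing prefix is simply left out of the pattern.
    parts = [prefix, table] if prefix else [table]
    stem = separator.join(parts) + separator + "*" + (suffix or "")
    return f"{stem}.{extension}"


pattern = table_pattern("header", "csv", prefix="icoads", suffix="r300")

# Illustrative file names; only the first one follows the pattern.
files = [
    "icoads-header-2010-01-r300.csv",
    "icoads-observations-at-2010-01-r300.csv",
    "icoads-header-2010-01-r300.parquet",
]
matches = [name for name in files if fnmatchcase(name, pattern)]
```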

cdm_reader_mapper.replace_columns(df_l, df_r, pivot_c=None, pivot_l=None, pivot_r=None, rep_c=None, rep_map=None)[source]

Replace columns in one DataFrame using row-matching from another.

This function works for both a pd.DataFrame and any Iterable of pandas DataFrames.

Parameters:
  • df_l (pandas.DataFrame or Iterable[pd.DataFrame]) – The left DataFrame whose columns will be replaced.

  • df_r (pandas.DataFrame or Iterable[pd.DataFrame]) – The right DataFrame providing replacement values.

  • pivot_c (str, optional) – A single pivot column present in both DataFrames. Overrides pivot_l and pivot_r.

  • pivot_l (str, optional) – Pivot column in df_l. Used only when pivot_c is not supplied.

  • pivot_r (str, optional) – Pivot column in df_r. Used only when pivot_c is not supplied.

  • rep_c (str or list of str, optional) – One or more column names to replace in df_l. Ignored if rep_map is supplied.

  • rep_map (dict, optional) – Mapping between left and right column names as {left_col: right_col}.

Returns:

pd.DataFrame or ParquetStreamReader – Updated data with replacements applied.

Raises:
  • TypeError – If df_l or df_r is not a pandas DataFrame.

  • ValueError

    • If one of pivot_l and pivot_r is not defined.

    • If neither rep_map nor rep_c is defined.

    • If replacement source columns are not found in df_r.

Notes

This function logs errors and returns None instead of raising exceptions.
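
Examples

The pivot and rep_map parameters describe a row-matched column replacement. The sketch below shows one way to express that idea in plain pandas. The function replace_sketch is hypothetical, and keeping the original value where df_r has no matching row is an assumption about the intended behaviour.

```python
import pandas as pd

# Illustrative frames; column names and values are made up.
df_l = pd.DataFrame(
    {"report_id": ["a", "b", "c"], "station": ["old1", "old2", "old3"]}
)
df_r = pd.DataFrame({"id": ["a", "c"], "name": ["new1", "new3"]})


def replace_sketch(df_l, df_r, pivot_l, pivot_r, rep_map):
    """Replace columns of df_l with row-matched values from df_r."""
    out = df_l.copy()
    # Index the right frame by its pivot column so it can serve as a lookup.
    lookup = df_r.set_index(pivot_r)
    for left_col, right_col in rep_map.items():
        new_values = out[pivot_l].map(lookup[right_col])
        # Assumption: rows without a match in df_r keep their original value.
        out[left_col] = new_values.fillna(out[left_col])
    return out


result = replace_sketch(
    df_l, df_r, pivot_l="report_id", pivot_r="id", rep_map={"station": "name"}
)
```

The lookup requires the pivot column in df_r to hold unique values.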

cdm_reader_mapper.split_by_boolean(data, mask, boolean, reset_index=False, inverse=False, return_rejected=False)[source]

Split a DataFrame or an Iterable of DataFrames using a boolean mask via split_dataframe_by_boolean.

Parameters:
  • data (pd.DataFrame or iterable of pd.DataFrame) – DataFrame to be split.

  • mask (pd.DataFrame or iterable of pd.DataFrame) – Boolean mask with the same length as data.

  • boolean (bool) – Determines mask interpretation:

    • If True select rows where all mask columns are True.

    • If False select rows where any mask column is False.

  • reset_index (bool, optional) – If True reset the index of returned DataFrames.

  • inverse (bool, optional) – If True invert the selection performed by the underlying function.

  • return_rejected (bool, optional) – If True return rejected rows as the second output. If False the rejected output is empty but dtype-preserving.

Return type:

tuple[DataFrame | ParquetStreamReader, DataFrame | ParquetStreamReader, Index | MultiIndex, Index | MultiIndex]

Returns:

tuple of pd.DataFrame or ParquetStreamReader and pd.DataFrame or ParquetStreamReader and pd.Index or pd.MultiIndex and pd.Index or pd.MultiIndex – Selected rows (all mask columns True), rejected rows, original indexes of selection and original indexes of rejection.
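
Examples

The boolean parameter decides whether rows with an all-True mask or rows with at least one False are selected. The sketch below reproduces that rule for a single in-memory DataFrame with plain pandas. The function split_sketch is hypothetical and always returns the rejected rows, which corresponds to return_rejected=True.

```python
import pandas as pd

# Illustrative data and an aligned validation mask.
data = pd.DataFrame(
    {"AT": [12.4, 99.9, -3.0], "SST": [15.1, 14.8, 60.0]},
    index=[10, 11, 12],
)
mask = pd.DataFrame(
    {"AT": [True, False, True], "SST": [True, True, True]},
    index=data.index,
)


def split_sketch(data, mask, boolean):
    """Return (selected, rejected, selected index, rejected index)."""
    all_true = mask.all(axis=1)
    # boolean=True keeps rows where every mask column is True;
    # boolean=False keeps rows where at least one mask column is False.
    rows = all_true if boolean else ~all_true
    selected, rejected = data[rows], data[~rows]
    return selected, rejected, selected.index, rejected.index


passed, failed, passed_idx, failed_idx = split_sketch(data, mask, boolean=True)
```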

cdm_reader_mapper.split_by_boolean_false(data, mask, reset_index=False, inverse=False, return_rejected=False)[source]

Split a DataFrame or an Iterable of DataFrames where boolean mask is False.

Parameters:
  • data (pd.DataFrame or Iterable[pd.DataFrame]) – DataFrame to be split.

  • mask (pd.DataFrame or Iterable[pd.DataFrame]) – Boolean mask with the same length as data.

  • reset_index (bool, optional) – If True reset indices in returned DataFrames.

  • inverse (bool, optional) – If True invert the selection.

  • return_rejected (bool, optional) – If True return rejected rows as the second output. If False the rejected output is empty but dtype-preserving.

Return type:

tuple[DataFrame | ParquetStreamReader, DataFrame | ParquetStreamReader, Index | MultiIndex, Index | MultiIndex]

Returns:

tuple of pd.DataFrame or ParquetStreamReader and pd.DataFrame or ParquetStreamReader and pd.Index or pd.MultiIndex and pd.Index or pd.MultiIndex – Selected rows (any mask column False), rejected rows, original indexes of selection and original indexes of rejection.

cdm_reader_mapper.split_by_boolean_true(data, mask, reset_index=False, inverse=False, return_rejected=False)[source]

Split a DataFrame or an Iterable of DataFrames where boolean mask is True.

Parameters:
  • data (pd.DataFrame or iterable of pd.DataFrame) – DataFrame to be split.

  • mask (pd.DataFrame or iterable of pd.DataFrame) – Boolean mask with the same length as data.

  • reset_index (bool, optional) – If True reset indices in returned DataFrames.

  • inverse (bool, optional) – If True invert the selection.

  • return_rejected (bool, optional) – If True return rejected rows as the second output. If False the rejected output is empty but dtype-preserving.

Return type:

tuple[DataFrame | ParquetStreamReader, DataFrame | ParquetStreamReader, Index | MultiIndex, Index | MultiIndex]

Returns:

tuple of pd.DataFrame or ParquetStreamReader and pd.DataFrame or ParquetStreamReader and pd.Index or pd.MultiIndex and pd.Index or pd.MultiIndex – Selected rows (all mask columns True), rejected rows, original indexes of selection and original indexes of rejection.

cdm_reader_mapper.split_by_column_entries(data, selection, reset_index=False, inverse=False, return_rejected=False)[source]

Split a DataFrame or an Iterable of DataFrames based on matching values in a given column.

Parameters:
  • data (pd.DataFrame or iterable of pd.DataFrame) – DataFrame to be split.

  • selection (dict) – Mapping of a column name to an iterable of allowed values. Example: {“city”: [“London”, “Berlin”]}.

  • reset_index (bool, optional) – Whether to reset index in returned DataFrames.

  • inverse (bool, optional) – If True invert the selection.

  • return_rejected (bool, optional) – If True return rejected rows as the second output. If False the rejected output is empty but dtype-preserving.

Return type:

tuple[DataFrame | ParquetStreamReader, DataFrame | ParquetStreamReader, Index | MultiIndex, Index | MultiIndex]

Returns:

tuple of pd.DataFrame or ParquetStreamReader and pd.DataFrame or ParquetStreamReader and pd.Index or pd.MultiIndex and pd.Index or pd.MultiIndex – Selected rows, rejected rows, original indexes of selection and original indexes of rejection.
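
Examples

The selection argument maps a column name to the values that should be kept. The sketch below shows the equivalent row selection in plain pandas for a single DataFrame. It is an illustration of the rule, not the library's implementation, and the data values are made up.

```python
import pandas as pd

data = pd.DataFrame(
    {"city": ["London", "Paris", "Berlin", "Rome"], "value": [1, 2, 3, 4]}
)
selection = {"city": ["London", "Berlin"]}

# Take the single column/values pair from the selection mapping.
column, allowed = next(iter(selection.items()))

# Rows whose entry in that column is one of the allowed values.
rows = data[column].isin(allowed)
selected = data[rows]
rejected = data[~rows]
```

Swapping selected and rejected gives the result one would expect from inverse=True.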

cdm_reader_mapper.split_by_index(data, index, reset_index=False, inverse=False, return_rejected=False)[source]

Split a DataFrame or an Iterable of DataFrames by selecting specific index labels.

Parameters:
  • data (pd.DataFrame or iterable of DataFrame) – DataFrame to be split.

  • index (label or sequence of labels) – Index values to select.

  • reset_index (bool, optional) – If True reset index in returned DataFrames.

  • inverse (bool, optional) – If True select rows not in index.

  • return_rejected (bool, optional) – If True return rejected rows as the second output. If False the rejected output is empty but dtype-preserving.

Return type:

tuple[DataFrame | ParquetStreamReader, DataFrame | ParquetStreamReader, Index | MultiIndex, Index | MultiIndex]

Returns:

tuple of pd.DataFrame or ParquetStreamReader and pd.DataFrame or ParquetStreamReader and pd.Index or pd.MultiIndex and pd.Index or pd.MultiIndex – Selected rows, rejected rows, original indexes of selection and original indexes of rejection.
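
Examples

Selection by index labels returns the selected rows together with their original index, which matters when reset_index is requested. The sketch below shows the idea in plain pandas. Dropping the old index on reset is an assumption about what reset_index=True does.

```python
import pandas as pd

data = pd.DataFrame({"value": [10, 20, 30, 40]}, index=[100, 101, 102, 103])
index = [101, 103]

# Boolean array marking the rows whose label is in the requested index.
rows = data.index.isin(index)
selected = data[rows]

# Keep the original labels before renumbering the rows from zero.
original_index = selected.index
selected_reset = selected.reset_index(drop=True)
```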

cdm_reader_mapper.unique(data, columns=None)

Count unique values per column in a DataFrame or an Iterable of DataFrames.

Parameters:
  • data (pandas.DataFrame or Iterable[pd.DataFrame]) – Input dataset.

  • columns (str, list or tuple, optional) – Name(s) of the data column(s) to be selected. If None, all columns are used.

Return type:

dict[str | tuple[str, str], dict[Any, int]]

Returns:

dict[str | tuple[str, str], dict[Any, int]] – Dictionary where each key is a column name, and each value is a dictionary mapping unique values (including NaN as ‘nan’) to their counts.

Notes

  • Works with large files via ParquetStreamReader by iterating through chunks.
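
Examples

Counting over chunks means accumulating per-column counts across several DataFrames, with missing values reported under the key 'nan'. The sketch below shows one way to do that with pandas and collections.Counter. It is not the library's implementation, and the column name and values are illustrative.

```python
from collections import Counter

import pandas as pd

# Two illustrative chunks of the same dataset.
chunks = [
    pd.DataFrame({"deck": ["704", "704", None]}),
    pd.DataFrame({"deck": ["201", "704"]}),
]

counts = {}
for chunk in chunks:
    for col in chunk.columns:
        counter = counts.setdefault(col, Counter())
        # dropna=False keeps missing values in the tally.
        for value, n in chunk[col].value_counts(dropna=False).items():
            key = "nan" if pd.isna(value) else value
            counter[key] += int(n)

# Convert the counters to plain dictionaries for the final result.
result = {col: dict(counter) for col, counter in counts.items()}
```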

cdm_reader_mapper.validate_datetime(data, imodel, blank=False, log_level='INFO')[source]

Validate datetime columns in a dataset according to the specified model.

Parameters:
  • data (pd.DataFrame, pd.Series, or Iterable[pd.DataFrame, pd.Series]) – Input dataset or series containing datetime values.

  • imodel (str) – Name of internally available data model, e.g., “icoads_r300_d201”.

  • blank (bool, optional) – If True, empty values are considered valid. Default is False.

  • log_level (str, optional) – Logging level. Default is “INFO”.

Return type:

Series

Returns:

pd.Series or None – Boolean Series indicating whether each datetime is valid. Returns None if validation cannot be performed due to missing data, columns, or deck definitions.

Raises:
  • TypeError – If data is not a pd.DataFrame or a pd.Series or an Iterable[pd.DataFrame | pd.Series].

  • ValueError – If no columns found for datetime conversion.
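
Examples

The check asks whether the date and time components of a report form a real datetime, with blank controlling how missing components are treated. The sketch below illustrates this with the standard library only. The helper datetime_is_valid and the tuple layout (year, month, day, hour) are hypothetical and do not reflect how the library locates or converts datetime columns.

```python
from datetime import datetime


def datetime_is_valid(year, month, day, hour=0, blank=False):
    """Return True if the components form a real datetime."""
    parts = (year, month, day, hour)
    if any(part is None for part in parts):
        # Missing components count as valid only when blank=True.
        return blank
    try:
        datetime(year, month, day, hour)
    except ValueError:
        return False
    return True


# Illustrative reports as (year, month, day, hour) tuples.
reports = [(2010, 2, 28, 12), (2010, 2, 30, 12), (2010, 13, 1, 0)]
results = [datetime_is_valid(*report) for report in reports]
```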

cdm_reader_mapper.validate_id(data, imodel, blank=False, log_level='INFO')[source]

Validate ID column(s) in a dataset against deck-specific patterns.

Parameters:
  • data (pd.DataFrame, pd.Series, or Iterable[pd.DataFrame, pd.Series]) – Input dataset or series containing ID values.

  • imodel (str) – Name of internally available data model, e.g., “icoads_r300_d201”.

  • blank (bool, optional) – If True, empty values are considered valid. Default is False.

  • log_level (str, optional) – Logging level. Default is “INFO”.

Return type:

Series | None

Returns:

pd.Series or None – Boolean Series indicating whether each ID is valid. Returns None if validation cannot be performed due to missing data, columns, or deck definitions.

Raises:
  • TypeError – If data is not a pd.DataFrame or a pd.Series or an Iterable[pd.DataFrame | pd.Series].

  • ValueError – If dataset imodel has no deck information. If no ID conversion columns found. If input deck is not defined in ID library files.

  • FileNotFoundError – If dataset imodel has no ID deck library.

Notes

  • Uses _get_id_col to determine which column(s) contain IDs.

  • Uses _get_patterns to get regex patterns for the deck.

  • Empty values match “^$” pattern if blank=True.
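
Examples

According to the notes above, an ID is checked against deck-specific regular expressions, and the empty pattern “^$” is accepted as well when blank=True. The sketch below shows that matching rule with the standard re module. The patterns and the helper id_is_valid are hypothetical and are not taken from the ID library files.

```python
import re

# Hypothetical deck patterns: four capital letters or five digits.
patterns = [r"^[A-Z]{4}$", r"^\d{5}$"]


def id_is_valid(value, patterns, blank=False):
    """Return True if the value matches at least one pattern."""
    checks = list(patterns)
    if blank:
        # The empty-string pattern makes blank IDs count as valid.
        checks.append(r"^$")
    return any(re.match(pattern, value) for pattern in checks)


ids = ["DBBH", "12345", "x1", ""]
strict = [id_is_valid(value, patterns) for value in ids]
lenient = [id_is_valid(value, patterns, blank=True) for value in ids]
```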

cdm_reader_mapper.write(data, mode='data', **kwargs)[source]

Write either MDF data or CDM tables to disk.

Parameters:
  • data (pandas.DataFrame or Iterable[pd.DataFrame]) – Data to export.

  • mode (str, {data, tables}, default: data) –

    Write data mode:

    • “data” to write MDF data to disk

    • “tables” to write CDM tables to disk. Map MDF data to CDM tables with DataBundle.map_model().

  • **kwargs (Any) – Additional keyword arguments used to write data to disk.

See also

write_data

Write MDF data and validation mask to disk.

write_tables

Write CDM tables to disk.

read

Read either original marine-meteorological data or MDF data or CDM tables from disk.

read_mdf

Read original marine-meteorological data from disk.

read_data

Read MDF data and validation mask from disk.

read_tables

Read CDM tables from disk.

Return type:

None

Notes

kwargs are the keyword arguments for the specific mode writer.

cdm_reader_mapper.write_data(data, mask=None, data_format='parquet', dtypes=None, parse_dates=False, encoding='utf-8', out_dir='.', prefix=None, suffix=None, extension=None, filename=None, separator='_', col_subset=None, delimiter=',', **kwargs)[source]

Write pandas.DataFrame to MDF file on file system.

Parameters:
  • data (pandas.DataFrame or Iterable[pd.DataFrame]) – Data to export.

  • mask (pandas.DataFrame or Iterable[pd.DataFrame], optional) – Validation mask to export.

  • data_format ({"csv", "parquet", "feather"}, default: "parquet") – Format of output data file(s).

  • dtypes (dict, optional) – Dictionary of data types on data. Dump dtypes and parse_dates to json information file.

  • parse_dates (list | bool, default: False) – Information of how to parse dates in data. Dump dtypes and parse_dates to json information file. For more information see pandas.read_csv().

  • encoding (str, default: "utf-8") – A string representing the encoding to use in the output file, defaults to utf-8.

  • out_dir (str, default: ".") – Path to the output directory.

  • prefix (str, optional) – Prefix of file name structure: <prefix>-data-*<suffix>.<extension>.

  • suffix (str, optional) – Suffix of file name structure: <prefix>-data-*<suffix>.<extension>.

  • extension (str, optional) – Extension of file name structure: <prefix>-data-*<suffix>.<extension>. By default, extension depends on data_format.

  • filename (str or dict, optional) – Name(s) of the output file(s). Provide one filename each for data and mask ({“data”:<filenameD>, “mask”:<filenameM>}). By default, the file name is created automatically from table name, prefix and suffix.

  • separator (str, optional) – Separator to join the file name pattern components (default “_”).

  • col_subset (str, tuple or list, optional) – Specify the section or sections of the file to write.

    • For multiple sections of the tables: e.g. col_subset = [columns0,…,columnsN]

    • For a single section: e.g. a list-type object col_subset = [columns]

    Column labels can be either strings or tuples.

  • delimiter (str, default: ",") – Character to use as the field delimiter when writing with df.to_csv.

  • **kwargs (Any) – Additional keyword-arguments passed to to_csv when data_format is ‘csv’.

Raises:

ValueError – If data_format is not one of ‘csv’, ‘parquet’ or ‘feather’. If type of data and type of mask do not match.

See also

write

Write either MDF data or CDM tables to disk.

write_tables

Write CDM tables to disk.

read

Read either original marine-meteorological data or MDF data or CDM tables from disk.

read_data

Read MDF data and validation mask from disk.

read_mdf

Read original marine-meteorological data from disk.

read_tables

Read CDM tables from disk.

Notes

Use this function after reading MDF data.

Return type:

None

cdm_reader_mapper.write_tables(data, data_format='parquet', out_dir=None, prefix=None, suffix=None, extension=None, filename=None, separator='-', cdm_subset=None, col_subset=None, delimiter='|', encoding='utf-8', from_str=None, to_str=None, imodel=None, **kwargs)[source]

Write pandas.DataFrame to CDM-table file on file system.

Parameters:
  • data (pandas.DataFrame) – Data to export.

  • data_format ({"csv", "parquet", "feather"}, default: "parquet") – Format of output data file(s).

  • out_dir (str, optional) – Path to the output directory. Defaults to current directory.

  • prefix (str, optional) – Prefix of file name structure: <prefix><separator><table><separator>*<suffix>.<extension>.

  • suffix (str, optional) – Suffix of file name structure: <prefix><separator><table><separator>*<suffix>.<extension>.

  • extension (str, optional) – Extension of file name structure: <prefix><separator><table><separator>*<suffix>.<extension>.

  • filename (str, Path-like or dict, optional) – Name(s) of the output file(s). Provide one filename for each table name in data ({<table>:<filename>}). If None, the file name is created automatically from table name, prefix and suffix.

  • separator (str, optional) – Separator of file name structure: <prefix><separator><table><separator>*<suffix>.<extension>.

  • cdm_subset (str or list of str, optional) – Specifies a subset of tables or a single table.

    • For multiple subsets of tables: This function returns a pandas.DataFrame that is multi-index at the columns, with (table-name, field) as column names. Tables are merged via the report_id field.

    • For a single table: This function returns a pandas.DataFrame with a simple indexing for the columns.

  • col_subset (str, list or dict, optional) – Specify the section or sections of the file to write.

    • For multiple sections of the tables: e.g. col_subset = {table0:[columns0],…tableN:[columnsN]}

    • For a single section: e.g. a list-type object col_subset = [columns]. This assumes that all column names conform to the CDM field names.

  • delimiter (str, default: "|") – Character to use as the field delimiter when writing with df.to_csv. This is only relevant if data_format is “csv”.

  • encoding (str) – A string representing the encoding to use in the output file, defaults to utf-8. This is only relevant if data_format is “csv”.

  • from_str (bool, optional) – If True convert original string data to imodel-specific data types.

  • to_str (bool, optional) – If True convert original imodel-specific data types to strings.

  • imodel (str, optional) – Name of data model, e.g. icoads. Must be set if either from_str or to_str is set.

  • **kwargs (Any) – Additional keyword-arguments that will be ignored.

See also

write

Write either MDF data or CDM tables to disk.

write_data

Write MDF data and validation mask to disk.

read

Read either original marine-meteorological data or MDF data or CDM tables from disk.

read_tables

Read CDM tables from disk.

read_data

Read MDF data and validation mask from disk.

read_mdf

Read original marine-meteorological data from disk.

Return type:

None

Notes

  • Use this function after reading CDM tables.

  • kwargs will be ignored!
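
Examples

With data_format set to “csv”, the delimiter defaults to “|”. The sketch below shows what pipe-delimited output of a single table looks like, using pandas' own to_csv on an in-memory buffer. It does not call write_tables. The table content is made up, and omitting the row index is an assumption.

```python
import io

import pandas as pd

# Illustrative single CDM-like table.
header = pd.DataFrame({"report_id": ["r1", "r2"], "latitude": [52.5, 48.1]})

# Write to an in-memory buffer with "|" as the field delimiter.
buffer = io.StringIO()
header.to_csv(buffer, sep="|", index=False)

# One list entry per written line, header row first.
lines = buffer.getvalue().splitlines()
```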

Subpackages

Submodules

cdm_reader_mapper.properties module

Common Data Model (CDM) reader and mapper common properties.