cdm_reader_mapper package¶
Common Data Model (CDM) reader and mapper package.
- class cdm_reader_mapper.DataBundle(data=None, columns=None, dtypes=None, parse_dates=None, encoding=None, mask=None, imodel=None, mode='data')[source]¶
Bases: object
Container for tabular data and associated metadata.
This class wraps either an in-memory pd.DataFrame or a ParquetStreamReader for chunked, disk-backed processing. It provides a unified interface for accessing DataFrame-like attributes and methods, transparently handling streaming data where required.
- Parameters:
data (pandas.DataFrame or Iterable[pandas.DataFrame] or ParquetStreamReader, optional) – Input data. If an iterable is provided, it is converted into a ParquetStreamReader for streaming.
columns (pandas.Index or pandas.MultiIndex or list, optional) – Column labels used when initializing empty data.
dtypes (pandas.Series or dict, optional) – Data types for columns.
parse_dates (list or bool, optional) – Instructions for parsing dates.
encoding (str, optional) – Encoding associated with the data.
mask (pandas.DataFrame or Iterable[pandas.DataFrame] or ParquetStreamReader, optional) – Boolean mask aligned with data. If not provided, an empty mask is created.
imodel (str, optional) – Name of the input data model.
mode ({"data", "tables"}, default "data") – Data representation mode.
Examples
Getting a DataBundle while reading data from disk.
>>> from cdm_reader_mapper import read_mdf
>>> db = read_mdf(source="file_on_disk", imodel="custom_model_name")
Constructing a DataBundle from already read MDF data.
>>> from cdm_reader_mapper import DataBundle, read_mdf
>>> read = read_mdf(source="file_on_disk", imodel="custom_model_name")
>>> data_ = read.data
>>> mask_ = read.mask
>>> db = DataBundle(data=data_, mask=mask_)
Constructing a DataBundle from already read CDM data.
>>> from cdm_reader_mapper import DataBundle, read_tables
>>> tables = read_tables("path_to_files").data
>>> db = DataBundle(data=tables, mode="tables")
- add(addition, inplace=False)[source]¶
Add information to a DataBundle.
- Parameters:
addition (dict) – Additional elements to add to the DataBundle.
inplace (bool, default: False) – If True, add the datasets to the DataBundle; otherwise return a copy of the DataBundle with the added datasets.
- Return type:
- Returns:
DataBundle or None – DataBundle with added information, or None if inplace=True.
Examples
>>> tables = read_tables("path_to_files")
>>> db = db.add({"data": tables})
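The inplace convention described above recurs in most DataBundle methods. A minimal sketch of that contract follows. ToyBundle and its datasets attribute are hypothetical stand-ins, not part of cdm_reader_mapper. The sketch only mirrors the documented behaviour: inplace=True modifies the object and returns None, while inplace=False returns a modified copy.

```python
import copy


class ToyBundle:
    """Hypothetical stand-in for DataBundle; not the library's class."""

    def __init__(self, **datasets):
        self.datasets = dict(datasets)

    def add(self, addition, inplace=False):
        # Work on the object itself, or on a deep copy of it.
        target = self if inplace else copy.deepcopy(self)
        target.datasets.update(addition)
        # Mirror the documented contract: None when modifying in place.
        return None if inplace else target


original = ToyBundle(data=[1, 2, 3])

# inplace=False: the original is untouched and a new object is returned.
extended = original.add({"mask": [True, True, False]})

# inplace=True: the original is modified and None is returned.
result = original.add({"imodel": "custom_model_name"}, inplace=True)
```

Because the in-place call returns None, assigning its result back to db would discard the bundle.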
- property columns: pandas.core.indexes.base.Index | pandas.core.indexes.multi.MultiIndex¶
Column labels of data.
- Returns:
pd.Index or pd.MultiIndex – Column labels of the underlying MDF data.
- copy()[source]¶
Make a deep copy of a DataBundle.
- Return type:
- Returns:
DataBundle – Copy of the DataBundle.
Examples
>>> db2 = db.copy()
- correct_datetime(imodel=None, inplace=False, **kwargs)[source]¶
Correct datetime information in data.
- Parameters:
imodel (str, optional) – Name of the MDF/CDM data model.
inplace (bool, default: False) – If True, overwrite data in the DataBundle; otherwise return a copy of the DataBundle with datetime-corrected values in data.
**kwargs (Any) – Additional keyword arguments for correcting datetime.
- Return type:
- Returns:
DataBundle or None – DataBundle with corrected datetime information, or None if inplace=True.
See also
DataBundle.correct_pt – Correct platform type information in data.
DataBundle.validate_datetime – Validate datetime information in data.
DataBundle.validate_id – Validate station id information in data.
Notes
For more information see correct_datetime().
Examples
>>> df_dt = db.correct_datetime()
- correct_pt(imodel=None, inplace=False, **kwargs)[source]¶
Correct platform type information in data.
- Parameters:
imodel (str, optional) – Name of the MDF/CDM data model.
inplace (bool, default: False) – If True, overwrite data in the DataBundle; otherwise return a copy of the DataBundle with platform-corrected values in data.
**kwargs (Any) – Additional keyword arguments for correcting platform type.
- Return type:
- Returns:
DataBundle or None – DataBundle with corrected platform type information, or None if inplace=True.
See also
DataBundle.correct_datetime – Correct datetime information in data.
DataBundle.validate_id – Validate station id information in data.
DataBundle.validate_datetime – Validate datetime information in data.
Notes
For more information see correct_pt().
Examples
>>> df_pt = db.correct_pt()
- property data: pandas.core.frame.DataFrame | ParquetStreamReader¶
Underlying MDF data.
- Returns:
pd.DataFrame or ParquetStreamReader – Underlying MDF data.
- property dtypes: pandas.core.series.Series | dict[str, Any] | None¶
Dictionary of data types in data.
- duplicate_check(inplace=False, **kwargs)[source]¶
Duplicate check in data.
- Parameters:
inplace (bool, default: False) – If True, store the duplicate check in this DataBundle; otherwise return a copy of the DataBundle that holds the duplicate check.
**kwargs (Any) – Additional keyword arguments for the duplicate check.
- Return type:
- Returns:
DataBundle or None – DataBundle containing a new DupDetect object for further duplicate check methods, or None if inplace=True.
See also
DataBundle.get_duplicates – Get duplicate matches in data.
DataBundle.flag_duplicates – Flag detected duplicates in data.
DataBundle.remove_duplicates – Remove detected duplicates in data.
Notes
The following columns have to be provided:
longitude
latitude
primary_station_id
report_timestamp
station_course
station_speed
This adds a new DupDetect object to the DataBundle, which is required by the other duplicate check methods.
For more information see duplicate_check().
Examples
>>> db.duplicate_check()
- property encoding: str | None¶
A string representing the encoding to use in the data.
See also
pd.DataFrame.to_csv() – Write data with encoding to CSV file.
- flag_duplicates(inplace=False, **kwargs)[source]¶
Flag detected duplicates in data.
- Parameters:
inplace (bool, default: False) – If True, overwrite data in the DataBundle; otherwise return a copy of the DataBundle with data containing flagged duplicates.
**kwargs (Any) – Additional keyword arguments for flagging duplicates.
- Return type:
- Returns:
DataBundle or None – DataBundle containing duplicate flags in data, or None if inplace=True.
- Raises:
RuntimeError – If no duplicate check has been run with DataBundle.duplicate_check() before flagging duplicates.
See also
DataBundle.remove_duplicates – Remove detected duplicates in data.
DataBundle.get_duplicates – Get duplicate matches in data.
DataBundle.duplicate_check – Duplicate check in data.
Notes
For more information see DupDetect.flag_duplicates().
Examples
Flag duplicates without overwriting data.
>>> flagged_tables = db.flag_duplicates()
Flag duplicates and overwrite data.
>>> db.flag_duplicates(inplace=True)
>>> flagged_tables = db.data
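The required call order (duplicate_check() first, then flag_duplicates(), get_duplicates() or remove_duplicates()) can be illustrated with a toy guard. This is only a sketch under stated assumptions: ToyBundle, its _dup_flags attribute and the exact-match comparison are hypothetical, and the real check relies on recordlinkage rather than on set membership.

```python
class ToyBundle:
    """Hypothetical stand-in for DataBundle; not the library's class."""

    def __init__(self, rows):
        self.rows = list(rows)
        self._dup_flags = None  # filled in by duplicate_check()

    def duplicate_check(self):
        # Toy check: a row is a duplicate if an equal row came earlier.
        seen = set()
        flags = []
        for row in self.rows:
            flags.append(row in seen)
            seen.add(row)
        self._dup_flags = flags

    def flag_duplicates(self):
        # Guard that mirrors the documented RuntimeError.
        if self._dup_flags is None:
            raise RuntimeError("Run duplicate_check() before flagging duplicates.")
        return list(zip(self.rows, self._dup_flags))


db = ToyBundle(["a", "b", "a"])

try:
    db.flag_duplicates()  # no duplicate check yet
except RuntimeError as err:
    error_message = str(err)

db.duplicate_check()
flagged = db.flag_duplicates()
```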
- get_duplicates(**kwargs)[source]¶
Get duplicate matches in data.
- Parameters:
**kwargs (Any) – Additional keyword arguments used for getting duplicates.
- Return type:
- Returns:
pd.DataFrame – DataFrame containing duplicate matches.
- Raises:
RuntimeError – If no duplicate check has been run with DataBundle.duplicate_check() before getting duplicates.
See also
DataBundle.remove_duplicates – Remove detected duplicates in data.
DataBundle.flag_duplicates – Flag detected duplicates in data.
DataBundle.duplicate_check – Duplicate check in data.
Notes
For more information see DupDetect.get_duplicates().
Examples
>>> matches = db.get_duplicates()
- map_model(imodel=None, inplace=False, **kwargs)[source]¶
Map data to the Common Data Model.
- Parameters:
imodel (str, optional) – Name of the MDF/CDM data model.
inplace (bool, default: False) – If True, overwrite data in the DataBundle; otherwise return a copy of the DataBundle with data as CDM tables.
**kwargs (Any) – Additional keyword arguments for mapping to the CDM.
- Return type:
- Returns:
DataBundle or None – DataBundle containing data mapped to the CDM, or None if inplace=True.
Notes
For more information see map_model().
Examples
>>> cdm_tables = db.map_model()
- property mask: pandas.core.frame.DataFrame | ParquetStreamReader¶
MDF validation mask.
- Returns:
pd.DataFrame or ParquetStreamReader – Validation mask of the underlying MDF data.
- property parse_dates: list[Any] | bool | None¶
Information on how to parse dates in data.
See also
pd.read_csv() – Read CSV file using pandas.
- remove_duplicates(inplace=False, **kwargs)[source]¶
Remove detected duplicates in data.
- Parameters:
inplace (bool, default: False) – If True, overwrite data in the DataBundle; otherwise return a copy of the DataBundle with data containing no duplicates.
**kwargs (Any) – Additional keyword arguments used to remove duplicates.
- Return type:
- Returns:
DataBundle or None – DataBundle without duplicated rows, or None if inplace=True.
- Raises:
RuntimeError – If no duplicate check has been run with DataBundle.duplicate_check() before removing duplicates.
See also
DataBundle.flag_duplicates – Flag detected duplicates in data.
DataBundle.get_duplicates – Get duplicate matches in data.
DataBundle.duplicate_check – Duplicate check in data.
Notes
For more information see DupDetect.remove_duplicates().
Examples
Remove duplicates without overwriting data.
>>> removed_tables = db.remove_duplicates()
Remove duplicates and overwrite data.
>>> db.remove_duplicates(inplace=True)
>>> removed_tables = db.data
- replace_columns(df_corr, subset=None, inplace=False, **kwargs)[source]¶
Replace columns in data.
- Parameters:
df_corr (pd.DataFrame) – Data used to replace columns in data.
subset (str or list of str, optional) – Select subset by columns. This option is useful for multi-indexed data.
inplace (bool, default: False) – If True, overwrite data in the DataBundle; otherwise return a copy of the DataBundle with replaced column names in data.
**kwargs (Any) – Additional keyword arguments for replacing columns.
- Return type:
- Returns:
DataBundle or None – DataBundle with replaced column names, or None if inplace=True.
Notes
For more information see replace_columns().
Examples
>>> import pandas as pd
>>> df_corr = pd.read_csv("correction_file_on_disk")
>>> df_repl = db.replace_columns(df_corr)
- select_where_all_false(inplace=False, do_mask=True, **kwargs)[source]¶
Select rows from data where all column entries in mask are False.
- Parameters:
inplace (bool, default: False) – If True, overwrite data in the DataBundle; otherwise return a copy of the DataBundle with invalid values only in data.
do_mask (bool, default: True) – If True, also apply the selection to mask.
**kwargs (Any) – Additional keyword arguments for splitting data where all entries are False.
- Return type:
- Returns:
DataBundle or None – DataBundle containing rows where all column entries in mask are False, or None if inplace=True.
See also
DataBundle.select_where_all_true – Select rows from data where all entries in mask are True.
DataBundle.select_where_entry_isin – Select rows from data where column entries are in a specific value list.
DataBundle.select_where_index_isin – Select rows from data within specific index list.
Notes
For more information see split_by_boolean_false().
Examples
Select without overwriting the old data.
>>> db_selected = db.select_where_all_false()
Select invalid values only, overwriting the old data.
>>> db.select_where_all_false(inplace=True)
>>> df_selected = db.data
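In plain pandas terms, the selection described above amounts to keeping the rows whose mask entries are all False. The following is a minimal sketch with made-up column names. It is an equivalent formulation for in-memory DataFrames and is not taken from the library's implementation.

```python
import pandas as pd

# Illustrative data and mask; the column names are made up.
data = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})
mask = pd.DataFrame({"A": [False, True, False], "B": [False, False, True]})

# A row qualifies only if every mask entry in that row is False.
all_false = ~mask.any(axis=1)

selected_data = data[all_false]
# With do_mask=True the same row selection is applied to the mask.
selected_mask = mask[all_false]
```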
- select_where_all_true(inplace=False, do_mask=True, **kwargs)[source]¶
Select rows from data where all column entries in mask are True.
- Parameters:
inplace (bool, default: False) – If True, overwrite data in the DataBundle; otherwise return a copy of the DataBundle with valid values only in data.
do_mask (bool, default: True) – If True, also apply the selection to mask.
**kwargs (Any) – Additional keyword arguments for splitting data where all entries are True.
- Return type:
- Returns:
DataBundle or None – DataBundle containing rows where all column entries in mask are True, or None if inplace=True.
See also
DataBundle.select_where_all_false – Select rows from data where all entries in mask are False.
DataBundle.select_where_entry_isin – Select rows from data where column entries are in a specific value list.
DataBundle.select_where_index_isin – Select rows from data within specific index list.
Notes
For more information see split_by_boolean_true().
Examples
Select without overwriting the old data.
>>> db_selected = db.select_where_all_true()
Select and overwrite the old data.
>>> db.select_where_all_true(inplace=True)
>>> df_selected = db.data
- select_where_entry_isin(selection, inplace=False, do_mask=True, **kwargs)[source]¶
Select rows from data where column entries are in a specific value list.
- Parameters:
selection (dict) – Keys: Column names in data. Values: Specific value list.
inplace (bool, default: False) – If True, overwrite data in the DataBundle; otherwise return a copy of the DataBundle with selected rows only in data.
do_mask (bool, default: True) – If True, also apply the selection to mask.
**kwargs (Any) – Additional keyword arguments for splitting data where entries are within a specific value list.
- Return type:
- Returns:
DataBundle or None – DataBundle containing rows where column entries are in a specific value list, or None if inplace=True.
See also
DataBundle.select_where_index_isin – Select rows from data within specific index list.
DataBundle.select_where_all_true – Select rows from data where all entries in mask are True.
DataBundle.select_where_all_false – Select rows from data where all entries in mask are False.
Notes
For more information see split_by_column_entries().
Examples
Select without overwriting the old data.
>>> db_selected = db.select_where_entry_isin(
...     selection={("c1", "B1"): [26, 41]},
... )
Select and overwrite the old data.
>>> db.select_where_entry_isin(selection={("c1", "B1"): [26, 41]}, inplace=True)
>>> df_selected = db.data
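The selection dictionary maps column labels to lists of accepted values. A minimal pandas sketch of that idea for multi-indexed columns follows. It assumes that several keys would be combined with a logical AND, which the text above does not state, and it is not the library's implementation.

```python
import pandas as pd

# Illustrative frame with two-level column labels, as in the examples above.
columns = pd.MultiIndex.from_tuples([("c1", "B1"), ("c1", "B2")])
data = pd.DataFrame([[26, "x"], [30, "y"], [41, "z"]], columns=columns)

selection = {("c1", "B1"): [26, 41]}

# Assumption: a row must satisfy every column condition in the selection.
keep = pd.Series(True, index=data.index)
for column, values in selection.items():
    keep &= data[column].isin(values)

selected = data[keep]
```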
- select_where_index_isin(index, inplace=False, do_mask=True, **kwargs)[source]¶
Select rows from data whose index is within a specific index list.
- Parameters:
inplace (bool, default: False) – If True, overwrite data in the DataBundle; otherwise return a copy of the DataBundle with selected rows only in data.
do_mask (bool, default: True) – If True, also apply the selection to mask.
**kwargs (Any) – Additional keyword arguments for splitting data where indexes are within a specific index list.
- Return type:
- Returns:
DataBundle or None – DataBundle containing rows where indexes are within a specific index list, or None if inplace=True.
See also
DataBundle.select_where_entry_isin – Select rows from data where column entries are in a specific value list.
DataBundle.select_where_all_true – Select rows from data where all entries in mask are True.
DataBundle.select_where_all_false – Select rows from data where all entries in mask are False.
Notes
For more information see split_by_index().
Examples
Select without overwriting the old data.
>>> db_selected = db.select_where_index_isin([0, 2, 4])
Select and overwrite the old data.
>>> db.select_where_index_isin(index=[0, 2, 4], inplace=True)
>>> df_selected = db.data
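For an in-memory DataFrame, selecting by index list can be expressed with Index.isin. The sketch below uses made-up values and shows how the same row selection carries over to the mask when do_mask=True. It is a plain pandas equivalent, not the library's code.

```python
import pandas as pd

data = pd.DataFrame({"value": [10, 11, 12, 13, 14]})
mask = pd.DataFrame({"value": [True, False, True, True, False]})

index = [0, 2, 4]

# Boolean array marking the rows whose index label is in the list.
rows = data.index.isin(index)

selected_data = data[rows]
selected_mask = mask[rows]  # do_mask=True applies the same row selection
```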
- split_by_boolean_false(do_mask=True, **kwargs)[source]¶
Split data by rows where all column entries in mask are False.
- Parameters:
- Return type:
- Returns:
tuple – First DataBundle including rows where all column entries in mask are False. Second DataBundle including rows where all column entries in mask are True.
See also
DataBundle.split_by_boolean_true – Split data by rows where all entries in mask are True.
DataBundle.split_by_column_entries – Split data by rows where column entries are in a specific value list.
DataBundle.split_by_index – Split data by rows within specific index list.
Notes
For more information see split_by_boolean_false().
Examples
Split DataBundle.
>>> db_false, db_true = db.split_by_boolean_false()
- split_by_boolean_true(do_mask=True, **kwargs)[source]¶
Split data by rows where all column entries in mask are True.
- Parameters:
- Return type:
- Returns:
tuple – First DataBundle including rows where all column entries in mask are True. Second DataBundle including rows where all column entries in mask are False.
See also
DataBundle.split_by_boolean_false – Split data by rows where all entries in mask are False.
DataBundle.split_by_column_entries – Split data by rows where column entries are in a specific value list.
DataBundle.split_by_index – Split data by rows within specific index list.
Notes
For more information see split_by_boolean_true().
Examples
Split DataBundle.
>>> db_true, db_false = db.split_by_boolean_true()
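Unlike the select_where_* methods, the split_by_* methods return both parts. A minimal pandas sketch of such a two-way split follows. It assumes that the second part is simply the complement of the first (every row that is not all True), which is one reading of the description above and is not confirmed by it.

```python
import pandas as pd

data = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})
mask = pd.DataFrame({"A": [True, True, False], "B": [True, False, True]})

# True for rows in which every mask entry is True.
all_true = mask.all(axis=1)

# First part: rows that are all True. Second part (assumed): the remaining rows.
data_true, data_rest = data[all_true], data[~all_true]
mask_true, mask_rest = mask[all_true], mask[~all_true]
```

Under this assumption the two parts are disjoint and together contain every row of the input.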
- split_by_column_entries(selection, do_mask=True, **kwargs)[source]¶
Split data by rows where column entries are in a specific value list.
- Parameters:
- Return type:
- Returns:
tuple – First DataBundle including rows where column entries are in a specific value list. Second DataBundle including rows where column entries are not in a specific value list.
See also
DataBundle.split_by_index – Split data by rows within specific index list.
DataBundle.split_by_boolean_true – Split data by rows where all entries in mask are True.
DataBundle.split_by_boolean_false – Split data by rows where all entries in mask are False.
Notes
For more information see split_by_column_entries().
Examples
Split DataBundle.
>>> db_isin, db_isnotin = db.split_by_column_entries(
...     selection={("c1", "B1"): [26, 41]},
... )
- split_by_index(index, do_mask=True, **kwargs)[source]¶
Split data by rows within specific index list.
- Parameters:
- Return type:
- Returns:
tuple – First DataBundle including rows within specific index list. Second DataBundle including rows outside specific index list.
See also
DataBundle.split_by_column_entries – Split data by rows where column entries are in a specific value list.
DataBundle.split_by_boolean_true – Split data by rows where all entries in mask are True.
DataBundle.split_by_boolean_false – Split data by rows where all entries in mask are False.
Notes
For more information see split_by_index().
Examples
Split DataBundle.
>>> db_isin, db_isnotin = db.split_by_index([0, 2, 4])
- stack_h(other, datasets=('data', 'mask'), inplace=False, **kwargs)[source]¶
Stack multiple DataBundles horizontally.
- Parameters:
other (DataBundle or Sequence of DataBundle) – Other DataBundle(s) to stack horizontally.
datasets (str or Sequence of str, default: (data, mask)) – List of datasets to be stacked.
inplace (bool, default: False) – If True, overwrite the datasets in the DataBundle; otherwise return a copy of the DataBundle with stacked datasets.
**kwargs (Any) – Additional keyword arguments for stacking DataFrames horizontally.
- Return type:
- Returns:
DataBundle or None – Horizontally stacked DataBundle, or None if inplace=True.
See also
DataBundle.stack_v – Stack multiple DataBundles vertically.
Notes
This only works with pd.DataFrames, not with iterables of pd.DataFrames!
The DataFrames in the DataBundles may have different data columns!
Examples
>>> db = db1.stack_h(db2, datasets=["data", "mask"])
- stack_v(other, datasets=('data', 'mask'), inplace=False, **kwargs)[source]¶
Stack multiple DataBundles vertically.
- Parameters:
other (DataBundle or Sequence of DataBundle) – Other DataBundle(s) to stack vertically.
datasets (str or Sequence of str, default: (data, mask)) – List of datasets to be stacked.
inplace (bool, default: False) – If True, overwrite the datasets in the DataBundle; otherwise return a copy of the DataBundle with stacked datasets.
**kwargs (Any) – Additional keyword arguments for stacking DataFrames vertically.
- Return type:
- Returns:
DataBundle or None – Vertically stacked DataBundle, or None if inplace=True.
See also
DataBundle.stack_h – Stack multiple DataBundles horizontally.
Notes
This only works with pd.DataFrames, not with iterables of pd.DataFrames!
The DataFrames in the DataBundles have to have the same data columns!
Examples
>>> db = db1.stack_v(db2, datasets=["data", "mask"])
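The difference between the two stacking directions, and the reason for the differing column requirements in the notes, can be sketched with pandas.concat on plain DataFrames. This is an illustration of the concept only. Whether the library uses concat, and whether it resets the index as done here with ignore_index=True, are assumptions.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"A": [5, 6], "B": [7, 8]})
df3 = pd.DataFrame({"C": [9, 10]})

# Vertical stacking: frames share the same columns, rows are appended.
stacked_v = pd.concat([df1, df2], axis=0, ignore_index=True)

# Horizontal stacking: columns may differ, rows are aligned on the index.
stacked_h = pd.concat([df1, df3], axis=1)
```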
- unique(**kwargs)[source]¶
Get unique values of data.
- Parameters:
**kwargs (Any) – Additional keyword arguments for getting unique values.
- Return type:
- Returns:
dict – Dictionary with unique values.
Notes
For more information see unique().
Examples
>>> db.unique(columns=("c1", "B1"))
- validate_datetime(imodel=None, **kwargs)[source]¶
Validate datetime information in data.
- Parameters:
imodel (str, optional) – Name of the MDF/CDM data model.
**kwargs (Any) – Additional keyword arguments for validating datetime.
- Return type:
- Returns:
pd.DataFrame – DataFrame containing True and False values for each index in data. True: All datetime information in the data row is valid. False: At least one piece of datetime information in the data row is invalid.
See also
DataBundle.validate_id – Validate station id information in data.
DataBundle.correct_datetime – Correct datetime information in data.
DataBundle.correct_pt – Correct platform type information in data.
Notes
For more information see validate_datetime().
Examples
>>> val_dt = db.validate_datetime()
- validate_id(imodel=None, **kwargs)[source]¶
Validate station id information in data.
- Parameters:
imodel (str, optional) – Name of the MDF/CDM data model.
**kwargs (Any) – Additional keyword arguments for validating station id.
- Return type:
- Returns:
pd.DataFrame – DataFrame containing True and False values for each index in data. True: All station ID information in the data row is valid. False: At least one piece of station ID information in the data row is invalid.
See also
DataBundle.validate_datetime – Validate datetime information in data.
DataBundle.correct_pt – Correct platform type information in data.
DataBundle.correct_datetime – Correct datetime information in data.
Notes
For more information see validate_id().
Examples
>>> val_id = db.validate_id()
- write(dtypes=None, parse_dates=None, encoding=None, mode=None, **kwargs)[source]¶
Write data to disk.
- Parameters:
dtypes (dict, optional) – Data types of data.
parse_dates (list or bool, optional) – Information on how to parse dates in data.
encoding (str, optional) – The encoding of the input file. Overrides the value in the imodel schema file.
mode ({data, tables}, optional) – Data mode.
**kwargs (Any) – Additional keyword arguments for writing data to disk.
See also
write_data – Write MDF data and validation mask to disk.
write_tables – Write CDM tables to disk.
read – Read original marine-meteorological data as well as MDF data or CDM tables from disk.
read_data – Read MDF data and validation mask from disk.
read_mdf – Read original marine-meteorological data from disk.
read_tables – Read CDM tables from disk.
- Return type:
Notes
If mode is “data”, write data using write_data(). If mode is “tables”, write data using write_tables().
Examples
>>> db.write()
- class cdm_reader_mapper.DupDetect(data, compared, method, method_kwargs, compare_kwargs)[source]¶
Bases: object
Class to detect, flag, and remove duplicate entries in a DataFrame using a comparison matrix from recordlinkage.
- Parameters:
data (pd.DataFrame) – Original dataset.
compared (pd.DataFrame) – Comparison matrix of the dataset.
method (str) – Duplicate detection method used for recordlinkage indexing.
method_kwargs (dict) – Keyword arguments for recordlinkage indexing method.
compare_kwargs (dict) – Keyword arguments used for recordlinkage.Compare.
- flag_duplicates(keep='first', limit='default', equal_musts=None)[source]¶
Get result dataset with flagged duplicates.
- Parameters:
keep (str or int, default: first) – Which entry should be kept in the result dataset.
limit (str, int or float, optional) – Limit of the total score that has to be exceeded for an entry to be declared a duplicate. Defaults to .991.
equal_musts (str or list, optional) – Column name(s) whose values must be exactly equal for an entry to be declared a duplicate. Default: All column names found in method_kwargs.
- Return type:
- Returns:
pd.DataFrame – Input DataFrame with flagged duplicates, including duplicate_status and quality_flag.
- get_duplicates(keep='first', limit='default', equal_musts=None, overwrite=True)[source]¶
Identify duplicate matches based on the comparison matrix.
- Parameters:
keep (str or int) – Which entry to keep: ‘first’, ‘last’, or -1, 0.
limit (str or float, optional, default: default) – Threshold of the total similarity score above which a pair is considered a duplicate.
equal_musts (str or list[str], optional) – Columns that must match exactly.
overwrite (bool, default: True) – Whether to recompute matches if already calculated.
- Return type:
- Returns:
pd.DataFrame – DataFrame containing matched duplicates.
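The role of limit can be sketched on a small, made-up comparison matrix. The sketch rests on an assumption that the text above does not confirm: the total score is taken to be the mean of the per-column similarity scores. The pair index and the column names are illustrative only, and the 0.991 value is the default quoted for flag_duplicates().

```python
import pandas as pd

# Hypothetical comparison matrix: one row per candidate pair,
# one similarity score between 0 and 1 per compared column.
compared = pd.DataFrame(
    {
        "latitude": [1.0, 1.0, 0.9],
        "longitude": [1.0, 0.95, 1.0],
        "primary_station_id": [1.0, 1.0, 0.0],
    },
    index=pd.MultiIndex.from_tuples([(0, 1), (0, 2), (3, 4)]),
)

limit = 0.991

# Assumption: the total score is the mean of the per-column scores.
total_score = compared.mean(axis=1)

# A pair counts as a duplicate only if its score exceeds the limit.
is_duplicate = total_score > limit
matches = compared[is_duplicate]
```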
- cdm_reader_mapper.correct_datetime(data, imodel, log_level='INFO', base=None)[source]¶
Apply ICOADS deck specific datetime corrections.
- Parameters:
data (pandas.DataFrame or Iterable[pd.DataFrame]) – Input dataset.
imodel (str) – Name of internally available data model, e.g. icoads_r300_d704.
log_level (str, default: INFO) – Level of logging information to save.
base (str, optional) – Base path for datetime correction metadata. If None, use the internal correction path.
- Return type:
- Returns:
pandas.DataFrame or Iterable[pd.DataFrame] – A pandas.DataFrame or Iterable[pd.DataFrame] with the adjusted data.
- Raises:
ValueError – If _correct_dt raises an error during correction.
TypeError – If data is not a pd.DataFrame or an Iterable[pd.DataFrame]. If data is a pd.Series.
- cdm_reader_mapper.correct_pt(data, imodel, log_level='INFO', base=None)[source]¶
Apply ICOADS deck specific platform ID corrections.
- Parameters:
data (pandas.DataFrame or Iterable[pd.DataFrame]) – Input dataset.
imodel (str) – Name of internally available data model, e.g. icoads_r300_d704.
log_level (str, default: INFO) – Level of logging information to save.
base (str, optional) – Base path for datetime correction metadata. If None, use the internal correction path.
- Return type:
- Returns:
pandas.DataFrame or Iterable[pd.DataFrame] – A pandas.DataFrame or Iterable[pd.DataFrame] with the adjusted data.
- Raises:
ValueError – If _correct_pt raises an error during correction. If the platform column is not defined in the properties file.
TypeError – If data is not a pd.DataFrame or an Iterable[pd.DataFrame]. If data is a pd.Series.
- cdm_reader_mapper.duplicate_check(data, method='SortedNeighbourhood', method_kwargs=None, compare_kwargs=None, table_name=None, ignore_columns=None, ignore_entries=None, offsets=None, reindex_by_null=True, null_label='null')[source]¶
Run a duplicate check on a dataset using recordlinkage.
Returns a DupDetect object.
- Parameters:
data (pandas.DataFrame) – Dataset for duplicate check.
method (str, default: SortedNeighbourhood) – Duplicate check method for recordlinkage.
method_kwargs (dict, optional) – Keyword arguments for recordlinkage duplicate check. Defaults to _method_kwargs.
compare_kwargs (dict, optional) – Keyword arguments for recordlinkage.Compare object. Defaults to _compare_kwargs.
table_name (str, optional) – Name of the CDM table to be selected from data.
ignore_columns (str or list, optional) – Name of data columns to be ignored for duplicate check.
ignore_entries (dict, optional) – Key: Column name. Value: value to be ignored. E.g. ignore_entries={“station_speed”: null}.
offsets (dict, optional) – Change offsets for recordlinkage Compare object. Key: Column name. Value: new offset. E.g. offsets={“latitude”: 0.1}.
reindex_by_null (bool, optional) – If True, data is re-indexed in ascending order according to the number of nulls in each row.
null_label (str, optional) – Null label which is used if reindex_by_null is True.
- Return type:
- Returns:
cdm_reader_mapper.DupDetect – A DupDetect instance.
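The reindex_by_null option orders rows by how many null entries they contain. A minimal pandas sketch of that ordering follows, with made-up values and the default null_label of "null". It illustrates the described behaviour and is not taken from the library's code.

```python
import pandas as pd

null_label = "null"

# Illustrative data in which missing values are marked with the null label.
data = pd.DataFrame(
    {
        "latitude": ["10.0", "null", "12.5"],
        "longitude": ["null", "null", "3.0"],
        "station_speed": ["5", "null", "7"],
    }
)

# Count the null labels per row.
null_counts = (data == null_label).sum(axis=1)

# Reorder rows by that count in ascending order; a stable sort keeps the
# original order among rows with the same count.
order = null_counts.sort_values(kind="stable").index
reindexed = data.loc[order]
```

Rows with the fewest nulls come first in the reordered frame.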
- cdm_reader_mapper.map_model(data, imodel, cdm_subset=None, codes_subset=None, cdm_complete=True, drop_missing_obs=True, drop_duplicates=True, log_level='INFO')[source]¶
Map a pandas DataFrame to the CDM header and observational tables.
- Parameters:
data (pandas.DataFrame or Iterable[pd.DataFrame]) – Input data to map.
imodel (str) – A specific mapping from a generic data model to the CDM, for example mapping a SID-DCK from IMMA1’s core and attachments to the CDM in a specific way, e.g. icoads_r300_d704.
cdm_subset (str or list, optional) – Subset of CDM model tables to map. Defaults to the full set of CDM tables defined for the imodel.
codes_subset (str or list, optional) – Subset of code mapping tables to map. Defaults to the full set of code mapping tables defined for the imodel.
cdm_complete (bool, default: True) – If True, map the entire CDM tables list.
drop_missing_obs (bool, default: True) – If True, drop observations without a valid observation value, e.g. no air_temperature value.
drop_duplicates (bool, default: True) – If True, drop duplicated rows.
log_level (str, default: INFO) – Level of logging information to save.
- Return type:
- Returns:
cdm_tables (pandas.DataFrame) – DataFrame with MultiIndex columns (cdm_table, column_name).
- Raises:
If imodel is not defined. If the first split entry (‘_’) of imodel is not defined. If mapping does not return a DataFrame.
If the type of imodel is not supported. If anything during mapping fails.
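The returned layout, a single DataFrame whose columns form a two-level MultiIndex of (cdm_table, column_name), can be sketched as follows. The table names, column names and values are illustrative only and do not come from an actual mapping.

```python
import pandas as pd

# Illustrative table and column names only.
columns = pd.MultiIndex.from_tuples(
    [
        ("header", "report_id"),
        ("header", "latitude"),
        ("observations-at", "report_id"),
        ("observations-at", "observation_value"),
    ],
    names=["cdm_table", "column_name"],
)
cdm_tables = pd.DataFrame(
    [["r1", 54.2, "r1", 281.4], ["r2", 60.1, "r2", 279.9]],
    columns=columns,
)

# Selecting on the first level returns one table with plain column labels.
header = cdm_tables["header"]

# The distinct table names are the unique values of the first level.
table_names = list(cdm_tables.columns.get_level_values("cdm_table").unique())
```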
- cdm_reader_mapper.read(source, mode='mdf', **kwargs)[source]¶
Read either original marine-meteorological data or MDF data or CDM tables from disk.
- Parameters:
source (str) – Source of the input data.
mode (str, {mdf, data, tables}, default: mdf) – Read data mode:
“mdf” to read original marine-meteorological data from disk and convert them to MDF data
“data” to read MDF data from disk
“tables” to read CDM tables from disk. Map MDF data to CDM tables with DataBundle.map_model().
**kwargs (Any) – Additional keyword arguments passed to the reader function.
- Return type:
- Returns:
DataBundle – Containing read data as pd.DataFrame or Iterable of pd.DataFrames.
See also
read_mdf – Read original marine-meteorological data from disk.
read_data – Read MDF data and validation mask from disk.
read_tables – Read CDM tables from disk.
write – Write either MDF data or CDM tables to disk.
write_data – Write MDF data and validation mask to disk.
write_tables – Write CDM tables to disk.
Notes
kwargs are the keyword arguments for the specific mode reader.
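The mode-based forwarding of kwargs can be sketched as a simple dispatch table. Everything below is a stand-in: the three reader functions are stubs that only echo their arguments, and raising ValueError for an unknown mode is an assumption, not documented behaviour.

```python
def read_mdf(source, **kwargs):
    """Stub standing in for the real reader; it only echoes its arguments."""
    return ("mdf", source, kwargs)


def read_data(source, **kwargs):
    """Stub standing in for the real reader; it only echoes its arguments."""
    return ("data", source, kwargs)


def read_tables(source, **kwargs):
    """Stub standing in for the real reader; it only echoes its arguments."""
    return ("tables", source, kwargs)


READERS = {"mdf": read_mdf, "data": read_data, "tables": read_tables}


def read(source, mode="mdf", **kwargs):
    # Pick the reader for the requested mode.
    if mode not in READERS:
        # Assumption: unknown modes are rejected with a ValueError.
        raise ValueError(f"Unknown mode {mode!r}; expected one of {sorted(READERS)}")
    # Forward all remaining keyword arguments unchanged.
    return READERS[mode](source, **kwargs)


result = read("file_on_disk", mode="data", data_format="csv")
```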
- cdm_reader_mapper.read_data(data_file, mask_file=None, info_file=None, data_format='parquet', imodel=None, col_subset=None, encoding=None, delimiter=None, **kwargs)[source]¶
Read MDF data which is already on a pre-defined data model.
- Parameters:
data_file (str) – The data file (including path) to be read.
mask_file (str, optional) – The validation file (including path) to be read.
info_file (str, optional) – The information file (including path) to be read.
data_format ({"csv", "parquet", "feather"}, default: "parquet") – Format of input data file(s).
imodel (str, optional) – Name of internally available input data model, e.g. icoads_r300_d704.
col_subset (str, tuple or list, optional) – Specify the section or sections of the file to read.
For multiple sections of the tables: e.g. col_subset = [columns0,…,columnsN]
For a single section: e.g. list type object col_subset = [columns]
Column labels can be either strings or tuples.
encoding (str, optional) – The encoding of the input file. Overrides the value in the imodel schema file.
delimiter (str, optional) – The delimiter used in the input file. Overrides the value in the imodel schema file.
**kwargs (Any) – Keyword arguments that will be passed to the read function.
- Return type:
- Returns:
cdm_reader_mapper.DataBundle – DataBundle containing MDF data.
See also
read – Read original marine-meteorological data as well as MDF data or CDM tables from disk.
read_mdf – Read original marine-meteorological data from disk.
read_tables – Read CDM tables from disk.
write – Write either MDF data or CDM tables to disk.
write_data – Write MDF data and validation mask to disk.
write_tables – Write CDM tables to disk.
- cdm_reader_mapper.read_mdf(source, imodel=None, ext_schema_path=None, ext_schema_file=None, ext_table_path=None, year_init=None, year_end=None, encoding=None, chunksize=None, skiprows=None, convert_flag=True, converter_dict=None, converter_kwargs=None, decode_flag=True, decoder_dict=None, validate_flag=True, sections=None, excludes=None, pd_kwargs=None, xr_kwargs=None)[source]¶
Read data files compliant with a user specific data model.
Reads a data file into a pandas DataFrame using a pre-defined data model. The data read is validated against its data model, producing a boolean mask on output.
The data model needs to be input to the module as a named model (included in the module) or as the path to a valid data model.
- Parameters:
source (
str) – The file (including path) to be read.imodel (
str) – Name of internally available input data model, e.g. icoads_r300_d704.ext_schema_path (
strorPath-like, optional) – The path to the external input data model schema file. The schema file must have the same name as the directory. One of imodel and ext_schema_path or ext_schema_file must be set.ext_schema_file (
strorPath-like, optional) – The external input data model schema file. One of imodel and ext_schema_path or ext_schema_file must be set.ext_table_path (
strorPath-like, optional) – The path to the external table file. The table file must have the same name as the directory.year_init (
strorint, optional) – Left border of time axis.year_end (
strorint, optional) – Right border of time axis.encoding (
str, optional) – The encoding of the input file. Overrides the value in the imodel schema file.chunksize (
int, optional) – Number of reports per chunk.skiprows (
int, optional) – Number of initial rows to skip from file.convert_flag (
bool, default:True) – If True convert entries by using a pre-defined data model.converter_dict (
dictof{Hashable: func}, optional) – Functions for converting values in specific columns. If None use information from a pre-defined data model.converter_kwargs (
dictof{Hashable: kwargs}, optional) – Key-word arguments for converting values in specific columns. If None use information from a pre-defined data model.decode_flag (
bool, default:True) – If True decode entries by using a pre-defined data model.decoder_dict (
dictof{Hashable: func}, optional) – Functions for decoding values in specific columns. If None use information from a pre-defined data model.validate_flag (
bool, default:True) – Validate data entries by using a pre-defined data model.sections (
list, optional) – List with subset of data model sections to output. If None read pre-defined data model sections.excludes (
strorlistofstr, optional) – MDF Sections to exclude.pd_kwargs (
dict, optional) – Additional pandas arguments.xr_kwargs (
dict, optional) – Additional xarray arguments.
- Return type:
- Returns:
cdm_reader_mapper.DataBundle– DataBundle containing MDF data.
See also
readRead either original marine-meteorological or MDF data or CDM tables from disk.
read_dataRead MDF data and validation mask from disk.
read_tablesRead CDM tables from disk.
writeWrite either MDF data or CDM tables to disk.
write_dataWrite MDF data and validation mask to disk.
write_tablesWrite CDM tables to disk.
- cdm_reader_mapper.read_tables(source, data_format='parquet', prefix=None, suffix=None, extension=None, separator='-', cdm_subset=None, col_subset=None, delimiter='|', na_values=None, null_label='null', from_str=None, to_str=None, imodel=None, **kwargs)[source]¶
Read CDM-table-like files from file system to a pandas.DataFrame.
- Parameters:
source (
str) – The file (including path) or the path to the file(s) to be read.data_format (
{"csv", "parquet", "feather"}, default:"parquet") – Format of input data file(s).prefix (
str, optional) – Prefix of file name structure: <prefix>-<table>-*<suffix>.<extension>. Can be used if source is a valid directory path.suffix (
str, optional) – Suffix of file name structure: <prefix>-<table>-*<suffix>.<extension>. Can be used if source is a valid directory path.extension (
str, optional) – Extension of file name structure: <prefix>-<table>-*<suffix>.<extension>. Can be used if source is a valid directory path.separator (
str, default:-) – Separator to join the file name pattern components.cdm_subset (
strorlist, optional) – Specifies a subset of tables or a single table.For multiple subsets of tables: This function returns a pandas.DataFrame that is multi-index at the columns, with (table-name, field) as column names. Tables are merged via the report_id field.
For a single table: This function returns a pandas.DataFrame with a simple indexing for the columns.
Required if source is a valid file name.
col_subset (
str,listordict, optional) – Specify the section or sections of the file to read.For multiple sections of the tables: e.g col_subset = {table0:[columns0],…tableN:[columnsN]}
For a single section: e.g. list type object col_subset = [columns]. This variable assumes that all column names conform to the CDM field names.
delimiter (
str, default:|) – Character or regex pattern to treat as the delimiter while reading with pandas.read_csv.na_values (hashable, iterable of hashable or
dictof{Hashable: Iterable}, optional) – Additional strings to recognize as Na/NaN while reading input file with pandas.read_csv. For more details see: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.htmlnull_label (
str, default:null) – String used to label invalid values in data.from_str (
bool, optional) – If True convert original string data to imodel-specific data types.to_str (
bool, optional) – If True convert original imodel-specific data types to strings.imodel (str, optional) – Name of data model, e.g. icoads. Must be set if either from_str or to_str is set.
**kwargs (
Any) – Additional keyword arguments passed to the data reader.
- Return type:
- Returns:
cdm_reader_mapper.DataBundle– DataBundle instance containing successfully read CDM table(s).
See also
readRead either original marine-meteorological data or MDF data or CDM tables from disk.
read_dataRead MDF data and validation mask from disk.
read_mdfRead original marine-meteorological data from disk.
writeWrite either MDF data or CDM tables to disk.
write_tablesWrite CDM tables to disk.
write_dataWrite MDF data and validation mask to disk.
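The file name structure `<prefix>-<table>-*<suffix>.<extension>` described for `prefix`, `suffix` and `extension` can be pictured as a glob pattern. The sketch below only illustrates that naming scheme with the standard library. `build_pattern` is a hypothetical helper and the file names are made up; neither is part of cdm_reader_mapper, and how the package treats omitted components is not covered here.

```python
from fnmatch import fnmatch


def build_pattern(prefix, table, suffix, extension, separator="-"):
    """Hypothetical helper: assemble <prefix><sep><table><sep>*<suffix>.<extension>."""
    return f"{prefix}{separator}{table}{separator}*{suffix}.{extension}"


pattern = build_pattern("icoads", "header", "v1", "parquet")

# Made-up file names, used only to show which ones the pattern would match.
candidates = [
    "icoads-header-2010-01-v1.parquet",
    "icoads-observations-at-2010-01-v1.parquet",
    "other-header-2010-01-v1.parquet",
]
matches = [name for name in candidates if fnmatch(name, pattern)]
```

Only the first candidate shares the prefix, table name, suffix and extension, so it is the only entry left in `matches`.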
- cdm_reader_mapper.replace_columns(df_l, df_r, pivot_c=None, pivot_l=None, pivot_r=None, rep_c=None, rep_map=None)[source]¶
Replace columns in one DataFrame using row-matching from another.
This function works for both a pd.DataFrame and any Iterable of pandas DataFrames.
- Parameters:
df_l (
pandas.DataFrameorIterable[pd.DataFrame]) – The left DataFrame whose columns will be replaced.df_r (
pandas.DataFrameorIterable[pd.DataFrame]) – The right DataFrame providing replacement values.pivot_c (
str, optional) – A single pivot column present in both DataFrames. Overrides pivot_l and pivot_r.pivot_l (
str, optional) – Pivot column in df_l. Used only when pivot_c is not supplied.pivot_r (
str, optional) – Pivot column in df_r. Used only when pivot_c is not supplied.rep_c (
strorlistofstr, optional) – One or more column names to replace in df_l. Ignored if rep_map is supplied.rep_map (
dict, optional) – Mapping between left and right column names as {left_col: right_col}.
- Returns:
pd.DataFrameorParquetStreamReader– Updated data with replacements applied.- Raises:
TypeError – If df_l or df_r is not a pandas DataFrame.
If one of pivot_l and pivot_r is not defined. - If rep_map and rep_c are not defined. - If replacement source columns are not found in df_r.
Notes
This function logs errors and returns None instead of raising exceptions.
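The idea of replacing columns through row-matching on a pivot column can be illustrated with plain pandas. This is a minimal sketch of the concept, not the implementation of `replace_columns`; the column names `report_id` and `temp` are invented for the example, and every left row has a match on the right, so the handling of unmatched rows is left open.

```python
import pandas as pd

# Left frame: values to be replaced; right frame: replacement values.
df_l = pd.DataFrame({"report_id": ["a", "b", "c"], "temp": [1.0, 2.0, 3.0]})
df_r = pd.DataFrame({"report_id": ["c", "a", "b"], "temp": [30.0, 10.0, 20.0]})

# Match rows on the shared pivot column and pull the replacement values
# from the right frame into a copy of the left one.
lookup = df_r.set_index("report_id")["temp"]
replaced = df_l.copy()
replaced["temp"] = replaced["report_id"].map(lookup)
```

Here `report_id` plays the role of `pivot_c` and `temp` the role of `rep_c`. The row order of the left frame is preserved even though the right frame is ordered differently.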
- cdm_reader_mapper.split_by_boolean(data, mask, boolean, reset_index=False, inverse=False, return_rejected=False)[source]¶
Split both a DataFrame and an Iterable of DataFrames using a boolean mask via
split_dataframe_by_boolean.- Parameters:
data (
pd.DataFrameor iterable ofpd.DataFrame) – DataFrame to be split.mask (
pd.DataFrameor iterable ofpd.DataFrame) – Boolean mask with the same length as data.boolean (
bool) – Determines mask interpretation:If True select rows where all mask columns are True.
If False select rows where any mask column is False.
reset_index (
bool, optional) – If True reset the index of returned DataFrames.inverse (
bool, optional) – If True invert the selection performed by the underlying function.return_rejected (
bool, optional) – If True return rejected rows as the second output. If False the rejected output is empty but dtype-preserving.
- Return type:
tuple[DataFrame|ParquetStreamReader,DataFrame|ParquetStreamReader,Index|MultiIndex,Index|MultiIndex]- Returns:
tupleofpd.DataFrameorParquetStreamReaderandpd.DataFrameorParquetStreamReaderandpd.Indexorpd.MultiIndexandpd.Indexorpd.MultiIndex– Selected rows (all mask columns True), rejected rows, original indexes of selection and original indexes of rejection.
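The two interpretations of `boolean` described above can be reproduced with plain pandas. The sketch below only illustrates the row selection logic on an in-memory DataFrame; it is not the package's implementation, and the column names are invented.

```python
import pandas as pd

data = pd.DataFrame({"sst": [10.1, 11.2, 12.3], "at": [8.0, 9.0, 7.5]})
mask = pd.DataFrame({"sst": [True, False, True], "at": [True, True, False]})

# boolean=True: keep rows where every mask column is True.
all_true = mask.all(axis=1)
selected_true = data[all_true]

# boolean=False: keep rows where at least one mask column is False.
any_false = ~mask.all(axis=1)
selected_false = data[any_false]
```

With these inputs the two selections are complementary: each row ends up in exactly one of them.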
- cdm_reader_mapper.split_by_boolean_false(data, mask, reset_index=False, inverse=False, return_rejected=False)[source]¶
Split either a DataFrame or an Iterable of DataFrames where the boolean mask is False.
- Parameters:
data (
pd.DataFrameorIterable[pd.DataFrame]) – DataFrame to be split.mask (
pd.DataFrameorIterable[pd.DataFrame]) – Boolean mask with the same length as data.reset_index (
bool, optional) – If True reset indices in returned DataFrames.inverse (
bool, optional) – If True invert the selection.return_rejected (
bool, optional) – If True return rejected rows as the second output. If False the rejected output is empty but dtype-preserving.
- Return type:
tuple[DataFrame|ParquetStreamReader,DataFrame|ParquetStreamReader,Index|MultiIndex,Index|MultiIndex]- Returns:
tupleofpd.DataFrameorParquetStreamReaderandpd.DataFrameorParquetStreamReaderandpd.Indexorpd.MultiIndexandpd.Indexorpd.MultiIndex– Selected rows (all mask columns True), rejected rows, original indexes of selection and original indexes of rejection.
- cdm_reader_mapper.split_by_boolean_true(data, mask, reset_index=False, inverse=False, return_rejected=False)[source]¶
Split either a DataFrame or an Iterable of DataFrames where the boolean mask is True.
- Parameters:
data (
pd.DataFrameor iterable ofpd.DataFrame) – DataFrame to be split.mask (
pd.DataFrameor iterable ofpd.DataFrame) – Boolean mask with the same length as data.reset_index (
bool, optional) – If True reset indices in returned DataFrames.inverse (
bool, optional) – If True invert the selection.return_rejected (
bool, optional) – If True return rejected rows as the second output. If False the rejected output is empty but dtype-preserving.
- Return type:
tuple[DataFrame|ParquetStreamReader,DataFrame|ParquetStreamReader,Index|MultiIndex,Index|MultiIndex]- Returns:
tupleofpd.DataFrameorParquetStreamReaderandpd.DataFrameorParquetStreamReaderandpd.Indexorpd.MultiIndexandpd.Indexorpd.MultiIndex– Selected rows (all mask columns True), rejected rows, original indexes of selection and original indexes of rejection.
- cdm_reader_mapper.split_by_column_entries(data, selection, reset_index=False, inverse=False, return_rejected=False)[source]¶
Split either a DataFrame or an Iterable of DataFrames based on matching values in a given column.
- Parameters:
data (
pd.DataFrameor iterable ofpd.DataFrame) – DataFrame to be split.selection (
dict) – Mapping of a column name to an iterable of allowed values. Example: {“city”: [“London”, “Berlin”]}.reset_index (
bool, optional) – Whether to reset index in returned DataFrames.inverse (
bool, optional) – If True invert the selection.return_rejected (
bool, optional) – If True return rejected rows as the second output. If False the rejected output is empty but dtype-preserving.
- Return type:
tuple[DataFrame|ParquetStreamReader,DataFrame|ParquetStreamReader,Index|MultiIndex,Index|MultiIndex]- Returns:
tupleofpd.DataFrameorParquetStreamReaderandpd.DataFrameorParquetStreamReaderandpd.Indexorpd.MultiIndexandpd.Indexorpd.MultiIndex– Selected rows (all mask columns True), rejected rows, original indexes of selection and original indexes of rejection.
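A selection mapping such as {"city": ["London", "Berlin"]} corresponds to an `isin` test on that column. The following is a minimal pandas sketch of that idea, assuming a single column in the mapping; it is not the package's implementation.

```python
import pandas as pd

data = pd.DataFrame(
    {"city": ["London", "Paris", "Berlin"], "value": [1, 2, 3]},
    index=[10, 11, 12],
)
selection = {"city": ["London", "Berlin"]}

# Take the single column/values pair from the mapping.
column, allowed = next(iter(selection.items()))
is_selected = data[column].isin(allowed)

selected = data[is_selected]
rejected = data[~is_selected]

# Original index labels of both parts.
selected_index = selected.index
rejected_index = rejected.index
```

The four names at the end mirror the four documented outputs: selected rows, rejected rows, and the original index labels of each.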
- cdm_reader_mapper.split_by_index(data, index, reset_index=False, inverse=False, return_rejected=False)[source]¶
Split either a DataFrame or an Iterable of DataFrames by selecting specific index labels.
- Parameters:
data (
pd.DataFrameor iterable ofDataFrame) – DataFrame to be split.index (
labelorsequenceoflabels) – Index values to select.reset_index (
bool, optional) – If True reset index in returned DataFrames.inverse (
bool, optional) – If True select rows not in index.return_rejected (
bool, optional) – If True return rejected rows as the second output. If False the rejected output is empty but dtype-preserving.
- Return type:
tuple[DataFrame|ParquetStreamReader,DataFrame|ParquetStreamReader,Index|MultiIndex,Index|MultiIndex]- Returns:
tupleofpd.DataFrameorParquetStreamReaderandpd.DataFrameorParquetStreamReaderandpd.Indexorpd.MultiIndexandpd.Indexorpd.MultiIndex– Selected rows (all mask columns True), rejected rows, original indexes of selection and original indexes of rejection.
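Selecting by index labels, and the effect described for `inverse`, can be sketched with plain pandas as follows. This illustrates the selection logic only and is not the package's implementation.

```python
import pandas as pd

data = pd.DataFrame({"value": [1, 2, 3, 4]}, index=[10, 11, 12, 13])
index = [11, 13]

in_index = data.index.isin(index)
selected = data[in_index]           # rows whose labels are in `index`
selected_inverse = data[~in_index]  # rows not in `index`, as described for inverse=True
```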
- cdm_reader_mapper.unique(data, columns=None)¶
Count unique values per column in a DataFrame or an Iterable of DataFrames.
- Parameters:
data (
pandas.DataFrameorIterable[pd.DataFrame]) – Input dataset.columns (
str,listortuple, optional) – Name(s) of the data column(s) to be selected. If None, all columns are used.
- Return type:
- Returns:
Dict[str | tuple[str,str],int]– Dictionary where each key is a column name, and each value is a dictionary mapping unique values (including NaN as ‘nan’) to their counts.
Notes
Works with large files via ParquetStreamReader by iterating through chunks.
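Counting unique values chunk by chunk, with missing values reported under the key 'nan', could look roughly like the sketch below. `count_unique` is a hypothetical helper written for this illustration, the column name `deck` is invented, and a plain list of DataFrames stands in for a chunked reader.

```python
from collections import Counter

import pandas as pd


def count_unique(chunks, column):
    """Hypothetical helper: accumulate value counts for one column across chunks."""
    counts = Counter()
    for chunk in chunks:
        # dropna=False keeps missing values so they can be reported as 'nan'.
        for value, n in chunk[column].value_counts(dropna=False).items():
            key = "nan" if pd.isna(value) else value
            counts[key] += int(n)
    return dict(counts)


chunks = [
    pd.DataFrame({"deck": ["704", "704", None]}),
    pd.DataFrame({"deck": ["201", "704"]}),
]
result = count_unique(chunks, "deck")
```

Accumulating into a `Counter` means only one chunk has to be held in memory at a time.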
- cdm_reader_mapper.validate_datetime(data, imodel, blank=False, log_level='INFO')[source]¶
Validate datetime columns in a dataset according to the specified model.
- Parameters:
data (
pd.DataFrame,pd.Series, orIterable[pd.DataFrame,pd.Series]) – Input dataset or series containing ID values.imodel (
str) – Name of internally available data model, e.g., “icoads_r300_d201”.blank (
bool, optional) – If True, empty values are considered valid. Default is False.log_level (
str, optional) – Logging level. Default is “INFO”.
- Return type:
- Returns:
pd.SeriesorNone– Boolean Series indicating whether each ID is valid. Returns None if validation cannot be performed due to missing data, columns, or deck definitions.- Raises:
TypeError – If data is not a pd.DataFrame or a pd.Series or an Iterable[pd.DataFrame | pd.Series].
ValueError – If no columns found for datetime conversion.
- cdm_reader_mapper.validate_id(data, imodel, blank=False, log_level='INFO')[source]¶
Validate ID column(s) in a dataset against deck-specific patterns.
- Parameters:
data (
pd.DataFrame,pd.Series, orIterable[pd.DataFrame,pd.Series]) – Input dataset or series containing ID values.imodel (
str) – Name of internally available data model, e.g., “icoads_r300_d201”.blank (
bool, optional) – If True, empty values are considered valid. Default is False.log_level (
str, optional) – Logging level. Default is “INFO”.
- Return type:
- Returns:
pd.SeriesorNone– Boolean Series indicating whether each ID is valid. Returns None if validation cannot be performed due to missing data, columns, or deck definitions.- Raises:
TypeError – If data is not a pd.DataFrame or a pd.Series or an Iterable[pd.DataFrame | pd.Series].
ValueError – If dataset imodel has no deck information. If no ID conversion columns are found. If the input deck is not defined in the ID library files.
FileNotFoundError – If dataset imodel has no ID deck library.
Notes
Uses _get_id_col to determine which column(s) contain IDs.
Uses _get_patterns to get regex patterns for the deck.
Empty values match “^$” pattern if blank=True.
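The pattern matching described in the notes, including the treatment of empty values when blank=True, can be sketched with the standard `re` module. The patterns and the helper `is_valid_id` below are hypothetical and chosen only for illustration; the real patterns come from the deck-specific ID library files.

```python
import re

# Hypothetical patterns for illustration; real patterns come from the
# deck-specific ID library files of the data model.
patterns = [r"^[A-Z]{4}$", r"^\d{5}$"]


def is_valid_id(value, patterns, blank=False):
    """Return True if `value` matches at least one of the patterns."""
    active = list(patterns)
    if blank:
        # Accept empty strings by adding the "^$" pattern.
        active.append("^$")
    return any(re.match(p, value) for p in active)


ids = ["ABCD", "12345", "", "abc"]
strict = [is_valid_id(i, patterns) for i in ids]
lenient = [is_valid_id(i, patterns, blank=True) for i in ids]
```

The only difference between `strict` and `lenient` is the empty string, which is accepted once the "^$" pattern is added.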
- cdm_reader_mapper.write(data, mode='data', **kwargs)[source]¶
Write either MDF data or CDM tables to disk.
- Parameters:
data (
pandas.DataFrameorIterable[pd.DataFrame]) – Data to export.mode (
str,{data, tables}, default:data) –Write data mode:
“data” to write MDF data to disk
“tables” to write CDM tables to disk. Map MDF data to CDM tables with
DataBundle.map_model().
**kwargs (
Any) – Additional key-word arguments used to write data on disk.
See also
write_dataWrite MDF data and validation mask to disk.
write_tablesWrite CDM tables to disk.
readRead either original marine-meteorological data or MDF data or CDM tables from disk.
read_mdfRead original marine-meteorological data from disk.
read_dataRead MDF data and validation mask from disk.
read_tablesRead CDM tables from disk.
- Return type:
Notes
kwargs are the keyword arguments for the mode-specific writer.
- cdm_reader_mapper.write_data(data, mask=None, data_format='parquet', dtypes=None, parse_dates=False, encoding='utf-8', out_dir='.', prefix=None, suffix=None, extension=None, filename=None, separator='_', col_subset=None, delimiter=',', **kwargs)[source]¶
Write pandas.DataFrame to MDF file on file system.
- Parameters:
data (
pandas.DataFrameorIterable[pd.DataFrame]) – Data to export.mask (
pandas.DataFrameorIterable[pd.DataFrame], optional) – Validation mask to export.data_format (
{"csv", "parquet", "feather"}, default:"parquet") – Format of output data file(s).dtypes (
dict, optional) – Dictionary of data types on data. Dump dtypes and parse_dates to json information file.parse_dates (
list | bool, default:False) – Information on how to parse dates in data. Dump dtypes and parse_dates to json information file. For more information see pandas.read_csv().encoding (
str, default:"utf-8") – A string representing the encoding to use in the output file, defaults to utf-8.out_dir (
str, default:".") – Path to the output directory.prefix (
str, optional) – Prefix of file name structure: <prefix>-data-*<suffix>.<extension>.suffix (
str, optional) – Suffix of file name structure: <prefix>-data-*<suffix>.<extension>.extension (
str, optional) – Extension of file name structure: <prefix>-data-*<suffix>.<extension>. By default, extension depends on data_format.filename (
strordict, optional) – Name of the output file name(s). List one filename for both data and mask ({“data”:<filenameD>, “mask”:<filenameM>}). By default, automatically create file name from table name, prefix and suffix.separator (
str, optional) – Separator to join the file name pattern components (default “_”).col_subset (
str,tupleorlist, optional) – Specify the section or sections of the file to write.For multiple sections of the tables: e.g col_subset = [columns0,…,columnsN]
For a single section: e.g. list type object col_subset = [columns]
Column labels can be either strings or tuples.
delimiter (
str, default:",") – Character or regex pattern to treat as the delimiter while writing with df.to_csv.**kwargs (
Any) – Additional keyword-arguments passed to to_csv when data_format is ‘csv’.
- Raises:
ValueError – If data_format is not one of ‘csv’, ‘parquet’ or ‘feather’. If type of data and type of mask do not match.
See also
writeWrite either MDF data or CDM tables to disk.
write_tablesWrite CDM tables to disk.
readRead either original marine-meteorological data or MDF data or CDM tables from disk.
read_dataRead MDF data and validation mask from disk.
read_mdfRead original marine-meteorological data from disk.
read_tablesRead CDM tables from disk.
Notes
Use this function after reading MDF data.
- Return type:
- cdm_reader_mapper.write_tables(data, data_format='parquet', out_dir=None, prefix=None, suffix=None, extension=None, filename=None, separator='-', cdm_subset=None, col_subset=None, delimiter='|', encoding='utf-8', from_str=None, to_str=None, imodel=None, **kwargs)[source]¶
Write pandas.DataFrame to CDM-table file on file system.
- Parameters:
data (
pandas.DataFrame) – Data to export.data_format (
{"csv", "parquet", "feather"}, default:"parquet") – Format of output data file(s).out_dir (
str, optional) – Path to the output directory. Defaults to current directory.prefix (
str, optional) – Prefix of file name structure:<prefix><separator><table><separator>*<suffix>.<extension>.suffix (
str, optional) – Suffix of file name structure:<prefix><separator><table><separator>*<suffix>.<extension>.extension (
str, optional) – Extension of file name structure:<prefix><separator><table><separator>*<suffix>.<extension>.filename (
str,Path-likeordict, optional) – Name of the output file(s). List one filename for each table name in data ({<table>:<filename>}). If None, automatically create file name from table name, prefix and suffix.separator (
str, optional) – Separator of file name structure:<prefix><separator><table><separator>*<suffix>.<extension>.cdm_subset (
strorlistofstr, optional) – Specifies a subset of tables or a single table.For multiple subsets of tables: This function returns a pandas.DataFrame that is multi-index at the columns, with (table-name, field) as column names. Tables are merged via the report_id field.
For a single table: This function returns a pandas.DataFrame with a simple indexing for the columns.
col_subset (
str,listordict, optional) – Specify the section or sections of the file to write.For multiple sections of the tables: e.g col_subset = {table0:[columns0],…tableN:[columnsN]}
For a single section: e.g. list type object col_subset = [columns]. This variable assumes that all column names conform to the CDM field names.
delimiter (
str, default:"|") – Character or regex pattern to treat as the delimiter while writing with df.to_csv. This is only relevant if data_format is “csv”.encoding (
str) – A string representing the encoding to use in the output file, defaults to utf-8. This is only relevant if data_format is “csv”.from_str (
bool, optional) – If True convert original string data to imodel-specific data types.to_str (
bool, optional) – If True convert original imodel-specific data types to strings.imodel (str, optional) – Name of data model, e.g. icoads. Must be set if either from_str or to_str is set.
**kwargs (
Any) – Additional keyword-arguments that will be ignored.
See also
writeWrite either MDF data or CDM tables to disk.
write_dataWrite MDF data and validation mask to disk.
readRead either original marine-meteorological data or MDF data or CDM tables from disk.
read_tablesRead CDM tables from disk.
read_dataRead MDF data and validation mask from disk.
read_mdfRead original marine-meteorological data from disk.
- Return type:
Notes
Use this function after reading CDM tables.
kwargs will be ignored!
Subpackages¶
- cdm_reader_mapper.cdm_mapper package
- Subpackages
- Submodules
- cdm_reader_mapper.cdm_mapper.mapper module
- cdm_reader_mapper.cdm_mapper.properties module
- cdm_reader_mapper.cdm_mapper.reader module
- cdm_reader_mapper.cdm_mapper.writer module
- cdm_reader_mapper.common package
count_by_cat()get_filename()get_length()load_file()replace_columns()split_by_boolean()split_by_boolean_false()split_by_boolean_true()split_by_column_entries()split_by_index()- Submodules
- cdm_reader_mapper.common.getting_files module
- cdm_reader_mapper.common.inspect module
- cdm_reader_mapper.common.io_files module
- cdm_reader_mapper.common.iterators module
- cdm_reader_mapper.common.json_dict module
- cdm_reader_mapper.common.logging_hdlr module
- cdm_reader_mapper.common.replace module
- cdm_reader_mapper.common.select module
- cdm_reader_mapper.core package
- Submodules
- cdm_reader_mapper.core._utilities module
- cdm_reader_mapper.core.databundle module
DataBundleDataBundle.add()DataBundle.columnsDataBundle.copy()DataBundle.correct_datetime()DataBundle.correct_pt()DataBundle.dataDataBundle.dtypesDataBundle.duplicate_check()DataBundle.encodingDataBundle.flag_duplicates()DataBundle.get_duplicates()DataBundle.imodelDataBundle.map_model()DataBundle.maskDataBundle.modeDataBundle.parse_datesDataBundle.remove_duplicates()DataBundle.replace_columns()DataBundle.select_where_all_false()DataBundle.select_where_all_true()DataBundle.select_where_entry_isin()DataBundle.select_where_index_isin()DataBundle.split_by_boolean_false()DataBundle.split_by_boolean_true()DataBundle.split_by_column_entries()DataBundle.split_by_index()DataBundle.stack_h()DataBundle.stack_v()DataBundle.unique()DataBundle.validate_datetime()DataBundle.validate_id()DataBundle.write()
- cdm_reader_mapper.core.reader module
- cdm_reader_mapper.core.writer module
- cdm_reader_mapper.data package
- cdm_reader_mapper.duplicates package
- cdm_reader_mapper.mdf_reader package
- cdm_reader_mapper.metmetpy package
correct_datetime()correct_pt()validate_datetime()validate_id()- Subpackages
- Submodules
- cdm_reader_mapper.metmetpy.correct module
- cdm_reader_mapper.metmetpy.properties module
- cdm_reader_mapper.metmetpy.validate module
Submodules¶
cdm_reader_mapper.properties module¶
Common Data Model (CDM) reader and mapper common properties.