cdm_reader_mapper.cdm_mapper package

Common Data Model (CDM) mapper package.

Subpackages

Submodules

cdm_reader_mapper.cdm_mapper.mapper module

Map Common Data Model (CDM).

Created on Thu Apr 11 13:45:38 2019

Maps data contained in a pandas DataFrame (or Iterable[pd.DataFrame]) to the C3S Climate Data Store Common Data Model (CDM) header and observational tables using the mapping information available in the tool’s mapping library for the input data model.

@author: iregon

cdm_reader_mapper.cdm_mapper.mapper.map_model(data, imodel, cdm_subset=None, codes_subset=None, cdm_complete=True, drop_missing_obs=True, drop_duplicates=True, log_level='INFO')

Map a pandas DataFrame to the CDM header and observational tables.

Parameters:

data (pandas.DataFrame or Iterable[pd.DataFrame]) – Input data to map.
imodel (str) – Name of a specific mapping from a generic data model to the CDM, e.g. icoads_r300_d704, which maps a SID-DCK from IMMA1’s core and attachments to the CDM in a specific way.
cdm_subset (str or list, optional) – Subset of CDM model tables to map. Defaults to the full set of CDM tables defined for the imodel.
codes_subset (str or list, optional) – Subset of code mapping tables to map. Defaults to the full set of code mapping tables defined for the imodel.
cdm_complete (bool, default: True) – If True, map the entire list of CDM tables.
drop_missing_obs (bool, default: True) – If True, drop observations without a valid observation value, e.g. no air_temperature value.
drop_duplicates (bool, default: True) – If True, drop duplicated rows.
log_level (str, default: INFO) – Level of logging information to save.

Return type:

DataFrame | ParquetStreamReader

Returns:

cdm_tables (pandas.DataFrame) – DataFrame with MultiIndex columns (cdm_table, column_name).

Raises:

ValueError –
- If imodel is not defined.
- If the first split entry (‘_’) of imodel is not defined.
- If mapping does not return a DataFrame.
TypeError –
- If the type of imodel is not supported.
- If anything fails during mapping.
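
The imodel checks listed under Raises can be pictured with a minimal sketch. The helper name check_imodel and the SUPPORTED_MODELS set are hypothetical and not part of the package; only the split on "_" and the exception types follow the description above.

```python
# Hypothetical set of known data models; the real package defines its own.
SUPPORTED_MODELS = {"icoads"}


def check_imodel(imodel):
    """Return the parts of an imodel name such as 'icoads_r300_d704'."""
    if imodel is None:
        raise ValueError("imodel is not defined")
    if not isinstance(imodel, str):
        raise TypeError(f"unsupported imodel type: {type(imodel).__name__}")
    parts = imodel.split("_")
    # The first entry names the generic data model, e.g. 'icoads'.
    if parts[0] not in SUPPORTED_MODELS:
        raise ValueError(f"data model {parts[0]!r} is not defined")
    return parts


parts = check_imodel("icoads_r300_d704")
```

Here the first entry of parts is the generic data model; the remaining entries select the specific mapping.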

cdm_reader_mapper.cdm_mapper.properties module

Common Data Model (CDM) mapper properties.

cdm_reader_mapper.cdm_mapper.reader module

Read Common Data Model (CDM) mapping tables.

Created on Thu Apr 11 13:45:38 2019

Reads files with the CDM table format from a file system to a pandas.DataFrame.

All CDM fields are read as objects. Null values are read with the specified null value in the table files, or as NaN if the na_values argument is set to the specific null value in the file.

Reads the full set of files (default), a subset or a single table, as controlled by cdm_subset:

When reading multiple tables, the resulting dataframe is multi-indexed in
the columns, with (table-name, field) as column names. Merging of tables occurs on the report_id field.

When reading a single table, the resulting dataframe has simple indexing
in the columns.

Reads the full set of fields (default) or a subset of it, as controlled by param col_subset:

When reading multiple tables (default or subset), the col_subset is a
dictionary like: col_subset = {table0:[columns],…tablen:[columns]}
If a table is not specified in col_subset, all its fields are read.

When reading a single table, the col_subset is a list like:
col_subset = [columns]
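
The two col_subset shapes can be illustrated with a small sketch. The helper columns_for is hypothetical and is not the package's implementation; apart from report_id, the table and field names are only examples.

```python
def columns_for(table, all_columns, col_subset=None, single_table=False):
    """Pick the columns to read for one table (hypothetical helper)."""
    if col_subset is None:
        # No subset requested: read the full set of fields.
        return list(all_columns)
    if single_table:
        # Single table: col_subset is a plain list of field names.
        wanted = col_subset
    else:
        # Multiple tables: col_subset maps table name -> field names.
        # A table missing from the mapping keeps all of its fields.
        wanted = col_subset.get(table, all_columns)
    return [column for column in all_columns if column in wanted]


header_fields = ["report_id", "latitude", "longitude"]
obs_fields = ["report_id", "observation_value"]
subset = {"header": ["report_id", "latitude"]}

header_selected = columns_for("header", header_fields, subset)
obs_selected = columns_for("observations-at", obs_fields, subset)
single_selected = columns_for("header", header_fields, ["latitude"], single_table=True)
```

In this sketch the header table is restricted to two fields, while the observations table, which is absent from the dictionary, keeps all of its fields.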

It is assumed that all column names conform to the CDM field names.

The full table set (header, observations-“*”) is assumed to be in the same directory.

Filenames for tables are assumed to be: tableName-<tb_id>.<extension>
with:
- tableName: a valid table name, as declared in properties.cdm_tables
- tb_id: any identifier, including wildcards if required
- extension: defaults to ‘psv’
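
As a rough illustration of this naming convention, the sketch below builds the pattern for one table and matches it against a list of file names with the standard-library fnmatch module. The helper table_pattern and the sample file names are hypothetical; the package may resolve files differently.

```python
from fnmatch import fnmatch


def table_pattern(table, tb_id="*", extension="psv"):
    """Build the assumed file name pattern tableName-<tb_id>.<extension>."""
    return f"{table}-{tb_id}.{extension}"


# Made-up file names, used only to exercise the pattern.
files = [
    "header-2010-01.psv",
    "observations-at-2010-01.psv",
    "header-2010-01.csv",
]

pattern = table_pattern("header")
matches = [name for name in files if fnmatch(name, pattern)]
```

With the default wildcard for tb_id, only the header file with the psv extension is selected.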

When specifying a subset of tables, valid names are those in properties.cdm_tables

@author: iregon

cdm_reader_mapper.cdm_mapper.reader.read_tables(source, data_format='parquet', prefix=None, suffix=None, extension=None, separator='-', cdm_subset=None, col_subset=None, delimiter='|', na_values=None, null_label='null', from_str=None, to_str=None, imodel=None, **kwargs)

Read CDM-table-like files from file system to a pandas.DataFrame.

Parameters:

source (str) – The file (including path) or the path to the file(s) to be read.
data_format ({"csv", "parquet", "feather"}, default: "parquet") – Format of input data file(s).
prefix (str, optional) – Prefix of file name structure: <prefix>-<table>-*<suffix>.<extension>. Can be used if source is a valid directory path.
suffix (str, optional) – Suffix of file name structure: <prefix>-<table>-*<suffix>.<extension>. Can be used if source is a valid directory path.
extension (str, optional) – Extension of file name structure: <prefix>-<table>-*<suffix>.<extension>. Can be used if source is a valid directory path.
separator (str, default: -) – Separator to join the file name pattern components.
cdm_subset (str or list, optional) – Specifies a subset of tables or a single table.
- For multiple subsets of tables: This function returns a pandas.DataFrame that is multi-index at the columns, with (table-name, field) as column names. Tables are merged via the report_id field.
- For a single table: This function returns a pandas.DataFrame with a simple indexing for the columns.
Required if source is a valid file name.
col_subset (str, list or dict, optional) – Specify the section or sections of the file to read.
- For multiple sections of the tables: e.g col_subset = {table0:[columns0],…tableN:[columnsN]}
- For a single section: e.g. list type object col_subset = [columns]
This variable assumes that all column names conform to the CDM field names.
delimiter (str, default: |) – Character or regex pattern to treat as the delimiter while reading with pandas.read_csv.
na_values (hashable, iterable of hashable or dict of {Hashable: Iterable}, optional) – Additional strings to recognize as Na/NaN while reading input file with pandas.read_csv. For more details see: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
null_label (str, default: null) – String used to label invalid values in the data.
from_str (bool, optional) – If True convert original string data to imodel-specific data types.
to_str (bool, optional) – If True convert original imodel-specific data types to strings.
imodel (str, optional) – Name of data model, e.g. icoads. Must be set if either from_str or to_str is set.
**kwargs (Any) – Additional keyword arguments passed to the data reader.

Return type:

DataBundle

Returns:

cdm_reader_mapper.DataBundle – DataBundle instance containing successfully read CDM table(s).

cdm_reader_mapper.cdm_mapper.writer module

Write Common Data Model (CDM) mapping tables.

Created on Thu Apr 11 13:45:38 2019

Exports tables written in the C3S Climate Data Store Common Data Model (CDM) format to ASCII files. The table format is contained in a Python dictionary, stored as an attribute in a pandas.DataFrame (or Iterable[pd.DataFrame]).

This module uses a set of printer functions to “print” element values to a string object before exporting them to a final ascii file.

Each CDM table element has a data type (pseudo-SQL, as defined in the CDM documentation) which determines which printer function is used.

Numeric data types are printed with a specific number of decimal places, defined in the data element attributes. This can vary according to each CDM element, imodel and mapping .json file. If this is not defined in the input attributes of the imodel, the number of decimal places comes from the tool default defined in properties.py.
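
The printer idea can be sketched as a lookup from data type to formatting function. Everything named below (the type keys, the function names, and the default of 5 decimal places) is an assumption made for illustration and is not taken from the package.

```python
# Hypothetical default; the text says the real one is defined in properties.py.
DEFAULT_DECIMAL_PLACES = 5


def print_numeric(value, decimal_places=None):
    """Format a number with a fixed count of decimal places."""
    places = DEFAULT_DECIMAL_PLACES if decimal_places is None else decimal_places
    return f"{value:.{places}f}"


def print_integer(value):
    """Format a value as an integer string."""
    return str(int(value))


def print_varchar(value):
    """Format a value as plain text."""
    return str(value)


# Map pseudo-SQL type names to printer functions (keys are illustrative).
PRINTERS = {
    "numeric": print_numeric,
    "int": print_integer,
    "varchar": print_varchar,
}


def print_element(value, data_type, decimal_places=None):
    """Choose the printer from the element's data type and apply it."""
    printer = PRINTERS[data_type]
    if data_type == "numeric":
        # Only numeric printers take the decimal places attribute.
        return printer(value, decimal_places)
    return printer(value)


with_attribute = print_element(12.3456, "numeric", decimal_places=2)
with_default = print_element(1.5, "numeric")
```

In this sketch an element that carries a decimal places attribute uses it, and any other numeric element falls back to the module-level default.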

@author: iregon

cdm_reader_mapper.cdm_mapper.writer.write_tables(data, data_format='parquet', out_dir=None, prefix=None, suffix=None, extension=None, filename=None, separator='-', cdm_subset=None, col_subset=None, delimiter='|', encoding='utf-8', from_str=None, to_str=None, imodel=None, **kwargs)

Write pandas.DataFrame to CDM-table file on file system.

Parameters:

data (pandas.DataFrame) – Data to export.
data_format ({"csv", "parquet", "feather"}, default: "parquet") – Format of output data file(s).
out_dir (str, optional) – Path to the output directory. Defaults to current directory.
prefix (str, optional) – Prefix of file name structure: <prefix><separator><table><separator>*<suffix>.<extension>.
suffix (str, optional) – Suffix of file name structure: <prefix><separator><table><separator>*<suffix>.<extension>.
extension (str, optional) – Extension of file name structure: <prefix><separator><table><separator>*<suffix>.<extension>.
filename (str, Path-like or dict, optional) – Name of the output file name(s). List one filename for each table name in data ({<table>:<filename>}). If None, automatically create file name from table name, prefix and suffix.
separator (str, optional) – Separator of file name structure: <prefix><separator><table><separator>*<suffix>.<extension>.
cdm_subset (str or list of str, optional) – Specifies a subset of tables or a single table.
- For multiple subsets of tables: This function returns a pandas.DataFrame that is multi-index at the columns, with (table-name, field) as column names. Tables are merged via the report_id field.
- For a single table: This function returns a pandas.DataFrame with a simple indexing for the columns.
col_subset (str, list or dict, optional) – Specify the section or sections of the file to write.
- For multiple sections of the tables: e.g col_subset = {table0:[columns0],…tableN:[columnsN]}
- For a single section: e.g. list type object col_subset = [columns]
This variable assumes that all column names conform to the CDM field names.
delimiter (str, default: "|") – Character to use as the delimiter while writing with df.to_csv. This is only relevant if data_format is “csv”.
encoding (str) – A string representing the encoding to use in the output file, defaults to utf-8. This is only relevant if data_format is “csv”.
from_str (bool, optional) – If True convert original string data to imodel-specific data types.
to_str (bool, optional) – If True convert original imodel-specific data types to strings.
imodel (str, optional) – Name of data model, e.g. icoads. Must be set if either from_str or to_str is set.
**kwargs (Any) – Additional keyword-arguments that will be ignored.