Generating a data model for CLIWOC¶
The purpose of this notebook is to demonstrate the structure of data models used by the cdm_reader_mapper toolbox.
ICOADS IMMA¶
A common format for marine observational records is the ICOADS IMMA format. This is a text format in which each line contains the data (including metadata) for an individual record. The format is attachment-based: each record is constructed from a selection of (typically) fixed-width sections, called attachments, each containing a different subset of the data or metadata associated with the record. Documentation on the format and the available attachments can be found at https://icoads.noaa.gov/e-doc/imma/R3.0-imma1.pdf.
Records within the same file can contain different attachments, meaning that the IMMA format is not a fixed-width format, as line lengths will vary between records. Each record, however, must contain a certain subset of the attachments (in this case the core (or c0), c1, and c98 attachments).
Supplementary Data¶
Additional data or metadata can be provided in the c99 attachment. This attachment is not fixed-width as different sources or decks can provide different collections of supplementary data.
CLIWOC¶
In this example we use a subset of ICOADS release 3.0.0 IMMA formatted data for deck 730, which is data from the Climatological Database for the World’s Oceans (CLIWOC). There is a large amount of supplementary data available in the c99 attachment, which for deck 730 can be split into multiple sections. Here, we will start with the standard schema for the ICOADS IMMA format (included in cdm_reader_mapper as the "icoads" imodel), and extend the schema with fields for a subset of the c99 attachment. We will add fields for the logbook section of the c99 attachment for this deck.
An internal schema already exists for this deck ("icoads_r300_d730"); the purpose of this notebook is to demonstrate how one can extend the "icoads" data model to parse c99 data.
Overview¶
- An initial read of the data subset using the "icoads" data model, which does not parse the c99 attachment.
- Extension of the "icoads" schema to add fields for the logbook section of the c99 attachment for deck 730.
- Construction of a code table for a categorical field in the c99 attachment.
- Comparison with the internal schema for deck 730.
from __future__ import annotations
import json
import shutil
import warnings
import pandas as pd
from cdm_reader_mapper import read_mdf, test_data
from cdm_reader_mapper.mdf_reader.properties import _base as base
try:
    from importlib.resources import files as get_files
except ImportError:
    from importlib_resources import files as get_files
import pathlib
from collections import OrderedDict
from tempfile import TemporaryDirectory
warnings.filterwarnings("ignore")
The Data¶
For this example we load a subset of ICOADS data for deck 730 from the cdm_reader_mapper test data. This is the data that will be used throughout this notebook.
data_file_path = test_data.test_icoads_r300_d730["source"]
Initial Read¶
First we read the data using the basic "icoads" data model. This isn’t necessary for extending the schema; it simply highlights the raw c99 data.
data_bundle = read_mdf(data_file_path, imodel="icoads")
data_raw = data_bundle.data
WARNING:root:Unknown column_type 'object' for column '('c8', 'PUID')'
WARNING:root:Unknown column_type 'object' for column '('c95', 'ARCR')'
WARNING:root:Unknown column_type 'object' for column '('c96', 'ARCI')'
WARNING:root:Unknown column_type 'object' for column '('c97', 'ARCE')'
Supplementary (c99) data¶
By looking at the c99 section we can see that the supplementary data has not been parsed.
data_raw["c99"].head()
0 99 0 AGI ARCHIVO GENERAL DE INDIAS ...
1 99 0 CARAN CENTRE D'ACCUEIL ET DE RECHERCHE ...
2 99 0 RAZ RIJKSARCHIEF ZEELAND ...
3 99 0 NMM NATIONAL MARITIME MUSEUM ...
4 99 0 AGI ARCHIVO GENERAL DE INDIAS ...
Name: c99, dtype: object
data_raw["c99"].iloc[3]
'99 0 NMM NATIONAL MARITIME MUSEUM GREENWICH UNITED KINGDOM NMM ADM/L/R13 ENGLISH 0492500N 405000E 1 1BERMUDA LIZARD N87:17E 230 0 21771100112TUESDAY 12 RAINBOW BRITISH 5TH RATE RN THOMAS COLLINGWOOD CAPTAIN CHARLES WARREN 2ND OFFICER/LIEUTENANT BERMUDA SPITHEAD 0 17710829S25E 39.00 UNKNOWN UNKNOWN -22 LEAGUESNM 180 DEGREES ESE, E FRESH GALES AND SQUALLY 00000000CLIWOC VERSION 2.0'
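The record above begins with the "99 0 " prefix and is followed by fixed-width fields. As a minimal sketch of how fixed-width slicing works (this is not the cdm_reader_mapper implementation), the snippet below builds a synthetic, explicitly padded record and splits it using the first three field widths from the schema defined later in this notebook. The helper name split_fixed_width is hypothetical and introduced only for illustration.

```python
# Illustrative sketch only: slicing a fixed-width record by field widths.
# The widths are the first three from the c99 schema in this notebook;
# split_fixed_width is a hypothetical helper, not part of cdm_reader_mapper.
widths = {"sentinal": 5, "InstAbbr": 8, "InstName": 50}


def split_fixed_width(record, widths):
    """Return {field: stripped text} by walking the record left to right."""
    fields = {}
    start = 0
    for name, width in widths.items():
        fields[name] = record[start:start + width].strip()
        start += width
    return fields


# Build a synthetic record with explicit padding so the widths are exact.
record = "99 0 " + "NMM".ljust(8) + "NATIONAL MARITIME MUSEUM".ljust(50)
parsed = split_fixed_width(record, widths)
print(parsed["InstAbbr"], "|", parsed["InstName"])
```

Each field is padded to its full width, so trailing whitespace has to be stripped after slicing.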
Creating a data model¶
Custom Schema¶
To use a custom schema we need to use the ext_schema_path argument in read_mdf. The structure of the directory is:
name_of_model/
    name_of_model.json
    code_tables/
        ...
The code_tables sub-directory contains the code tables that map the key columns in the data to their values.
In this example we create a temporary directory for the data model, so that it is cleaned up after the notebook is finished; in reality you would want to store the data model in a permanent directory!
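As a small, self-contained sketch of the layout described above, the following creates the directory structure inside a temporary directory using only the standard library. The name "name_of_model" is the placeholder from the layout, and the schema file written here is an empty placeholder rather than a usable schema.

```python
# Illustrative sketch: create the expected data model layout in a temporary
# directory. "name_of_model" is the placeholder name used in the layout above.
import pathlib
from tempfile import TemporaryDirectory

with TemporaryDirectory() as tmp:
    model_dir = pathlib.Path(tmp) / "name_of_model"
    tables_dir = model_dir / "code_tables"
    tables_dir.mkdir(parents=True)  # creates name_of_model/ and code_tables/

    schema_file = model_dir / "name_of_model.json"
    schema_file.write_text("{}")  # empty placeholder, not a real schema

    # List everything below the model directory, relative to the temp root.
    layout = sorted(p.relative_to(tmp).as_posix() for p in model_dir.rglob("*"))
    print(layout)
```

Using TemporaryDirectory as a context manager removes the directory when the block exits, which is convenient for a demonstration but not for a model you intend to keep.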
We start from the basic "icoads" model. The c99 section will be based on the "icoads_r300_d730" schema and code tables.
Copy the "icoads" schema¶
First we create a copy of the "icoads" schema (located at mdf_reader/schemas/icoads/icoads.json). NOTE: cdm_reader_mapper.mdf_reader.properties._base is used so that we have a relative path to the original schema and code tables.
tmp_dir = TemporaryDirectory()
my_model_name = "cliwoc"
my_model_path = pathlib.Path(tmp_dir.name) / my_model_name
my_model_path.mkdir(exist_ok=True)
# Get a copy of the "icoads" schema
icoads_schema_path = get_files(f"{base}.schemas.icoads")
icoads_schema_path = pathlib.Path(icoads_schema_path) / "icoads.json"
my_schema_path = my_model_path / (my_model_name + ".json")
copy = shutil.copyfile(icoads_schema_path, my_schema_path)
Copy the code tables¶
We now copy each of the generic "icoads" code tables (located in mdf_reader/codes/icoads).
# Get code tables and copy to the directory
my_code_tables_path = my_model_path / "code_tables"
my_code_tables_path.mkdir(exist_ok=True)
# Original code table directory (general ICOADS code tables)
icoads_code_tables_path = get_files(f"{base}.codes.icoads")
# Get filenames for each of the code tables
code_table_files = list(icoads_code_tables_path.glob("ICOADS.*.json"))
# Copy each file
for file in code_table_files:
    basename = pathlib.Path(file).name
    out_path = my_code_tables_path / basename
    shutil.copyfile(file, out_path)
Extending the schema: CLIWOC logbook information¶
For this example we’ll load the schema into the environment as a dictionary (we use an ordered dictionary to guarantee that the ordering of the fields is maintained!).
with pathlib.Path(my_schema_path).open() as io:
    schema = json.load(io, object_pairs_hook=OrderedDict)
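To illustrate the effect of object_pairs_hook, here is a minimal sketch using a tiny stand-in JSON string rather than the real schema file. The section names in the string are only examples. Note that since Python 3.7 a plain dict also preserves insertion order, so OrderedDict mainly makes the intent explicit.

```python
import json
from collections import OrderedDict

# A tiny stand-in for a schema file; the section names are just examples.
text = '{"sections": {"core": {}, "c1": {}, "c98": {}, "c99": {}}}'

# Every JSON object is built as an OrderedDict, in file order.
schema_demo = json.loads(text, object_pairs_hook=OrderedDict)
section_names = list(schema_demo["sections"])
print(section_names)
```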
We now add the contents for section c99. There are some standard ("header") fields we need to supply. The "sentinal" is the prefix for the attachment; it appears in the raw supplementary data and identifies the start of the attachment.
We also need to specify the length of the attachment and the layout.
We then add our data fields to the elements field for the c99 section. We’ll add the fields for the logbook component of the CLIWOC supplementary data. There are additional components we could resolve, but we’ll keep to the logbook for this example.
schema["sections"]["c99"]["header"]["sentinal"] = "99 0 "
schema["sections"]["c99"]["header"]["disable_read"] = False
schema["sections"]["c99"]["header"]["field_layout"] = "fixed_width"
schema["sections"]["c99"]["header"]["length"] = 245 + 5 # Sentinal length
schema["sections"]["c99"]["elements"] = OrderedDict(
    {
        "sentinal": {
            "description": "attachment sentinal",
            "field_length": 5,
            "column_type": "str",
            "ignore": True,
        },
        "InstAbbr": {
            "description": "Abbreviation of the Institute storing the original data",
            "field_length": 8,
            "column_type": "str",
        },
        "InstName": {
            "description": "Full name of the Institute storing the original data",
            "field_length": 50,
            "column_type": "str",
        },
        "InstCity": {
            "description": "City where the Institute storing the data is located",
            "field_length": 10,
            "column_type": "str",
        },
        "InstCountry": {
            "description": "Country where the Institute storing the data is located",
            "field_length": 14,
            "column_type": "str",
        },
        "ArchiveID": {
            "description": "Administrative number under which the data is found within the Institute storing the data",
            "field_length": 15,
            "column_type": "str",
        },
        "ArchiveName": {
            "description": "Administrative name under which the data is found within the Institute storing the data",
            "field_length": 17,
            "column_type": "str",
        },
        "ArchivePart": {
            "description": "Part of the archive set in which the data is found within the Institute storing the data",
            "field_length": 39,
            "column_type": "str",
        },
        "ArchivePartSpec": {
            "description": "Specification of the part of the archive set in which the data is found within the Institute storing the data",
            "field_length": 31,
            "column_type": "str",
        },
        "LogbookID": {
            "description": "Identification Number of the logbook containing the data",
            "field_length": 30,
            "column_type": "str",
        },
        "LogbookLang": {
            "description": "Language of the logbook containing the data",
            "field_length": 7,
            "column_type": "str",
        },
        "ImageID": {
            "description": "Identification Number of the original image of the logbook",
            "field_length": 23,
            "column_type": "str",
        },
        "IllustrationAvail": {
            "description": "Illustration available on the current page of the logbook",
            "field_length": 1,
            "column_type": "key",
            "codetable": "CLIWOC_ILLUSTRATION_I",
        },
    }
)
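For a fixed-width section it seems sensible for the individual field lengths to add up to the section length given in the header. The following is a quick sanity check of the numbers used above, with the lengths copied by hand from the elements definition. Whether the reader itself enforces this agreement is an assumption not verified here.

```python
# Field lengths copied from the c99 elements defined above (sentinal first).
field_lengths = {
    "sentinal": 5,
    "InstAbbr": 8,
    "InstName": 50,
    "InstCity": 10,
    "InstCountry": 14,
    "ArchiveID": 15,
    "ArchiveName": 17,
    "ArchivePart": 39,
    "ArchivePartSpec": 31,
    "LogbookID": 30,
    "LogbookLang": 7,
    "ImageID": 23,
    "IllustrationAvail": 1,
}

header_length = 245 + 5  # logbook fields plus the sentinal, as in the header
total = sum(field_lengths.values())

# Fail loudly if a field length was mistyped.
assert total == header_length, f"{total} != {header_length}"
print(total)
```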
We can now write the dictionary to the schema file.
json_object = json.dumps(schema, indent=2)
with pathlib.Path(my_schema_path).open("w") as outfile:
    outfile.write(json_object)
IllustrationAvail Code Table¶
One of the fields we have added has a "column_type" of "key". This indicates categorical data, where each key maps to a longer descriptive value. We also specified a code table for this field, which describes that mapping. Let’s create that table now. As with the schema, it should be JSON-formatted.
For this field, we have two possible values. We save the dictionary to a JSON file in the code_tables directory. The name of the file must match the "codetable" value for the field (plus the ".json" extension).
illustration_avail_codes = {
"0": "No illustration on the current logbook page.",
"1": "Illustration available on the current logbook page.",
}
illustration_avail_path = my_code_tables_path / "CLIWOC_ILLUSTRATION_I.json"
json_object = json.dumps(illustration_avail_codes, indent=2)
with pathlib.Path(illustration_avail_path).open("w") as outfile:
    outfile.write(json_object)
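To make the key-to-description mapping concrete, the sketch below round-trips the same dictionary through JSON and looks up a few keys. The raw_values list is hypothetical, and this is only an illustration of the lookup, not a description of how read_mdf uses the table internally. JSON object keys are always strings, which is why the keys are written as "0" and "1".

```python
import json

# The same mapping as written to CLIWOC_ILLUSTRATION_I.json above.
code_table_json = json.dumps(
    {
        "0": "No illustration on the current logbook page.",
        "1": "Illustration available on the current logbook page.",
    }
)
code_table = json.loads(code_table_json)

# Hypothetical raw keys as they might appear in the 1-character field.
raw_values = ["0", "1", "0"]
decoded = [code_table.get(key, "unknown key") for key in raw_values]

for key, text in zip(raw_values, decoded):
    print(key, "->", text)
```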
Reading¶
We can now read the data file with the schema we have just created (copied…). We specify the path to the data model (the directory containing the schema json file) and the path to the code tables.
my_bundle = read_mdf(
    data_file_path,  # Path to the data file
    ext_schema_path=my_model_path,  # Path to the directory containing the schema json file
    ext_table_path=my_code_tables_path,  # Path to the directory containing the json code tables
)
my_data = my_bundle.data
ERROR:root:imodel is not defined.
Analysing the output¶
We can now investigate components of the c99 section.
my_data[["c99"]].head()
| | c99 | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | InstAbbr | InstName | InstCity | InstCountry | ArchiveID | ArchiveName | ArchivePart | ArchivePartSpec | LogbookID | LogbookLang | ImageID | IllustrationAvail |
| 0 | AGI | ARCHIVO GENERAL DE INDIAS | SEVILLE | SPAIN | NaN | None | NaN | NaN | CORREOS, 275A R11 | SPANISH | NaN | 0 |
| 1 | CARAN | CENTRE D'ACCUEIL ET DE RECHERCHE DES ARCH. NAT... | PARIS | FRANCE | NaN | None | NaN | NaN | COTE - 4/JJ/39 | FRENCH | NaN | 0 |
| 2 | RAZ | RIJKSARCHIEF ZEELAND | MIDDELBURG | NEDERLAND | 20 | None | MCC | 1391 | MCC_20_1391 | DUTCH | MCC_20_1391_0032 | 0 |
| 3 | NMM | NATIONAL MARITIME MUSEUM | GREENWICH | UNITED KINGDOM | NaN | None | NaN | NaN | NMM ADM/L/R13 | ENGLISH | NaN | 0 |
| 4 | AGI | ARCHIVO GENERAL DE INDIAS | SEVILLE | SPAIN | NaN | None | NaN | NaN | CORREOS, 193B R3 | SPANISH | NaN | 0 |
my_data[["c99"]].describe(include="all")
| | c99 | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | InstAbbr | InstName | InstCity | InstCountry | ArchiveID | ArchiveName | ArchivePart | ArchivePartSpec | LogbookID | LogbookLang | ImageID | IllustrationAvail |
| count | 5 | 5 | 5 | 5 | 1 | 0 | 1 | 1 | 5 | 5 | 1 | 5 |
| unique | 4 | 4 | 4 | 4 | 1 | 0 | 1 | 1 | 5 | 4 | 1 | 1 |
| top | AGI | ARCHIVO GENERAL DE INDIAS | SEVILLE | SPAIN | 20 | NaN | MCC | 1391 | CORREOS, 275A R11 | SPANISH | MCC_20_1391_0032 | 0 |
| freq | 2 | 2 | 2 | 2 | 1 | NaN | 1 | 1 | 1 | 2 | 1 | 5 |
Internal Schema¶
cdm_reader_mapper already includes a data model for the CLIWOC deck. The model parses all sections of supplementary data and provides all required code tables. Let’s now read in the data using the "icoads_r300_d730" model.
all_data = read_mdf(
    data_file_path,
    imodel="icoads_r300_d730",
)
WARNING:root:Unknown column_type 'object' for column '('c8', 'PUID')'
WARNING:root:Unknown column_type 'object' for column '('c95', 'ARCR')'
WARNING:root:Unknown column_type 'object' for column '('c96', 'ARCI')'
WARNING:root:Unknown column_type 'object' for column '('c97', 'ARCE')'
WARNING:root:Unknown column_type 'object' for column '('c99_sentinel', 'BLK')'
The c99 section has been split into multiple sections. There is no longer a c99 section in the output; instead we now have:
- c99_logbook
- c99_voyage
- c99_data
We can compare the c99_logbook section to the output of our model. We see that we have extracted the same data, although we chose different column names for the elements.
all_data.data[["c99_logbook"]].describe(include="all")
| | c99_logbook | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | InstAbbr | InstName | InstPlace | InstLand | NumArchiveSet | NameArchiveSet | ArchivePart | Specification | Logbook_id | Logbook_language | Image_No | Illustr |
| count | 5 | 5 | 5 | 5 | 1 | 0 | 1 | 1 | 5 | 5 | 1 | 5 |
| unique | 4 | 4 | 4 | 4 | 1 | 0 | 1 | 1 | 5 | 4 | 1 | 1 |
| top | AGI | ARCHIVO GENERAL DE INDIAS | SEVILLE | SPAIN | 20 | NaN | MCC | 1391 | CORREOS, 275A R11 | SPANISH | MCC_20_1391_0032 | 0 |
| freq | 2 | 2 | 2 | 2 | 1 | NaN | 1 | 1 | 1 | 2 | 1 | 5 |
my_data[["c99"]].describe(include="all")
| | c99 | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | InstAbbr | InstName | InstCity | InstCountry | ArchiveID | ArchiveName | ArchivePart | ArchivePartSpec | LogbookID | LogbookLang | ImageID | IllustrationAvail |
| count | 5 | 5 | 5 | 5 | 1 | 0 | 1 | 1 | 5 | 5 | 1 | 5 |
| unique | 4 | 4 | 4 | 4 | 1 | 0 | 1 | 1 | 5 | 4 | 1 | 1 |
| top | AGI | ARCHIVO GENERAL DE INDIAS | SEVILLE | SPAIN | 20 | NaN | MCC | 1391 | CORREOS, 275A R11 | SPANISH | MCC_20_1391_0032 | 0 |
| freq | 2 | 2 | 2 | 2 | 1 | NaN | 1 | 1 | 1 | 2 | 1 | 5 |
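The two summaries above differ only in their column names. A minimal sketch of the correspondence, pairing the columns by position as they appear in the two tables, is given below. The positional pairing is an assumption based on the matching summary statistics, not something documented by the library.

```python
# Column names from our custom model (c99) and from the internal
# "icoads_r300_d730" model (c99_logbook), in the order shown above.
my_columns = [
    "InstAbbr", "InstName", "InstCity", "InstCountry", "ArchiveID",
    "ArchiveName", "ArchivePart", "ArchivePartSpec", "LogbookID",
    "LogbookLang", "ImageID", "IllustrationAvail",
]
internal_columns = [
    "InstAbbr", "InstName", "InstPlace", "InstLand", "NumArchiveSet",
    "NameArchiveSet", "ArchivePart", "Specification", "Logbook_id",
    "Logbook_language", "Image_No", "Illustr",
]

# Pair the names by position (assumed correspondence).
rename_map = dict(zip(my_columns, internal_columns))

# Keep only the names that actually differ between the two models.
differing = {ours: theirs for ours, theirs in rename_map.items() if ours != theirs}
for ours, theirs in differing.items():
    print(f"{ours} -> {theirs}")
```

A mapping like this could, for example, be passed to a pandas rename call to align the two outputs before comparing them.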
Additional Sections¶
We can also look at the additional components we did not parse in our model.
We can note some remaining issues with the model as we look at the extra data. Most of the challenges relate to language translations.
pd.options.display.max_columns = None
all_data.data[["c99_voyage"]].describe(include="all")
| | c99_voyage | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | drLatDeg | drLatMin | drLatSec | drLatHem | drLonDeg | drLonMin | drLonSec | drLonHem | LatDeg | LatMin | LatSec | LatHem | LonDeg | LonMin | LonSec | LonHem | LatInd | LonInd | ZeroMeridian | LMname1 | LMdirection1 | LMdistance1 | LMname2 | LMdirection2 | LMdistance2 | LMname3 | LMdirection3 | LMdistance4 | PosCoastal | Calendar_type | logbook_date | TimeOB | Day_of_the_week | PartDay | Watch | Glasses | Start_day | ShipName | Nationality | Ship_type | Company | Name1 | Rank1 | Name2 | Rank2 | Name3 | Rank3 | voyage_from | voyage_to | Anchored_ind | AnchorPlace | DASno | VoyageIni | Course_ship | Ship_speed | Distance | EncName | EncNat |
| count | 4.000000 | 4.000000 | 4.0 | 4 | 2.000000 | 2.00000 | 2.0 | 2 | 2.000000 | 2.000000 | 2.0 | 2 | 3.000000 | 3.000000 | 3.0 | 3 | 5 | 5 | 5 | 1 | 1 | 1.0 | 0 | 0 | 0.0 | 0 | 0 | 0.0 | 5 | 5 | 5 | 5.0 | 1 | 1 | 1 | 1.0 | 5 | 5 | 5 | 4 | 2 | 4 | 4 | 1 | 1 | 0 | 0 | 5 | 5 | 5 | 0 | 0 | 5 | 2 | 0 | 4 | 0 | 0 |
| unique | NaN | NaN | NaN | 2 | NaN | NaN | NaN | 1 | NaN | NaN | NaN | 2 | NaN | NaN | NaN | 2 | 2 | 3 | 4 | 1 | 1 | NaN | 0 | 0 | NaN | 0 | 0 | NaN | 1 | 1 | 1 | NaN | 1 | 1 | 1 | <NA> | 2 | 5 | 4 | 4 | 2 | 4 | 3 | 1 | 1 | 0 | 0 | 5 | 4 | 1 | 0 | 0 | 5 | 2 | 0 | 4 | 0 | 0 |
| top | NaN | NaN | NaN | N | NaN | NaN | NaN | E | NaN | NaN | NaN | N | NaN | NaN | NaN | E | 1 | 2 | TENERIFE | LIZARD | N87:17E | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | 2 | 17711001 | NaN | TUESDAY | 3 | VM | <NA> | UNKNOWN | EL COLON | SPANISH | PAQUEBOTE | MCC | THOMAS D'ORVES | CAPITAN | CHARLES WARREN | 2ND OFFICER/LIEUTENANT | NaN | NaN | LA HABANA | LA CORUÑA | 0 | NaN | NaN | 17710819 | WTZ | NaN | 175.00 | NaN | NaN |
| freq | NaN | NaN | NaN | 3 | NaN | NaN | NaN | 2 | NaN | NaN | NaN | 1 | NaN | NaN | NaN | 2 | 3 | 2 | 2 | 1 | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 5 | 5 | 5 | NaN | 1 | 1 | 1 | <NA> | 4 | 1 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | NaN | NaN | 1 | 2 | 5 | NaN | NaN | 1 | 1 | NaN | 1 | NaN | NaN |
| mean | 27.250000 | 24.250000 | 0.0 | NaN | 26.500000 | 36.00000 | 0.0 | NaN | 22.000000 | 9.500000 | 0.0 | NaN | 121.666667 | 42.666667 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | 230.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 12.0 | NaN | NaN | NaN | 8.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| std | 21.884165 | 16.879475 | 0.0 | NaN | 19.091883 | 19.79899 | 0.0 | NaN | 29.698485 | 13.435029 | 0.0 | NaN | 195.208436 | 11.239810 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | NaN | NaN | NaN | <NA> | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| min | 1.000000 | 5.000000 | 0.0 | NaN | 13.000000 | 22.00000 | 0.0 | NaN | 1.000000 | 0.000000 | 0.0 | NaN | 4.000000 | 33.000000 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | 230.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 12.0 | NaN | NaN | NaN | 8.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 25% | 13.750000 | 17.000000 | 0.0 | NaN | 19.750000 | 29.00000 | 0.0 | NaN | 11.500000 | 4.750000 | 0.0 | NaN | 9.000000 | 36.500000 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | 230.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 12.0 | NaN | NaN | NaN | 8.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 50% | 29.500000 | 23.000000 | 0.0 | NaN | 26.500000 | 36.00000 | 0.0 | NaN | 22.000000 | 9.500000 | 0.0 | NaN | 14.000000 | 40.000000 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | 230.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 12.0 | NaN | NaN | NaN | 8.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 75% | 43.000000 | 30.250000 | 0.0 | NaN | 33.250000 | 43.00000 | 0.0 | NaN | 32.500000 | 14.250000 | 0.0 | NaN | 180.500000 | 47.500000 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | 230.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 12.0 | NaN | NaN | NaN | 8.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| max | 49.000000 | 46.000000 | 0.0 | NaN | 40.000000 | 50.00000 | 0.0 | NaN | 43.000000 | 19.000000 | 0.0 | NaN | 347.000000 | 55.000000 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | 230.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 12.0 | NaN | NaN | NaN | 8.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
all_data.data[["c99_voyage"]].c99_voyage.ZeroMeridian.head()
0 TENERIFE
1 GREENWICH
2 NL_0_01
3 BERMUDA
4 TENERIFE
Name: ZeroMeridian, dtype: object
Ship types and languages¶
For example, the ship types on this deck are given in several different languages. There is no code table for this variable on the CLIWOC website.
all_data.data[["c99_voyage"]].c99_voyage.Ship_type.dropna().head()
0 PAQUEBOTE
2 SNAUW
3 5TH RATE
4 PAQUEBOT
Name: Ship_type, dtype: object
all_data.data[["c99_data"]].c99_data.describe(include="all")
| | AT_reading_units | SST_reading_units | AP_reading_units | BART_reading_units | ReferenceCourse | ReferenceWindDirection | Decl | Distance_units | Distance_units_to_landmark | Distance_units_travelled | Longitude_units | units_of_measurement | humidity_units | water_at_pump_units | wind_scale | BARO_type | BARO_brand | API | Humidity_method | compas_error | compas_correction | AT_outside | SST | AP | wind_dir | current_dir | current_speed | attached_tem | pump_water | Humidity | wind_force | weather | prcp_descriptor | sea_state | shape_coulds | dir_coulds | Clearness | cloud_fraction | gusts | Rain | Fog | Snow | Thunder | Hail | Sea_ice | Trivial_correction | Release |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 0 | 0 | 0 | 0 | 2 | 5 | 5 | 0 | 1 | 4 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0 | 5 | 0 | 0 | 0.0 | 0 | 0 | 5 | 2 | 0 | 4 | 0 | 0 | 0 | 0 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 |
| unique | 0 | 0 | 0 | 0 | 1 | 1 | 5 | 0 | 1 | 4 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | 0 | 5 | 0 | 0 | NaN | 0 | 0 | 5 | 2 | 0 | 4 | 0 | 0 | 0 | 0 | 1 | 2 | 1 | 1 | 2 | 1 | 1 | 1 | 2 |
| top | NaN | NaN | NaN | NaN | UNKNOWN | UNKNOWN | -20 | NaN | LEAGUES | MILLAS | 360 DEGREES | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | S | NaN | NaN | NaN | NaN | NaN | EN REFREGONES FUERTES Y DESPUES BONANCIBLE | MUY MALOS CARICES. AGUACEROS, RELAMPAGOS Y TRU... | NaN | GRANDE DEL O Y DEL ENE | NaN | NaN | NaN | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | CLIWOC VERSION 2.0 |
| freq | NaN | NaN | NaN | NaN | 2 | 5 | 1 | NaN | 1 | 1 | 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1 | NaN | NaN | NaN | NaN | NaN | 1 | 1 | NaN | 1 | NaN | NaN | NaN | NaN | 5 | 4 | 5 | 5 | 4 | 5 | 5 | 5 | 4 |
| mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Wind force scales and languages¶
What about the different scales for the wind force, given different languages?
all_data.data[["c99_data"]].c99_data.wind_force.head()
0 EN REFREGONES FUERTES Y DESPUES BONANCIBLE
1 FOIBLE
2 STIJVE GEREEFDE MARSZEILSKOELTE
3 FRESH GALES AND SQUALLY
4 BONANCIBLE
Name: wind_force, dtype: object
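To see why these free-text values are hard to harmonise, it can help to group them by logbook language. The sketch below does this with values copied by hand from the outputs in this notebook, pairing the logbook language and wind force entries by row index (0 to 4). It uses only the standard library and does not attempt any translation.

```python
from collections import defaultdict

# Values copied from the outputs above (rows 0-4), paired by row index.
logbook_language = ["SPANISH", "FRENCH", "DUTCH", "ENGLISH", "SPANISH"]
wind_force = [
    "EN REFREGONES FUERTES Y DESPUES BONANCIBLE",
    "FOIBLE",
    "STIJVE GEREEFDE MARSZEILSKOELTE",
    "FRESH GALES AND SQUALLY",
    "BONANCIBLE",
]

# Collect the wind force terms recorded for each language.
terms_by_language = defaultdict(list)
for language, term in zip(logbook_language, wind_force):
    terms_by_language[language].append(term)

for language, terms in sorted(terms_by_language.items()):
    print(language, terms)
```

Any mapping of these terms onto a common wind force scale would need a separate vocabulary for each language.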