{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Generating a data model for CLIWOC\n", "\n", "The purpose of this notebook is to demonstrate the structure of data models used by the `cdm_reader_mapper` toolbox.\n", "\n", "## ICOADS IMMA\n", "\n", "A common format for marine observational records is the ICOADS IMMA format. This is a text format in which each line contains the data (including metadata) for an individual record. The format is _attachment_ based: each record is constructed from a selection of (typically) fixed-width sections (called attachments), each containing a different subset of the data or metadata associated with the record. Documentation on the format and the available attachments can be found at [https://icoads.noaa.gov/e-doc/imma/R3.0-imma1.pdf](https://icoads.noaa.gov/e-doc/imma/R3.0-imma1.pdf).\n", "\n", "Records within the same file can contain different attachments, so the IMMA format is not a fixed-width format: line lengths vary between records. Each record, however, must contain a certain subset of the attachments (in this case the `core` (or `c0`), `c1`, and `c98` attachments).\n", "\n", "## Supplementary Data\n", "\n", "Additional data or metadata can be provided in the `c99` attachment. This attachment is not fixed-width, as different sources or decks can provide different collections of supplementary data.\n", "\n", "## CLIWOC\n", "\n", "In this example we use a subset of ICOADS release 3.0.0 IMMA formatted data for deck 730, which is data from the Climatological Database for the World's Oceans (CLIWOC). There is a large amount of supplementary data available in the `c99` attachment, which for deck 730 can be split into multiple sections. Here, we will start with the standard schema for the ICOADS IMMA format (included in `cdm_reader_mapper` as the `\"icoads\"` `imodel`), and extend the schema with fields for a subset of the `c99` attachment. 
We will add fields for the _logbook_ section of the `c99` attachment for this deck.\n", "\n", "An internal schema already exists for this deck (`\"icoads_r300_d730\"`); the purpose of this notebook is to demonstrate how one can extend the `\"icoads\"` data model to parse `c99` data.\n", "\n", "## Overview\n", "\n", "* An initial read of the data subset using the `\"icoads\"` data model, which does not parse the `c99` attachment.\n", "* Extension of the `\"icoads\"` schema to add fields for the logbook section of the `c99` attachment for deck 730.\n", "* Construction of a code table for a categorical field in the `c99` attachment.\n", "* Comparison with the internal schema for deck 730." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from __future__ import annotations\n", "import json\n", "import shutil\n", "import warnings\n", "\n", "import pandas as pd\n", "\n", "from cdm_reader_mapper import read_mdf, test_data\n", "from cdm_reader_mapper.mdf_reader.properties import _base as base\n", "\n", "\n", "try:\n", "    from importlib.resources import files as get_files\n", "except ImportError:\n", "    from importlib_resources import files as get_files\n", "\n", "import pathlib\n", "from collections import OrderedDict\n", "from tempfile import TemporaryDirectory\n", "\n", "\n", "warnings.filterwarnings(\"ignore\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The Data\n", "\n", "For this example we load a subset of ICOADS data for deck 730 from the `cdm_reader_mapper` test data. This is the data that will be used throughout this notebook." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "data_file_path = test_data.test_icoads_r300_d730[\"source\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Initial Read\n", "\n", "First we read the data using the basic `\"icoads\"` data model. 
This isn't necessary for extending the schema, it is to highlight the raw `c99` data." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING:root:Unknown column_type 'object' for column '('c8', 'PUID')'\n", "WARNING:root:Unknown column_type 'object' for column '('c95', 'ARCR')'\n", "WARNING:root:Unknown column_type 'object' for column '('c96', 'ARCI')'\n", "WARNING:root:Unknown column_type 'object' for column '('c97', 'ARCE')'\n" ] } ], "source": [ "data_bundle = read_mdf(data_file_path, imodel=\"icoads\")\n", "data_raw = data_bundle.data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Supplementary (`c99`) data\n", "\n", "By looking at the `c99` section we can see that the supplementary data has not been parsed." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 99 0 AGI ARCHIVO GENERAL DE INDIAS ...\n", "1 99 0 CARAN CENTRE D'ACCUEIL ET DE RECHERCHE ...\n", "2 99 0 RAZ RIJKSARCHIEF ZEELAND ...\n", "3 99 0 NMM NATIONAL MARITIME MUSEUM ...\n", "4 99 0 AGI ARCHIVO GENERAL DE INDIAS ...\n", "Name: c99, dtype: object" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_raw[\"c99\"].head()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'99 0 NMM NATIONAL MARITIME MUSEUM GREENWICH UNITED KINGDOM NMM ADM/L/R13 ENGLISH 0492500N 405000E 1 1BERMUDA LIZARD N87:17E 230 0 21771100112TUESDAY 12 RAINBOW BRITISH 5TH RATE RN THOMAS COLLINGWOOD CAPTAIN CHARLES WARREN 2ND OFFICER/LIEUTENANT BERMUDA SPITHEAD 0 17710829S25E 39.00 UNKNOWN UNKNOWN -22 LEAGUESNM 180 DEGREES ESE, E FRESH GALES AND SQUALLY 00000000CLIWOC VERSION 2.0'" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_raw[\"c99\"].iloc[3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 
Creating a data model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Custom Schema\n", "\n", "To use a custom schema we pass the `ext_schema_path` argument to `read_mdf`. The directory it points to is structured as:\n", "\n", "```\n", "name_of_model/\n", "    name_of_model.json\n", "    code_tables/\n", "        ...\n", "```\n", "\n", "The `code_tables` sub-directory contains the code tables that map the key columns in the data to their values.\n", "\n", "In this example we create a temporary directory for the data model, so that it is cleaned up after the notebook is finished; in reality you would want to store the data model in a permanent directory!\n", "\n", "We start from the basic `\"icoads\"` model. The `c99` section will be based on the `\"icoads_r300_d730\"` schema and code tables.\n", "\n", "#### Copy the `\"icoads\"` schema\n", "\n", "First we create a copy of the `\"icoads\"` schema (located at `mdf_reader/schemas/icoads/icoads.json`). NOTE: `cdm_reader_mapper.mdf_reader.properties._base` is used so that we have a relative path to the original schema and code tables." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "tmp_dir = TemporaryDirectory()\n", "my_model_name = \"cliwoc\"\n", "my_model_path = pathlib.Path(tmp_dir.name) / my_model_name\n", "my_model_path.mkdir(exist_ok=True)\n", "\n", "# Get a copy of the \"icoads\" schema\n", "icoads_schema_dir = get_files(f\"{base}.schemas.icoads\")\n", "icoads_schema_path = pathlib.Path(icoads_schema_dir) / \"icoads.json\"\n", "\n", "my_schema_path = my_model_path / (my_model_name + \".json\")\n", "copy = shutil.copyfile(icoads_schema_path, my_schema_path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Copy the code tables\n", "\n", "We now copy each of the `\"icoads\"` code tables. These are the generic `icoads` code tables (located in `mdf_reader/codes/icoads`)."
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# Get code tables and copy to the directory\n", "my_code_tables_path = my_model_path / \"code_tables\"\n", "my_code_tables_path.mkdir(exist_ok=True)\n", "\n", "# Original code table directory (general ICOADS)\n", "icoads_code_tables_path = get_files(f\"{base}.codes.icoads\")\n", "\n", "# Get filenames for each of the code tables\n", "code_table_files = list(icoads_code_tables_path.glob(\"ICOADS.*.json\"))\n", "\n", "# Copy each file\n", "for file in code_table_files:\n", "    basename = pathlib.Path(file).name\n", "    out_path = my_code_tables_path / basename\n", "    shutil.copyfile(file, out_path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Extending the schema: CLIWOC logbook information\n", "\n", "For this example we'll load the schema into the environment as a dictionary (we use an ordered dictionary to guarantee that the ordering of the fields is maintained!)." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "with pathlib.Path(my_schema_path).open() as io:\n", "    schema = json.load(io, object_pairs_hook=OrderedDict)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now add the contents for section `c99`. There are some standard (\"header\") fields we need to supply. The `\"sentinal\"` is the prefix for the attachment; it is printed in the raw supplementary data and identifies the start of the attachment.\n", "\n", "We also need to specify the length of the attachment and the layout.\n", "\n", "We then add our data fields to the `elements` field for the `c99` section. We'll add the fields for the logbook component of the CLIWOC supplementary data; there are additional components we could resolve, but we'll keep it to the logbook for this example."
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "schema[\"sections\"][\"c99\"][\"header\"][\"sentinal\"] = \"99 0 \"\n", "schema[\"sections\"][\"c99\"][\"header\"][\"disable_read\"] = False\n", "schema[\"sections\"][\"c99\"][\"header\"][\"field_layout\"] = \"fixed_width\"\n", "schema[\"sections\"][\"c99\"][\"header\"][\"length\"] = 245 + 5  # 245 data characters + 5 for the sentinal\n", "schema[\"sections\"][\"c99\"][\"elements\"] = OrderedDict(\n", "    {\n", "        \"sentinal\": {\n", "            \"description\": \"attachment sentinal\",\n", "            \"field_length\": 5,\n", "            \"column_type\": \"str\",\n", "            \"ignore\": True,\n", "        },\n", "        \"InstAbbr\": {\n", "            \"description\": \"Abbreviation of the Institute storing the original data\",\n", "            \"field_length\": 8,\n", "            \"column_type\": \"str\",\n", "        },\n", "        \"InstName\": {\n", "            \"description\": \"Full name of the Institute storing the original data\",\n", "            \"field_length\": 50,\n", "            \"column_type\": \"str\",\n", "        },\n", "        \"InstCity\": {\n", "            \"description\": \"City where the Institute storing the data is located\",\n", "            \"field_length\": 10,\n", "            \"column_type\": \"str\",\n", "        },\n", "        \"InstCountry\": {\n", "            \"description\": \"Country where the Institute storing the data is located\",\n", "            \"field_length\": 14,\n", "            \"column_type\": \"str\",\n", "        },\n", "        \"ArchiveID\": {\n", "            \"description\": \"Administrative number under which the data is found within the Institute storing the data\",\n", "            \"field_length\": 15,\n", "            \"column_type\": \"str\",\n", "        },\n", "        \"ArchiveName\": {\n", "            \"description\": \"Administrative name under which the data is found within the Institute storing the data\",\n", "            \"field_length\": 17,\n", "            \"column_type\": \"str\",\n", "        },\n", "        \"ArchivePart\": {\n", "            \"description\": \"Part of the archive set in which the data is found within the Institute storing the data\",\n", "            \"field_length\": 39,\n", "            \"column_type\": \"str\",\n", "        },\n", "        \"ArchivePartSpec\": {\n", "            \"description\": \"Specification of the part of the archive set in which the data is found within the Institute storing the data\",\n", "            \"field_length\": 31,\n", "            \"column_type\": \"str\",\n", "        },\n", "        \"LogbookID\": {\n", "            \"description\": \"Identification Number of the logbook containing the data\",\n", "            \"field_length\": 30,\n", "            \"column_type\": \"str\",\n", "        },\n", "        \"LogbookLang\": {\n", "            \"description\": \"Language of the logbook containing the data\",\n", "            \"field_length\": 7,\n", "            \"column_type\": \"str\",\n", "        },\n", "        \"ImageID\": {\n", "            \"description\": \"Identification Number of the original image of the logbook\",\n", "            \"field_length\": 23,\n", "            \"column_type\": \"str\",\n", "        },\n", "        \"IllustrationAvail\": {\n", "            \"description\": \"Illustration available on the current page of the logbook\",\n", "            \"field_length\": 1,\n", "            \"column_type\": \"key\",\n", "            \"codetable\": \"CLIWOC_ILLUSTRATION_I\",\n", "        },\n", "    }\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now write the dictionary to the schema file." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "json_object = json.dumps(schema, indent=2)\n", "\n", "with pathlib.Path(my_schema_path).open(\"w\") as outfile:\n", "    outfile.write(json_object)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### `IllustrationAvail` Code Table\n", "\n", "One of the fields we have added (`IllustrationAvail`) has a `\"column_type\"` of `\"key\"`. This indicates categorical data, where the key value maps to a longer descriptive value. We also specified a code table for this field, which should describe that mapping. Let's create that table now. As with the schema, it should be JSON formatted.\n", "\n", "For this field, we have two possible values. We save the dictionary to a JSON file in the `code_tables` directory; the name of the file must match the `\"codetable\"` value for the field (plus the `\".json\"` extension)."
] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "illustration_avail_codes = {\n", " \"0\": \"No illustration on the current logbook page.\",\n", " \"1\": \"Illustration available on the current logbook page.\",\n", "}\n", "illustration_avail_path = my_code_tables_path / \"CLIWOC_ILLUSTRATION_I.json\"\n", "\n", "json_object = json.dumps(illustration_avail_codes, indent=2)\n", "\n", "with pathlib.Path(illustration_avail_path).open(\"w\") as outfile:\n", " outfile.write(json_object)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reading\n", "\n", "We can now read the data file with the schema we have just created (copied...). We specify the path to the data model (the directory containing the schema json file) and the path to the code tables." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "ERROR:root:imodel is not defined.\n" ] } ], "source": [ "my_bundle = read_mdf(\n", " data_file_path, # Path to the data file\n", " ext_schema_path=my_model_path, # Path to the directory containing the schema json file\n", " ext_table_path=my_code_tables_path, # Path to the directory containing the json code tables\n", ")\n", "my_data = my_bundle.data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Analysing the output\n", "\n", "We can now investigate components of the c99 section." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " c99 \\\n", " InstAbbr InstName InstCity \n", "0 AGI ARCHIVO GENERAL DE INDIAS SEVILLE \n", "1 CARAN CENTRE D'ACCUEIL ET DE RECHERCHE DES ARCH. NAT... PARIS \n", "2 RAZ RIJKSARCHIEF ZEELAND MIDDELBURG \n", "3 NMM NATIONAL MARITIME MUSEUM GREENWICH \n", "4 AGI ARCHIVO GENERAL DE INDIAS SEVILLE \n", "\n", " \\\n", " InstCountry ArchiveID ArchiveName ArchivePart ArchivePartSpec \n", "0 SPAIN NaN None NaN NaN \n", "1 FRANCE NaN None NaN NaN \n", "2 NEDERLAND 20 None MCC 1391 \n", "3 UNITED KINGDOM NaN None NaN NaN \n", "4 SPAIN NaN None NaN NaN \n", "\n", " \n", " LogbookID LogbookLang ImageID IllustrationAvail \n", "0 CORREOS, 275A R11 SPANISH NaN 0 \n", "1 COTE - 4/JJ/39 FRENCH NaN 0 \n", "2 MCC_20_1391 DUTCH MCC_20_1391_0032 0 \n", "3 NMM ADM/L/R13 ENGLISH NaN 0 \n", "4 CORREOS, 193B R3 SPANISH NaN 0 " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_data[[\"c99\"]].head()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " c99 \\\n", " InstAbbr InstName InstCity InstCountry ArchiveID \n", "count 5 5 5 5 1 \n", "unique 4 4 4 4 1 \n", "top AGI ARCHIVO GENERAL DE INDIAS SEVILLE SPAIN 20 \n", "freq 2 2 2 2 1 \n", "\n", " \\\n", " ArchiveName ArchivePart ArchivePartSpec LogbookID LogbookLang \n", "count 0 1 1 5 5 \n", "unique 0 1 1 5 4 \n", "top NaN MCC 1391 CORREOS, 275A R11 SPANISH \n", "freq NaN 1 1 1 2 \n", "\n", " \n", " ImageID IllustrationAvail \n", "count 1 5 \n", "unique 1 1 \n", "top MCC_20_1391_0032 0 \n", "freq 1 5 " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_data[[\"c99\"]].describe(include=\"all\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Internal Schema\n", "\n", "`cdm_reader_mapper` already includes a data model for the CLIWOC deck. The model parses all sections of supplementary data and provides all required code tables. Let's now read in the data using the `\"icoads_r300_d730\"` model." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING:root:Unknown column_type 'object' for column '('c8', 'PUID')'\n", "WARNING:root:Unknown column_type 'object' for column '('c95', 'ARCR')'\n", "WARNING:root:Unknown column_type 'object' for column '('c96', 'ARCI')'\n", "WARNING:root:Unknown column_type 'object' for column '('c97', 'ARCE')'\n", "WARNING:root:Unknown column_type 'object' for column '('c99_sentinel', 'BLK')'\n" ] } ], "source": [ "all_data = read_mdf(\n", " data_file_path,\n", " imodel=\"icoads_r300_d730\",\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `c99` section has been split into multiple sections. There is no `c99` section in the output, however we now have:\n", "\n", "* `c99_logbook`\n", "* `c99_voyage`\n", "* `c99_data`\n", "\n", "We can compare the `c99_logbook` section to the output of our model. 
We see that we have extracted the same data, although we chose different column names for the elements." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " c99_logbook \\\n", " InstAbbr InstName InstPlace InstLand \n", "count 5 5 5 5 \n", "unique 4 4 4 4 \n", "top AGI ARCHIVO GENERAL DE INDIAS SEVILLE SPAIN \n", "freq 2 2 2 2 \n", "\n", " \\\n", " NumArchiveSet NameArchiveSet ArchivePart Specification \n", "count 1 0 1 1 \n", "unique 1 0 1 1 \n", "top 20 NaN MCC 1391 \n", "freq 1 NaN 1 1 \n", "\n", " \n", " Logbook_id Logbook_language Image_No Illustr \n", "count 5 5 1 5 \n", "unique 5 4 1 1 \n", "top CORREOS, 275A R11 SPANISH MCC_20_1391_0032 0 \n", "freq 1 2 1 5 " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_data.data[[\"c99_logbook\"]].describe(include=\"all\")" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " c99 \\\n", " InstAbbr InstName InstCity InstCountry ArchiveID \n", "count 5 5 5 5 1 \n", "unique 4 4 4 4 1 \n", "top AGI ARCHIVO GENERAL DE INDIAS SEVILLE SPAIN 20 \n", "freq 2 2 2 2 1 \n", "\n", " \\\n", " ArchiveName ArchivePart ArchivePartSpec LogbookID LogbookLang \n", "count 0 1 1 5 5 \n", "unique 0 1 1 5 4 \n", "top NaN MCC 1391 CORREOS, 275A R11 SPANISH \n", "freq NaN 1 1 1 2 \n", "\n", " \n", " ImageID IllustrationAvail \n", "count 1 5 \n", "unique 1 1 \n", "top MCC_20_1391_0032 0 \n", "freq 1 5 " ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_data[[\"c99\"]].describe(include=\"all\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Additional Sections\n", "\n", "We can also look at the additional components we did not parse in our model.\n", "\n", "We can note some remaining issues with the model as we look at the extra data. Most of the challenges relate to language translations." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " c99_voyage \\\n", " drLatDeg drLatMin drLatSec drLatHem drLonDeg drLonMin drLonSec \n", "count 4.000000 4.000000 4.0 4 2.000000 2.00000 2.0 \n", "unique NaN NaN NaN 2 NaN NaN NaN \n", "top NaN NaN NaN N NaN NaN NaN \n", "freq NaN NaN NaN 3 NaN NaN NaN \n", "mean 27.250000 24.250000 0.0 NaN 26.500000 36.00000 0.0 \n", "std 21.884165 16.879475 0.0 NaN 19.091883 19.79899 0.0 \n", "min 1.000000 5.000000 0.0 NaN 13.000000 22.00000 0.0 \n", "25% 13.750000 17.000000 0.0 NaN 19.750000 29.00000 0.0 \n", "50% 29.500000 23.000000 0.0 NaN 26.500000 36.00000 0.0 \n", "75% 43.000000 30.250000 0.0 NaN 33.250000 43.00000 0.0 \n", "max 49.000000 46.000000 0.0 NaN 40.000000 50.00000 0.0 \n", "\n", " \\\n", " drLonHem LatDeg LatMin LatSec LatHem LonDeg LonMin \n", "count 2 2.000000 2.000000 2.0 2 3.000000 3.000000 \n", "unique 1 NaN NaN NaN 2 NaN NaN \n", "top E NaN NaN NaN N NaN NaN \n", "freq 2 NaN NaN NaN 1 NaN NaN \n", "mean NaN 22.000000 9.500000 0.0 NaN 121.666667 42.666667 \n", "std NaN 29.698485 13.435029 0.0 NaN 195.208436 11.239810 \n", "min NaN 1.000000 0.000000 0.0 NaN 4.000000 33.000000 \n", "25% NaN 11.500000 4.750000 0.0 NaN 9.000000 36.500000 \n", "50% NaN 22.000000 9.500000 0.0 NaN 14.000000 40.000000 \n", "75% NaN 32.500000 14.250000 0.0 NaN 180.500000 47.500000 \n", "max NaN 43.000000 19.000000 0.0 NaN 347.000000 55.000000 \n", "\n", " \\\n", " LonSec LonHem LatInd LonInd ZeroMeridian LMname1 LMdirection1 \n", "count 3.0 3 5 5 5 1 1 \n", "unique NaN 2 2 3 4 1 1 \n", "top NaN E 1 2 TENERIFE LIZARD N87:17E \n", "freq NaN 2 3 2 2 1 1 \n", "mean 0.0 NaN NaN NaN NaN NaN NaN \n", "std 0.0 NaN NaN NaN NaN NaN NaN \n", "min 0.0 NaN NaN NaN NaN NaN NaN \n", "25% 0.0 NaN NaN NaN NaN NaN NaN \n", "50% 0.0 NaN NaN NaN NaN NaN NaN \n", "75% 0.0 NaN NaN NaN NaN NaN NaN \n", "max 0.0 NaN NaN NaN NaN NaN NaN \n", "\n", " \\\n", " LMdistance1 LMname2 LMdirection2 LMdistance2 LMname3 LMdirection3 \n", "count 1.0 0 0 0.0 0 0 \n", "unique NaN 0 0 NaN 0 0 \n", 
"top NaN NaN NaN NaN NaN NaN \n", "freq NaN NaN NaN NaN NaN NaN \n", "mean 230.0 NaN NaN NaN NaN NaN \n", "std NaN NaN NaN NaN NaN NaN \n", "min 230.0 NaN NaN NaN NaN NaN \n", "25% 230.0 NaN NaN NaN NaN NaN \n", "50% 230.0 NaN NaN NaN NaN NaN \n", "75% 230.0 NaN NaN NaN NaN NaN \n", "max 230.0 NaN NaN NaN NaN NaN \n", "\n", " \\\n", " LMdistance4 PosCoastal Calendar_type logbook_date TimeOB \n", "count 0.0 5 5 5 5.0 \n", "unique NaN 1 1 1 NaN \n", "top NaN 0 2 17711001 NaN \n", "freq NaN 5 5 5 NaN \n", "mean NaN NaN NaN NaN 12.0 \n", "std NaN NaN NaN NaN 0.0 \n", "min NaN NaN NaN NaN 12.0 \n", "25% NaN NaN NaN NaN 12.0 \n", "50% NaN NaN NaN NaN 12.0 \n", "75% NaN NaN NaN NaN 12.0 \n", "max NaN NaN NaN NaN 12.0 \n", "\n", " \\\n", " Day_of_the_week PartDay Watch Glasses Start_day ShipName Nationality \n", "count 1 1 1 1.0 5 5 5 \n", "unique 1 1 1 2 5 4 \n", "top TUESDAY 3 VM UNKNOWN EL COLON SPANISH \n", "freq 1 1 1 4 1 2 \n", "mean NaN NaN NaN 8.0 NaN NaN NaN \n", "std NaN NaN NaN NaN NaN NaN \n", "min NaN NaN NaN 8.0 NaN NaN NaN \n", "25% NaN NaN NaN 8.0 NaN NaN NaN \n", "50% NaN NaN NaN 8.0 NaN NaN NaN \n", "75% NaN NaN NaN 8.0 NaN NaN NaN \n", "max NaN NaN NaN 8.0 NaN NaN NaN \n", "\n", " \\\n", " Ship_type Company Name1 Rank1 Name2 \n", "count 4 2 4 4 1 \n", "unique 4 2 4 3 1 \n", "top PAQUEBOTE MCC THOMAS D'ORVES CAPITAN CHARLES WARREN \n", "freq 1 1 1 2 1 \n", "mean NaN NaN NaN NaN NaN \n", "std NaN NaN NaN NaN NaN \n", "min NaN NaN NaN NaN NaN \n", "25% NaN NaN NaN NaN NaN \n", "50% NaN NaN NaN NaN NaN \n", "75% NaN NaN NaN NaN NaN \n", "max NaN NaN NaN NaN NaN \n", "\n", " \\\n", " Rank2 Name3 Rank3 voyage_from voyage_to \n", "count 1 0 0 5 5 \n", "unique 1 0 0 5 4 \n", "top 2ND OFFICER/LIEUTENANT NaN NaN LA HABANA LA CORUÑA \n", "freq 1 NaN NaN 1 2 \n", "mean NaN NaN NaN NaN NaN \n", "std NaN NaN NaN NaN NaN \n", "min NaN NaN NaN NaN NaN \n", "25% NaN NaN NaN NaN NaN \n", "50% NaN NaN NaN NaN NaN \n", "75% NaN NaN NaN NaN NaN \n", "max NaN NaN NaN NaN NaN 
\n", "\n", " \\\n", " Anchored_ind AnchorPlace DASno VoyageIni Course_ship Ship_speed \n", "count 5 0 0 5 2 0 \n", "unique 1 0 0 5 2 0 \n", "top 0 NaN NaN 17710819 WTZ NaN \n", "freq 5 NaN NaN 1 1 NaN \n", "mean NaN NaN NaN NaN NaN NaN \n", "std NaN NaN NaN NaN NaN NaN \n", "min NaN NaN NaN NaN NaN NaN \n", "25% NaN NaN NaN NaN NaN NaN \n", "50% NaN NaN NaN NaN NaN NaN \n", "75% NaN NaN NaN NaN NaN NaN \n", "max NaN NaN NaN NaN NaN NaN \n", "\n", " \n", " Distance EncName EncNat \n", "count 4 0 0 \n", "unique 4 0 0 \n", "top 175.00 NaN NaN \n", "freq 1 NaN NaN \n", "mean NaN NaN NaN \n", "std NaN NaN NaN \n", "min NaN NaN NaN \n", "25% NaN NaN NaN \n", "50% NaN NaN NaN \n", "75% NaN NaN NaN \n", "max NaN NaN NaN " ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.options.display.max_columns = None\n", "all_data.data[[\"c99_voyage\"]].describe(include=\"all\")" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 TENERIFE\n", "1 GREENWICH\n", "2 NL_0_01\n", "3 BERMUDA\n", "4 TENERIFE\n", "Name: ZeroMeridian, dtype: object" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_data.data[[\"c99_voyage\"]].c99_voyage.ZeroMeridian.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Ship types and languages\n", "\n", "For example, ship types in this deck are recorded in several different languages. There is no code table for this variable on the CLIWOC website."
] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 PAQUEBOTE\n", "2 SNAUW\n", "3 5TH RATE\n", "4 PAQUEBOT\n", "Name: Ship_type, dtype: object" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_data.data[[\"c99_voyage\"]].c99_voyage.Ship_type.dropna().head()" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " AT_reading_units SST_reading_units AP_reading_units BART_reading_units \\\n", "count 0 0 0 0 \n", "unique 0 0 0 0 \n", "top NaN NaN NaN NaN \n", "freq NaN NaN NaN NaN \n", "mean NaN NaN NaN NaN \n", "std NaN NaN NaN NaN \n", "min NaN NaN NaN NaN \n", "25% NaN NaN NaN NaN \n", "50% NaN NaN NaN NaN \n", "75% NaN NaN NaN NaN \n", "max NaN NaN NaN NaN \n", "\n", " ReferenceCourse ReferenceWindDirection Decl Distance_units \\\n", "count 2 5 5 0 \n", "unique 1 1 5 0 \n", "top UNKNOWN UNKNOWN -20 NaN \n", "freq 2 5 1 NaN \n", "mean NaN NaN NaN NaN \n", "std NaN NaN NaN NaN \n", "min NaN NaN NaN NaN \n", "25% NaN NaN NaN NaN \n", "50% NaN NaN NaN NaN \n", "75% NaN NaN NaN NaN \n", "max NaN NaN NaN NaN \n", "\n", " Distance_units_to_landmark Distance_units_travelled Longitude_units \\\n", "count 1 4 5 \n", "unique 1 4 2 \n", "top LEAGUES MILLAS 360 DEGREES \n", "freq 1 1 3 \n", "mean NaN NaN NaN \n", "std NaN NaN NaN \n", "min NaN NaN NaN \n", "25% NaN NaN NaN \n", "50% NaN NaN NaN \n", "75% NaN NaN NaN \n", "max NaN NaN NaN \n", "\n", " units_of_measurement humidity_units water_at_pump_units wind_scale \\\n", "count 0 0 0 0 \n", "unique 0 0 0 0 \n", "top NaN NaN NaN NaN \n", "freq NaN NaN NaN NaN \n", "mean NaN NaN NaN NaN \n", "std NaN NaN NaN NaN \n", "min NaN NaN NaN NaN \n", "25% NaN NaN NaN NaN \n", "50% NaN NaN NaN NaN \n", "75% NaN NaN NaN NaN \n", "max NaN NaN NaN NaN \n", "\n", " BARO_type BARO_brand API Humidity_method compas_error \\\n", "count 0 0 0 0 0 \n", "unique 0 0 0 0 0 \n", "top NaN NaN NaN NaN NaN \n", "freq NaN NaN NaN NaN NaN \n", "mean NaN NaN NaN NaN NaN \n", "std NaN NaN NaN NaN NaN \n", "min NaN NaN NaN NaN NaN \n", "25% NaN NaN NaN NaN NaN \n", "50% NaN NaN NaN NaN NaN \n", "75% NaN NaN NaN NaN NaN \n", "max NaN NaN NaN NaN NaN \n", "\n", " compas_correction AT_outside SST AP wind_dir current_dir \\\n", "count 0 0.0 0.0 0 5 0 \n", "unique 0 NaN NaN 0 5 0 \n", "top NaN NaN NaN NaN S NaN \n", "freq NaN NaN NaN NaN 1 NaN \n", 
"mean NaN NaN NaN NaN NaN NaN \n", "std NaN NaN NaN NaN NaN NaN \n", "min NaN NaN NaN NaN NaN NaN \n", "25% NaN NaN NaN NaN NaN NaN \n", "50% NaN NaN NaN NaN NaN NaN \n", "75% NaN NaN NaN NaN NaN NaN \n", "max NaN NaN NaN NaN NaN NaN \n", "\n", " current_speed attached_tem pump_water Humidity \\\n", "count 0 0.0 0 0 \n", "unique 0 NaN 0 0 \n", "top NaN NaN NaN NaN \n", "freq NaN NaN NaN NaN \n", "mean NaN NaN NaN NaN \n", "std NaN NaN NaN NaN \n", "min NaN NaN NaN NaN \n", "25% NaN NaN NaN NaN \n", "50% NaN NaN NaN NaN \n", "75% NaN NaN NaN NaN \n", "max NaN NaN NaN NaN \n", "\n", " wind_force \\\n", "count 5 \n", "unique 5 \n", "top EN REFREGONES FUERTES Y DESPUES BONANCIBLE \n", "freq 1 \n", "mean NaN \n", "std NaN \n", "min NaN \n", "25% NaN \n", "50% NaN \n", "75% NaN \n", "max NaN \n", "\n", " weather prcp_descriptor \\\n", "count 2 0 \n", "unique 2 0 \n", "top MUY MALOS CARICES. AGUACEROS, RELAMPAGOS Y TRU... NaN \n", "freq 1 NaN \n", "mean NaN NaN \n", "std NaN NaN \n", "min NaN NaN \n", "25% NaN NaN \n", "50% NaN NaN \n", "75% NaN NaN \n", "max NaN NaN \n", "\n", " sea_state shape_coulds dir_coulds Clearness \\\n", "count 4 0 0 0 \n", "unique 4 0 0 0 \n", "top GRANDE DEL O Y DEL ENE NaN NaN NaN \n", "freq 1 NaN NaN NaN \n", "mean NaN NaN NaN NaN \n", "std NaN NaN NaN NaN \n", "min NaN NaN NaN NaN \n", "25% NaN NaN NaN NaN \n", "50% NaN NaN NaN NaN \n", "75% NaN NaN NaN NaN \n", "max NaN NaN NaN NaN \n", "\n", " cloud_fraction gusts Rain Fog Snow Thunder Hail Sea_ice \\\n", "count 0 5 5 5 5 5 5 5 \n", "unique 0 1 2 1 1 2 1 1 \n", "top NaN 0 0 0 0 0 0 0 \n", "freq NaN 5 4 5 5 4 5 5 \n", "mean NaN NaN NaN NaN NaN NaN NaN NaN \n", "std NaN NaN NaN NaN NaN NaN NaN NaN \n", "min NaN NaN NaN NaN NaN NaN NaN NaN \n", "25% NaN NaN NaN NaN NaN NaN NaN NaN \n", "50% NaN NaN NaN NaN NaN NaN NaN NaN \n", "75% NaN NaN NaN NaN NaN NaN NaN NaN \n", "max NaN NaN NaN NaN NaN NaN NaN NaN \n", "\n", " Trivial_correction Release \n", "count 5 5 \n", "unique 1 2 \n", "top 0 
CLIWOC VERSION 2.0 \n", "freq 5 4 \n", "mean NaN NaN \n", "std NaN NaN \n", "min NaN NaN \n", "25% NaN NaN \n", "50% NaN NaN \n", "75% NaN NaN \n", "max NaN NaN " ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_data.data[[\"c99_data\"]].c99_data.describe(include=\"all\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Wind force scales and languages\n", "\n", "Wind force is likewise recorded as free text in several languages, each with its own descriptive scale, as the first few records show." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 EN REFREGONES FUERTES Y DESPUES BONANCIBLE\n", "1 FOIBLE\n", "2 STIJVE GEREEFDE MARSZEILSKOELTE\n", "3 FRESH GALES AND SQUALLY\n", "4 BONANCIBLE\n", "Name: wind_force, dtype: object" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_data.data[[\"c99_data\"]].c99_data.wind_force.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.2" } }, "nbformat": 4, "nbformat_minor": 4 }