Data Layer
To create a new Data Layer you first need to decide on a unique identifier for it: the `key`. A common convention is to use something like `<source>_<parameter>`, e.g. `worldpop_popd` (a layer from WorldPop describing population density).
The source files for Data Layers are created inside the folder `src/datalayers/`. The filename should be `<key>.py`. Inside it a new class needs to be defined; the class name should be in CamelCase (e.g. `WorldpopPopd`), following the Python convention for class names. It is derived from the `BaseLayer` class (or another/custom class that bundles functionality for a specific data source).
Next you need to specify the temporal dimension of your data source, as well as the type of value it contains. This is defined via constants inside the `__init__(self)` method.
| Temporal resolution | Constant |
|---|---|
| Data is provided per year (default). | `LayerTimeResolution.YEAR` |
| Data is provided on a monthly basis (planned). | `LayerTimeResolution.MONTH` |
| Data is provided on a daily basis. | `LayerTimeResolution.DAY` |
| Value Type | Constant |
|---|---|
| Numerical floating point value (default). | `LayerValueType.VALUE` |
| Percentage (0 to 1). | `LayerValueType.PERCENTAGE` |
| Binary value (`True`, `False`). | `LayerValueType.BINARY` |
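For example, a Data Layer holding daily percentage values would combine the two constants like this (a minimal sketch; the class name `ExampleCoverage` is hypothetical):

```python
from datalayers.datasources.base_layer import (
    BaseLayer,
    LayerTimeResolution,
    LayerValueType,
)


class ExampleCoverage(BaseLayer):
    def __init__(self):
        super().__init__()

        # daily data holding percentage values between 0 and 1
        self.time_col = LayerTimeResolution.DAY
        self.value_type = LayerValueType.PERCENTAGE
```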
Now you can define the source of the Data Layer based on the template below. After creating the source file, a corresponding entry has to be created via the Django frontend using the same `key` (this is also where you can set the metadata documentation).
TIP
Please also check the source code and other example Data Layers for additional functions the framework provides for consuming and harmonizing data.
from datalayers.datasources.base_layer import BaseLayer
from datalayers.datasources.base_layer import LayerValueType
from datalayers.datasources.base_layer import LayerTimeResolution


class ExampleLayer(BaseLayer):
    def __init__(self):
        super().__init__()

        self.time_col = LayerTimeResolution.YEAR
        self.value_type = LayerValueType.VALUE

    def download(self):
        # define how to obtain the original data here
        ...

    def process(self, shapes=None, save_output=False):
        # define how to process the downloaded data here
        ...
Data download
Inside the `download()` method you need to define how to download the original data. In the simplest case this is just the download of a single file.
def download(self):
    # _save_url_to_file() downloads the file and saves it into
    # the Data Layer specific folder in `data/datalayers/<key>/`.
    self._save_url_to_file("https://example.test/data.csv")
For more complex integrations you can use the `self.get_data_path() -> Path` method to get the actual storage directory of the Data Layer (it points to `data/datalayers/<key>/`).
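As an illustration, a `download()` that fetches and unpacks one archive per year might look like this (a minimal sketch: the URL, the per-year archives, and the use of `urllib`/`zipfile` are assumptions; only `get_data_path()` is provided by the framework):

```python
import urllib.request
import zipfile


def download(self):
    # storage directory of this Data Layer (data/datalayers/<key>/)
    target = self.get_data_path()
    target.mkdir(parents=True, exist_ok=True)

    for year in range(2020, 2024):
        # hypothetical source that offers one archive per year
        archive = target / f"data_{year}.zip"
        urllib.request.urlretrieve(f"https://example.test/data_{year}.zip", archive)

        # unpack the archive next to the download
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
```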
Data processing
Inside the processing routine you need to define how the downloaded data should be processed with regard to the loaded Shapes.
Suppose the `data.csv` looks like the following table:
| shape_name | year | value |
|---|---|---|
| Region-A | 2022 | 2 |
| Region-A | 2023 | 3 |
| Region-B | 2022 | 2 |
| Region-B | 2023 | 4 |
| Region-C | 2022 | 1 |
import pandas as pd

from shapes.models import Shape  # import path may differ in your project


def process(self, shapes=None, save_output=False):
    # get all loaded shapes from the Data Hub
    if shapes is None:
        shapes = Shape.objects.all()

    # open the previously downloaded CSV file
    df = pd.read_csv(self.get_data_path() / 'data.csv')

    for shape in shapes:
        # limit to the rows of the current shape
        dfx = df[df['shape_name'] == shape.name]

        # iterate over the rows per year
        for _, row in dfx.iterrows():
            # self.rows should contain one entry for each
            # shape-time combination that is available in the data
            self.rows.append({
                'year': row['year'],
                'shape_id': shape.id,
                'value': row['value'],
            })

    # persist the data inside self.rows to the PostGIS database
    self.save()
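With the example `data.csv` from above, `self.rows` would end up with one entry per shape-year combination, roughly like this (the `shape_id` values 1, 2, 3 are placeholders; the actual ids depend on the loaded shapes):

```python
[
    {'year': 2022, 'shape_id': 1, 'value': 2},
    {'year': 2023, 'shape_id': 1, 'value': 3},
    {'year': 2022, 'shape_id': 2, 'value': 2},
    {'year': 2023, 'shape_id': 2, 'value': 4},
    {'year': 2022, 'shape_id': 3, 'value': 1},
]
```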
Documentation
The documentation of a Data Layer follows one guiding principle:
Important
Anything related to data processing needs to be in source code!
Regarding metadata/description this principle is weakened, since this information can be edited via the web-based backend and is thus not enforced through source control.
The following tables show the distinction between source-based and database-based information, as well as the options to document the original data source.
Technical information
| Field | Description | Database | Source code |
|---|---|---|---|
| `key` | Unique identifier of the Data Layer inside the Data Hub. | x | x |
| `name` | Human-readable description in one line. | x | |
| `description` | Description of the Data Layer. Describe the data, weaknesses, etc. | x | |
| `category` | Human-readable name of a category describing the data, like "Weather" or "Infrastructure". | x | |
| `tags` | | x | |
| `date_included` | First integration of the Data Layer into the Hub. | x | |
| `related_to` | Comma-separated list of `key`s of other Data Layers this one is related to, e.g. because both describe the same parameter. The relationship is symmetrical, so if only one direction of a relationship is described, the other side will be visible in the Data Hub as well. | | |
| `operation` | Description of how the data was processed (sum, mean, count, …). | x | x (actual function) |
| `unit` | Suffix for human representation (°C, mm, entities, …). | x | |
| `value_type` | Float, Int, Percent, Bool. | x | |
| `precision` | Decimal places to show for processed values. | x | |
| `time_col` | Temporal resolution of the integrated data (annually, daily, …). | x | |
After integrating a Data Layer into the system and performing the data download and processing actions, the system will provide an overview that contains:
- Spatial coverage: for which shape types, and for how many shapes within each type, actual data could be derived.
- Temporal coverage: how many data points could be integrated and how complete the data are between the first and last data point (how many values are missing according to the temporal resolution).
- Processing history: a log of the individual download and processing steps with timestamps, as well as warnings/errors that occurred during these steps.
Original data metadata
Original metadata, describing the source data of the Data Layer. This information should provide a quick overview of the nature of the data used in this Data Layer. It does not follow a "formal" specification, since not all data sources provide metadata in such a manner.
| Field | Description |
|---|---|
| `format` | MIME type of the data, if available; otherwise free text, e.g. "API", "PDF". |
| `format_description` | Further description/links for consuming/understanding the format. |
| `format_unit` | Unit used to describe values in the source, like percentage, mm/day, … |
| `spatial_details` | Type of the geographic data, like "0.1km raster", "GPS coordinates", "administrative units", … |
| `spatial_coverage` | What extent the data covers, like a country or global. |
| `temporal_details` | Annually, daily, cross-sectional. |
| `temporal_coverage` | From when to when the source provides data. |
| `language` | Language the data is provided/documented/published in. ISO 639-2 or -3 code (like `de`, `en` or `eng`). |
| `source` | Institution/organization that published the data. |
| `source_link` | URL to the source. |
| `license` | License of the data. If available, use an SPDX license identifier. |
| `citation` | Preferred citation by the authors of the data. (TODO: free text? Or specify something like BibTeX?) |
| `date_published` | Date the data was first published, year or ISO 8601 format `yyyy-mm-dd`. |
| `date_last_accessed` | Date the information was last accessed/checked. |
Each Data Layer can have multiple sources attached to it. This might be useful if a Data Layer is collected from multiple sources, or if a source provides multiple DOIs/citations for its data. Additionally, further items can be connected with the Data Layer, e.g. authors/institutions or publications.
| Field | Description |
|---|---|
| `pid_type` | `DOI`, `ROR` or `ORCID`. |
| `pid` | Persistent identifier of some sort. |
| `description` | Describe why this source is relevant for the Data Layer and what it contains. |