Data Asset Schema

The standard for describing a Data Asset

Templates for Data Assets in ixo Documents use a structured data format based on open schemas (mainly schema.org) to describe any type of Data Asset existing within the Internet of Impact.

Types of data assets

  • A structured object, such as a Verifiable Claim, with a data model that can be processed using a specific tool or algorithm

  • An algorithm for processing or transforming data

  • A table or a CSV file with some data

  • An organised collection of tables

  • A search query

  • A collection of files which are related in a way that provides a meaningful dataset

  • Images capturing data

  • Files relating to machine learning, such as trained parameters or neural network structure definitions

  • Anything else that looks like a data asset!

The standard data model (schema) for data assets

The ixo standard for data assets is compatible with the Web 2.0 guidelines that dataset providers use to describe their data, so that search engines such as Google can better understand the content of pages. Data assets are easier to find and understand when they are described with metadata such as name, description, creator, and format.

The schema describing Data Assets within ixo Documents implements the schema.org Dataset structure.

Dataset example

For example, if the Data Asset is a Dataset, we use the schema.org/Dataset definition, as described in the following table. It includes information about the publication of the dataset, such as the license, when it was published, and an identifier (DOI) or sameAs pointing to a canonical version of this Dataset object in a different repository.

Add identifier, license, and sameAs for Datasets that provide provenance and license information.

Required properties

description

Text

A short summary describing a dataset.

Guidelines

  • The summary must be between 50 and 5000 characters long.

  • The summary may include Markdown syntax. Embedded images need to use absolute path URLs (instead of relative paths).

  • When using the JSON-LD format, denote new lines with \n (two characters: backslash and lower case letter "n").

name

Text

A descriptive name of a dataset. For example, "Snow depth in Northern Hemisphere".
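As a minimal sketch of the two required properties, the following Python builds a Dataset object and serializes it to JSON-LD. The check_required helper and the description text are illustrative assumptions, not part of the ixo or schema.org tooling; the helper only mirrors the length guideline stated above.

```python
import json


def check_required(dataset: dict) -> list:
    """Return a list of problems with the required Dataset properties."""
    problems = []
    if not dataset.get("name"):
        problems.append("name is missing")
    description = dataset.get("description", "")
    if not 50 <= len(description) <= 5000:
        problems.append("description must be 50 to 5000 characters long")
    return problems


# Illustrative values only.
dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Snow depth in Northern Hemisphere",
    "description": (
        "Daily snow depth measurements for the Northern Hemisphere.\n"
        "Values are illustrative and exist only to show the format."
    ),
}

problems = check_required(dataset)
# json.dumps writes the line break in the description as the two characters \n.
print(problems or json.dumps(dataset, indent=2))
```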

Recommended properties

alternateName

Text

Alternative names that have been used to refer to this dataset, such as aliases or abbreviations. Example (in JSON-LD format):
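A short sketch, built in Python and serialized to JSON-LD. The aliases are made up for illustration; when there is more than one, they are given as a list.

```python
import json

# Hypothetical aliases for the "Snow depth in Northern Hemisphere" dataset.
alternate_names = {
    "alternateName": ["NH Snow Depth", "Northern Hemisphere Snow Depth"],
}

print(json.dumps(alternate_names, indent=2))
```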

creator

Person or Organization

The creator or author of this dataset. To uniquely identify individuals, use ORCID ID as the value of the sameAs property of the Person type. To uniquely identify institutions and organizations, use ROR ID. Example (in JSON-LD format):
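A hedged sketch of a creator list with one person and one organization, built in Python and serialized to JSON-LD. The names and both identifiers are placeholders, and giving the ROR ID through sameAs (the same way as the ORCID ID) is an assumption.

```python
import json

creator = [
    {
        "@type": "Person",
        "name": "Jane Doe",  # hypothetical person
        "sameAs": "https://orcid.org/0000-0000-0000-0000",  # placeholder ORCID ID
    },
    {
        "@type": "Organization",
        "name": "Example Research Institute",  # hypothetical organization
        "sameAs": "https://ror.org/xxxxxxxxx",  # placeholder ROR ID
    },
]

print(json.dumps({"creator": creator}, indent=2))
```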

citation

Text or CreativeWork

Identifies academic articles that the data provider recommends be cited in addition to the dataset itself. Provide the citation for the dataset itself with other properties, such as the name, identifier, creator, and publisher properties. For example, this property can uniquely identify a related academic publication such as a data descriptor, a data paper, or an article for which this dataset is supplementary material. Examples (in JSON-LD format):

Additional guidelines

  • Don’t use this property to provide citation information for the dataset itself. It is intended to identify related academic articles, not the dataset itself. To provide information necessary to cite the dataset itself use name, identifier, creator, and publisher properties instead.

  • When populating the citation property with a citation snippet, provide the article identifier (such as a DOI) whenever possible.

    Recommended: "Doe J (2014) Influence of X. Biomics 1(1). https://doi.org/10.1111/111"

    Not recommended: "Doe J (2014) Influence of X. Biomics 1(1)."
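The guideline above can be sketched in Python as follows. The has_doi helper and its simplified DOI pattern are assumptions for illustration only; real DOIs can take forms this pattern does not cover.

```python
import json
import re

# Simplified pattern for a DOI expressed as an https://doi.org/ URL (assumption).
DOI_URL = re.compile(r"https://doi\.org/10\.\d{4,9}/\S+")


def has_doi(citation: str) -> bool:
    """Report whether a citation snippet contains a DOI URL."""
    return DOI_URL.search(citation) is not None


recommended = "Doe J (2014) Influence of X. Biomics 1(1). https://doi.org/10.1111/111"
not_recommended = "Doe J (2014) Influence of X. Biomics 1(1)."

print(json.dumps({"citation": recommended}, indent=2))
print(has_doi(recommended), has_doi(not_recommended))
```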

hasPart or isPartOf

URL or Dataset

If the dataset is a collection of smaller datasets, use the hasPart property to denote this relationship. Conversely, if the dataset is part of a larger dataset, use isPartOf. Both properties can take the form of a URL or a Dataset instance. If a Dataset is used as the value, it has to include all of the properties required for a standalone Dataset. Examples:
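A minimal sketch in Python, serialized to JSON-LD, showing hasPart with an embedded Dataset and isPartOf with a URL. All names, descriptions, and the example.com URL are hypothetical; the embedded Dataset carries its own name and description because those are required for a standalone Dataset.

```python
import json

dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example collection of datasets",
    "description": "Illustrative collection that groups several smaller example datasets.",
    # Embedded Dataset: must include the required properties itself.
    "hasPart": [
        {
            "@type": "Dataset",
            "name": "Sub-dataset 01",
            "description": "Illustrative sub-dataset; values exist only to show the expected structure.",
        },
    ],
    # URL form: points at the larger dataset this one belongs to.
    "isPartOf": "https://example.com/aggregate_dataset",
}

print(json.dumps(dataset, indent=2))
```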

identifier

URL, Text, or PropertyValue

An identifier, such as a DOI or a Compact Identifier. If the dataset has more than one identifier, repeat the identifier property. If using JSON-LD, this is represented using JSON list syntax.
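As a sketch, a dataset with two identifiers could be built in Python like this and serialized to JSON-LD. Both identifier values are placeholders, not real records.

```python
import json

# Repeating the identifier property becomes a JSON list in JSON-LD.
identifiers = {
    "identifier": [
        "https://doi.org/10.1111/111",  # placeholder DOI
        "https://identifiers.org/ark:/12345/fk1234",  # placeholder Compact Identifier
    ],
}

print(json.dumps(identifiers, indent=2))
```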

keywords

Text

Keywords summarizing the dataset.

license

URL, CreativeWork

A license under which the dataset is distributed. For example:

Additional guidelines

  • Provide a URL that unambiguously identifies a specific version of the license used.

    Recommended

    "license" : "https://creativecommons.org/licenses/by/4.0"

    Not recommended

    "license" : "https://creativecommons.org/licenses/by"

sameAs

URL

URL of a reference Web page that unambiguously indicates the dataset's identity, usually in a different repository.

spatialCoverage

Text, Place

You can provide a single point that describes the spatial aspect of the dataset. Only include this property if the dataset has a spatial dimension. For example, a single point where all the measurements were collected, or the coordinates of a bounding box for an area.

Points
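A single point can be sketched as a Place with GeoCoordinates. The snippet below builds it in Python and prints the JSON-LD; the coordinates are illustrative.

```python
import json

# A single measurement location (illustrative coordinates).
spatial_coverage = {
    "@type": "Place",
    "geo": {
        "@type": "GeoCoordinates",
        "latitude": 39.328,
        "longitude": 120.1633,
    },
}

print(json.dumps({"spatialCoverage": spatial_coverage}, indent=2))
```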

Shapes

Use GeoShape to describe areas of different shapes. For example, to specify a bounding box.

Points inside box, circle, line, or polygon properties must be expressed as a space-separated pair of two values corresponding to latitude and longitude (in that order).
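A minimal sketch of a bounding box in Python, serialized to JSON-LD. The make_box helper is hypothetical; it only enforces the latitude-then-longitude ordering and lists the lower corner first, followed by the upper corner. The coordinates are illustrative.

```python
import json


def make_box(south: float, west: float, north: float, east: float) -> str:
    """Format a bounding box as two space-separated 'latitude longitude' points."""
    return f"{south} {west} {north} {east}"


spatial_coverage = {
    "@type": "Place",
    "geo": {
        "@type": "GeoShape",
        "box": make_box(39.3, 120.1, 40.4, 123.7),
    },
}

print(json.dumps({"spatialCoverage": spatial_coverage}, indent=2))
```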

Named locations
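When only a place name is known, spatialCoverage can be plain text. A short Python sketch with an illustrative place name:

```python
import json

# Text form of spatialCoverage (illustrative place name).
named_location = {"spatialCoverage": "Tahoe City, CA"}

print(json.dumps(named_location, indent=2))
```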

temporalCoverage

Text

The data in the dataset covers a specific time interval. Only include this property if the dataset has a temporal dimension. Schema.org uses the ISO 8601 standard to describe time intervals and time points. You can describe dates differently depending upon the dataset interval. Indicate open-ended intervals with two decimal points (..).

Single date

Time period

Open-ended time period
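The three forms (single date, time period, open-ended time period) can be sketched with a small Python helper. The temporal_coverage function is hypothetical and the dates are illustrative; it only formats ISO 8601 dates in the way described above.

```python
import json
from datetime import date
from typing import Optional


def temporal_coverage(start: date, end: Optional[date] = None,
                      open_ended: bool = False) -> str:
    """Format a temporalCoverage value from ISO 8601 dates."""
    if open_ended:
        # Open-ended interval: start date followed by "/..".
        return f"{start.isoformat()}/.."
    if end is None:
        # Single date.
        return start.isoformat()
    # Closed interval: start and end separated by a slash.
    return f"{start.isoformat()}/{end.isoformat()}"


single = temporal_coverage(date(2008, 1, 1))
period = temporal_coverage(date(1950, 1, 1), date(2013, 12, 18))
ongoing = temporal_coverage(date(2013, 12, 19), open_ended=True)

for value in (single, period, ongoing):
    print(json.dumps({"temporalCoverage": value}))
```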

variableMeasured

Text, PropertyValue

The variable that this dataset measures. For example, temperature or pressure. The variableMeasured property is proposed and pending standardization at schema.org. We encourage publishers to share any feedback on this property with the schema.org community.

version

Text, Number

The version number for the dataset.

url

URL

Location of a page describing the dataset.

DataCatalog

The full definition of DataCatalog is available at schema.org/DataCatalog.

Datasets are often published in repositories that contain many other datasets. The same dataset can be included in more than one such repository. You can refer to a data catalog that this dataset belongs to by referencing it directly.

Recommended properties

includedInDataCatalog

DataCatalog

The catalog to which the dataset belongs.
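A minimal sketch of referencing a catalog directly, built in Python and serialized to JSON-LD. The catalog name and URL are hypothetical.

```python
import json

catalog_reference = {
    "includedInDataCatalog": {
        "@type": "DataCatalog",
        "name": "Example Data Repository",  # hypothetical catalog
        "url": "https://example.com/catalog",  # hypothetical URL
    },
}

print(json.dumps(catalog_reference, indent=2))
```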

DataDownload

The full definition of DataDownload is available at schema.org/DataDownload. In addition to Dataset properties, add the following properties for datasets that provide download options.

The distribution property describes how to get the dataset itself because the URL often points to the landing page describing the dataset. The distribution property describes where to get the data and in what format. This property can have several values: for instance, a CSV version has one URL and an Excel version is available at another.

Required properties

distribution.contentUrl

URL

The link for the download.

Recommended properties

distribution

DataDownload

The description of the location for download of the dataset and the file format for download.

distribution.encodingFormat

Text, URL

The file format of the distribution.
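As a sketch, a dataset offered as both CSV and Excel could describe its downloads as below. The example.com URLs are hypothetical, and the missing_content_url helper is an illustrative check that every entry has the required contentUrl.

```python
import json

# One DataDownload per available format (hypothetical URLs).
distribution = [
    {
        "@type": "DataDownload",
        "encodingFormat": "CSV",
        "contentUrl": "https://example.com/data/snow-depth.csv",
    },
    {
        "@type": "DataDownload",
        "encodingFormat": "XLSX",
        "contentUrl": "https://example.com/data/snow-depth.xlsx",
    },
]


def missing_content_url(downloads: list) -> list:
    """Return the entries that lack the required contentUrl property."""
    return [d for d in downloads if not d.get("contentUrl")]


print(json.dumps({"distribution": distribution}, indent=2))
print(missing_content_url(distribution))
```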

Tabular datasets

A tabular dataset is one organised primarily in terms of a grid of rows and columns. For pages that embed tabular datasets, you can also create more explicit markup, building on the basic approach described above.

Attribution and further resources

The structured data model for ixo data assets builds on schema.org and Google Developer guidelines. To build and test Data Asset templates, a great resource is Google's Structured Data Markup Helper.
