
 ISATAB Ephesis export README
====================================

## 1. Introduction
This document describes the structure and content of the ISATAB-formatted Ephesis export. As the name suggests, this export format is based on the ISATAB format and also follows the transPLANT ISATAB configuration (http://cropnet.pl/phenotypes/?page_id=71).

The ISATAB format is a generic framework used to store experimental data in biology. It is organized in three hierarchical levels (investigation, study and assay): an investigation groups one or more studies, and each study groups one or more assays.

An ISATAB export contains one or more files for each of these levels. The investigation file is the main one and references the study files and assay files.

This export also adds data files and trait definition files which are external to the ISATAB standard but are referenced in the assay files (see part 4.2).

## 2. Investigation file
The investigation is a group of Ephesis trials which are represented as studies in this format.
The investigation file (named "i_Investigation.txt") is composed of rows and columns. The first column contains section names (written in upper case) and field names (with the first letter of each word capitalized). Field values are written in columns two and above.
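
The row layout described above can be read with a few lines of Python. This is only a minimal sketch: it assumes the file is tab-delimited, and the function name `parse_investigation` and the sample values (date, trial number, trial title) are made up for illustration.

```python
import csv
import io


def parse_investigation(text):
    """Group investigation rows into (section name, {field: values}) pairs.

    A list is returned rather than a dict because the block of STUDY
    sections is repeated once per study.
    """
    sections = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row or not row[0].strip():
            continue  # skip blank lines
        label, values = row[0].strip(), row[1:]
        if label.isupper():
            # Upper-case label: start of a new section.
            sections.append((label, {}))
        elif sections:
            # Any other label is a field of the current section.
            sections[-1][1][label] = values
    return sections


# Hypothetical file content, assumed to be tab-delimited.
sample = (
    "INVESTIGATION\n"
    "Investigation Submission Date\t2015-01-30\n"
    "STUDY\n"
    "Study Identifier\t42\n"
    "Study Title\tExample trial\n"
)
for name, fields in parse_investigation(sample):
    print(name, fields)
```

Detecting sections with `str.isupper()` relies only on the naming convention stated above (sections in upper case, fields in title case).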

The following parts present only the important sections and fields, in order of appearance in the investigation file.

### 2.1. Ontology reference section
The first section ("ONTOLOGY SOURCE REFERENCE") references all ontologies used in the investigation, study and assay files. An ontology source reference is described by the following fields:

- "Term Source Name":
    Ontology short name
- "Term Source File":
    File name or a URI of an official resource
- "Term Source Version":
    Version number of the Term Source to support terms tracking
- "Term Source Description":
    Used to disambiguate resources when homologous prefixes have been used

This section can contain multiple entries each written in columns two and above.
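
Because each entry occupies one column, reading the entries of a section amounts to transposing its field rows. The sketch below assumes the section has already been loaded as a mapping from field name to the list of values found in columns two and above; `section_entries`, the ontology names and the URLs are hypothetical.

```python
from itertools import zip_longest


def section_entries(fields):
    """Turn {field name: [value in column 2, column 3, ...]} into one dict per entry."""
    names = list(fields)
    # zip_longest pads short rows with "" so every entry has every field.
    columns = zip_longest(*(fields[name] for name in names), fillvalue="")
    return [dict(zip(names, column)) for column in columns]


# Made-up ontology entries, for illustration only.
ontology_fields = {
    "Term Source Name": ["ONTO_A", "ONTO_B"],
    "Term Source File": ["http://example.org/onto_a", "http://example.org/onto_b"],
    "Term Source Version": ["1", ""],
    "Term Source Description": ["First example ontology", "Second example ontology"],
}
entries = section_entries(ontology_fields)
for entry in entries:
    print(entry["Term Source Name"], entry["Term Source File"])
```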

### 2.2. Investigation sections
All sections whose name starts with "INVESTIGATION" contain basic data about the investigation. As the investigation only represents a group of studies in Ephesis, the only field filled in is "Investigation Submission Date", which is set to the export date of the ISATAB file.

### 2.3. Studies sections
Studies are described by a block of multiple sections (with names prefixed with "STUDY"). To describe multiple studies, this block is repeated. Ephesis trials are listed in these sections as ISATAB studies.

#### 2.3.1. "STUDY" section
The study section contains basic info on the study. A study is described with the following fields:

- "Study Identifier":
    Ephesis trial number
- "Study Title":
    Ephesis trial name
- "Study Description":
    Ephesis trial goal
- "Study File Name":
    Name of the file in the ISATAB folder or archive that defines this study (see part 3)

#### 2.3.2. "STUDY PUBLICATIONS" section
The study publications section contains all publications related to the study. A publication is described by the following fields:

- "Study PubMed ID":
    The PubMed IDs of the publication(s) associated with this study (where available)
- "Study Publication DOI":
    A Digital Object Identifier (DOI) for this publication (where available)
- "Study Publication Author List":
    The list of authors associated with this publication (authors separated with semicolons)
- "Study Publication Title"
- "Study Publication Status":
    A term describing the status of this publication (e.g. submitted, in preparation, published)

This section can contain multiple entries each written in columns two and above.

#### 2.3.3. "STUDY FACTORS" section
The study factors section contains all factors used throughout the ISATAB file (in studies and assays). A factor corresponds to an independent variable manipulated by the experimenter with the intention to affect biological systems in a way that can be measured by an assay. In Ephesis exports, factors are described solely by the "Study Factor Name" field, and factor values are defined at the assay level (see part 4.1 on the assay file).

This section can contain multiple entries each written in columns two and above.

#### 2.3.4. "STUDY ASSAYS" section
The study assays section contains the assays linked to the study. Most of the fields describing an assay do not relate to Ephesis data; only "Study Assay File Name" is used, and it holds the name of the assay file that defines the assay.

In Ephesis export, a study has only one assay file.

#### 2.3.5. "STUDY CONTACTS" section
The study contacts section relates to Ephesis trial contacts and contains the following fields:

- "Study Person Last Name"
- "Study Person First Name"
- "Study Person Mid Initials"
- "Study Person Email"
- "Study Person Phone"
- "Study Person Fax"
- "Study Person Address"
- "Study Person Affiliation":
    Organization affiliation of the person
- "Study Person Roles":
    Term to classify the role(s) performed by this person in the context of the study, which means that the roles reported here need not correspond to roles held within their affiliated organization. Multiple annotations or values attached to one person may be provided by using a semicolon as a separator (e.g. "submitter;funder;sponsor").

This section can contain multiple entries each written in columns two and above.
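
Fields that hold several values in one cell, such as "Study Person Roles", can be split on the semicolon separator. A minimal sketch follows; the helper name `split_values` is hypothetical, and trimming whitespace around each value is an assumption rather than something the export guarantees.

```python
def split_values(cell):
    """Split a semicolon-separated cell into a list of trimmed, non-empty values."""
    return [value.strip() for value in cell.split(";") if value.strip()]


roles = split_values("submitter;funder;sponsor")
print(roles)
```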

## 3. Study file
The study file (named "s_<study name>.txt") links sources (starting biological material) to samples (biological material prepared for the assay). In Ephesis phenotyping studies, the source material is also the tested sample, so this file primarily lists the source plant materials and their characteristics.

The basic layout of columns for this file is shown below.

+-------------+---------------------------+-------------+
| Source Name | Characteristics [<term>]* | Sample Name |
+-------------+---------------------------+-------------+

_* Repeatable column_

### 3.1 Source and characteristics
#### Source
The "Source Name" is a unique identifier referenced only in this file. Since it is less important than the sample name, it simply consists of the sample name suffixed with "-SRC" (see the "Sample Name" in part 3.2).

#### Characteristics
The "Characteristics [<term>]" columns are used to describe the source plant material. The "<term>" in the column header is replaced by the name of the characteristic.
The possible characteristics include:

- "Organism":
    Plant scientific name
- "Infra-specific name":
    Plant variety (or accession name by default)
- "Organism part":
    Doesn't relate to Ephesis data but is required in the transPLANT configuration and thus is left blank
- "Accession number"
- "Accession name"
- "Lot name"

### 3.2 Sample
The Sample Name column contains a unique identifier that is also used in assays. It is composed of the source taxonomy scientific name, the source accession name and the source lot name, using the following syntax:

    <Taxonomy Scientific Name>-<Accession Name>-<Lot Name>

All spaces in this identifier are replaced by underscores.

Example of sample name:

	Taxonomy Scientific Name = Triticum aestivum aestivum
	Accession Name = A
	Lot Name = a

	Sample Name = Triticum_aestivum_aestivum-A-a
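
The naming rule above, together with the "-SRC" suffix used for the source name, can be sketched as follows. The function names are hypothetical and not part of the export tool.

```python
def build_sample_name(scientific_name, accession_name, lot_name):
    """Join the three parts with hyphens and replace spaces by underscores."""
    parts = (scientific_name, accession_name, lot_name)
    return "-".join(part.replace(" ", "_") for part in parts)


def build_source_name(sample_name):
    """The source name is the sample name suffixed with "-SRC"."""
    return sample_name + "-SRC"


sample_name = build_sample_name("Triticum aestivum aestivum", "A", "a")
print(sample_name)
print(build_source_name(sample_name))
```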

## 4. Assay file
The assay file (named "a_<study name>.txt") lists assays linked to the samples described in the study file. An assay is the combination of a plant material sample and an applied factor (e.g. treatments, drought, ...).

The basic layout of columns in this file is shown below.

+-------------+---------------+--------------------------+------------+---------------------------+-------------------+---------------------------------+
| Sample Name | Material Type | Factor Value [<factor>]* | Assay Name | Characteristics [<term>]* | Derived Data File | Comment [Trait Definition File] |
+-------------+---------------+--------------------------+------------+---------------------------+-------------------+---------------------------------+

_* Repeatable column_

### 4.1. Sample and assays
#### Sample
The Sample Name column contains the sample identifier from the study file. The sample material type is required in the Material Type column but doesn't correspond to Ephesis data and is left blank.

#### Factor values
Each sample can be associated with multiple factors using "Factor Value [<factor>]" columns, where "<factor>" is the name of a factor that should have been declared in the investigation file (see part 2.3.3). The values in these columns can be free text, an ontology term or a quantitative value. If an ontology term is used, the "Term Accession Number" and "Term Source REF" columns should be added to reference the ontology term. If the value is quantitative, a "Unit" column should be appended, followed by the "Unit Term Accession Number" and "Unit Term Source REF" columns, to reference the unit as an ontology term.
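
One way to check that every factor column refers to a declared factor is to extract the factor names from the assay file header. This is a sketch under assumptions: the helper names and the example header are hypothetical, and the regular expression tolerates an optional space before the bracket.

```python
import re

# Matches headers such as "Factor Value [tr]" and captures the factor name.
FACTOR_HEADER = re.compile(r"^Factor Value\s*\[(.+)\]$")


def factor_names(header):
    """Return the factor names found in a list of assay file column headers."""
    names = []
    for column in header:
        match = FACTOR_HEADER.match(column.strip())
        if match:
            names.append(match.group(1).strip())
    return names


def undeclared_factors(header, declared):
    """Factors used in the header but missing from the declared factor names."""
    return [name for name in factor_names(header) if name not in declared]


header = ["Sample Name", "Material Type", "Factor Value [tr]",
          "Factor Value [wa]", "Unit", "Assay Name"]
print(factor_names(header))
print(undeclared_factors(header, declared={"tr"}))
```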

#### Assay
The Assay Name column contains a unique identifier of the assay which is composed of the sample name, the factors of the assay, the campaign (if any) and the observation date (if any) using the following syntax:

    <Sample Name>-(<Factor Name>:<Factor Value>)*-<campaign>-<date>

_* Repeatable block_

The block in parentheses is repeated for each factor of the assay (the parentheses and the asterisk do not appear in the final result; they are only used here to mark the repeatable block). All spaces in the factor names or values are replaced by underscores.

Example of assay name:

	Sample Name = Triticum_aestivum_aestivum-A-a
	Factor Name 1 = tr (treatment)
	Factor Value 1 = n (nitrate)
	Factor Name 2 = wa (watering)
	Factor Value 2 = d (drought)
	Campaign = 2007
	Date = 2007-02-12

	Assay Name = Triticum_aestivum_aestivum-A-a-tr:n-wa:d-2007-2007-02-12
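
The assay naming syntax can be sketched as below. The function name is hypothetical, and the handling of a missing campaign or date (the part is simply skipped, with no empty segment left behind) is an assumption, since only the case where both are present is documented here.

```python
def build_assay_name(sample_name, factors, campaign=None, date=None):
    """Build an assay name from a sample name and (factor name, factor value) pairs.

    Assumption: when there is no campaign or no date, that part is omitted.
    """
    parts = [sample_name]
    for name, value in factors:
        # Spaces in factor names and values are replaced by underscores.
        parts.append(f"{name}:{value}".replace(" ", "_"))
    parts.extend(str(part) for part in (campaign, date) if part)
    return "-".join(parts)


assay_name = build_assay_name(
    "Triticum_aestivum_aestivum-A-a",
    [("tr", "n"), ("wa", "d")],
    campaign="2007",
    date="2007-02-12",
)
print(assay_name)
```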

#### Characteristics
The "Characteristics [<term>]" columns are used to describe the assay. The "<term>" in the column header is replaced by the name of the characteristic.

Ephesis data can have a wide range of assay characteristics, such as plant-level characteristics (e.g. pot, block, position) or a time scale (e.g. months).

### 4.2 Attached files
Each assay can have a data file and a trait definition file associated with it.

#### 4.2.1 Data File
The data file produced by the measurement of traits is associated with assays in the assay file through the "Derived Data File" column. This column contains the name of the data file included in the ISATAB archive.

The derived data file (named "d_<study name>.txt") is not part of the ISATAB standard, but its format resembles ISATAB. It is a tab-separated values file that uses the column layout shown below.

+------------+------------------------+
| Assay Name | Trait Value [<trait>]* |
+------------+------------------------+

_* Repeatable column_

This file lists assays and, for each of them, their trait values (measurements). The "Assay Name" column uses the same identifiers as the Assay Name column in the assay file. The following columns list the trait values of an assay using "Trait Value [<trait>]" columns, where "<trait>" is replaced by the name of the trait. Every trait used in this file should be defined in the trait definition file (see part 4.2.2).
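
A derived data file with this layout can be loaded with Python's standard `csv` module. The sketch below keeps values as raw strings; the function name, the assay names and the trait names "PH" and "YLD" are made up for illustration.

```python
import csv
import io
import re

# Matches headers such as "Trait Value [PH]" and captures the trait name.
TRAIT_HEADER = re.compile(r"^Trait Value\s*\[(.+)\]$")


def read_trait_values(text):
    """Return {assay name: {trait name: raw value}} from derived data file content."""
    values = {}
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        traits = {}
        for column, value in row.items():
            match = TRAIT_HEADER.match(column or "")
            if match:
                traits[match.group(1).strip()] = value
        values[row["Assay Name"]] = traits
    return values


# Hypothetical file content.
data = (
    "Assay Name\tTrait Value [PH]\tTrait Value [YLD]\n"
    "assay-1\t85\t6.2\n"
    "assay-2\t91\t\n"
)
trait_values = read_trait_values(data)
print(trait_values["assay-1"])
```

An empty cell is kept as an empty string, so callers can decide how to treat missing measurements.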

#### 4.2.2 Trait Definition File
For each assay, the traits used in the data file should be defined in a trait definition file. In the assay file, the "Comment [Trait Definition File]" column contains the name of this file, which is located in the ISATAB archive.
Like the derived data file, the trait definition file (named "tdf_<study name>.txt") is not part of the ISATAB standard, although its format resembles ISATAB. The file uses the column layout shown below.

+------------+-----------+-------------+----------+------+-------+
| Trait Name | Full Name | Description | Protocol | Unit | Scale |
+------------+-----------+-------------+----------+------+-------+

##### Trait identification
The "Trait Name" column lists all traits used in the derived data file. If the trait name is abbreviated, the "Full Name" column contains the non-abbreviated trait name.

"Description" is an optional column with free text values used to describe the trait more precisely.

##### Measurement
The "Protocol" column is used to define the protocol used in the measurement of the trait.

The "Unit" column contains the unit of the measurement of the trait.

When the trait corresponds to a qualitative value, the "Scale" column lists the possible values of the trait.
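
Since every trait used in the derived data file should be defined here, a simple consistency check is to index the definitions by trait name and look for missing ones. This sketch assumes the trait definition file is tab-separated like the derived data file; the function names and the sample rows are hypothetical.

```python
import csv
import io


def read_trait_definitions(text):
    """Index the rows of a trait definition file by their "Trait Name" value."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["Trait Name"]: row for row in reader}


def undefined_traits(used, definitions):
    """Trait names used in the data file that have no definition."""
    return sorted(set(used) - set(definitions))


# Hypothetical file content, assumed to be tab-separated.
tdf = (
    "Trait Name\tFull Name\tDescription\tProtocol\tUnit\tScale\n"
    "PH\tPlant height\t\tRuler measurement\tcm\t\n"
)
definitions = read_trait_definitions(tdf)
print(definitions["PH"]["Unit"])
print(undefined_traits(["PH", "YLD"], definitions))
```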
