The OBO Flat File Format Specification, version 1.0

john.richter@aya.yale.edu (John Day-Richter)
February 3, 2004

OBO format is the text file format used by OBO-Edit, the open source, platform-independent application for viewing and editing ontologies. See also the Java OBO parser guide, which gives details of the OBO parser implemented as part of OBO-Edit, and how to use it.

Please note that OBO 1.2 is the file format used and recommended by the GO Consortium.

OBO 1.0 Format Syntax

Document Structure

The format extends the tag-value format of the GO definitions file, with a few modifications. One important difference is that unrecognized tags in any context do not necessarily generate fatal errors (although some parsers may choose to treat them as fatal; see Parser Requirements below). This allows parsers to read and work with files that contain information not used by a particular tool.

A document in this format is structured as follows:

<header>
<stanza>
<stanza>
...

A "stanza" is a labeled section of the document, indicating that an object of a particular type is being described. Stanzas are structured as follows:

[<Object type>]
<tag>: <value>
<tag>: <value>
...

Blank lines are ignored.
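
The structure above can be sketched in a few lines of Python. This is only an illustration, not the OBO-Edit parser: the function name split_document is hypothetical, and the sketch assumes that any line of the form [...] opens a stanza and that ! comments and \<newline> continuations have already been dealt with.

```python
def split_document(text):
    """Split OBO text into (header_lines, stanzas).

    Each stanza is a (stanza_type, lines) pair. Hypothetical helper,
    not part of any real OBO library.
    """
    header, stanzas = [], []
    current = header  # lines before the first [Type] line belong to the header
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines are ignored
        if line.startswith("[") and line.endswith("]"):
            current = []  # start collecting lines for a new stanza
            stanzas.append((line[1:-1], current))
        else:
            current.append(line)
    return header, stanzas


sample = """format-version: GO_1.0

[Term]
id: CAR:0000001
name: Volkswagen Beetle

[Typedef]
id: part_of
"""
header, stanzas = split_document(sample)
```

Here header holds the single format-version line, and stanzas holds one Term entry and one Typedef entry, each with its tag-value lines still unparsed.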

Tag-Value Pairs

Each stanza consists of a series of tag-value pairs. Tag-value pairs consist of a tag name, an unescaped colon, the tag value, and then a newline:

<tag name>:<tag value>

All tag-value pairs occur on a single line, unless the lines are broken by \<newline> combinations.

Both the tag name and tag value may contain the following escape characters:

Escape         Meaning
\n             newline
\W             whitespace
\t             tab
\:             colon
\,             comma
\"             double quote
\\             backslash
\(  \)         parentheses
\[  \]         brackets
\{  \}         curly brackets
\<newline>     <no value>

Note that a character is escaped only when it should carry no special meaning for the parser. Some tag values that require additional parsing may contain unescaped colons, brackets, quotes, etc., that have meaning in decoding the tag value. Unescaped spaces between the separator colon and the start of the tag value are discarded.

In practice, the OBO parser will assume that any character following a backslash is an escaped character. So \a and \? are legal escape sequences that translate to "a" and "?" respectively.
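
The escape rules can be sketched as follows. This is a minimal illustration rather than the OBO-Edit implementation: the names ESCAPES and unescape are hypothetical, and the sketch assumes that \W stands for a single space character.

```python
# Escapes whose replacement differs from the escaped character itself.
# Assumption: \W is decoded as one space.
ESCAPES = {"n": "\n", "W": " ", "t": "\t"}


def unescape(value):
    """Decode backslash escapes in a tag name or tag value."""
    out = []
    i = 0
    while i < len(value):
        ch = value[i]
        if ch == "\\" and i + 1 < len(value):
            nxt = value[i + 1]
            if nxt != "\n":
                # Any other escaped character stands for itself (\a -> a).
                out.append(ESCAPES.get(nxt, nxt))
            # A backslash followed by a real newline contributes nothing.
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

For example, unescape applied to the four characters \a\? yields the two characters a?, matching the lenient behavior described above.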

Some example tag-value pairs:

[Term]
id: GO:0019383
name: (+)-camphor catabolism
def: "The catabolism of (+)-camphor." [GO:ma "Michael \"Mike\" Ashburner was responsible for creating this term"]
comment: This is a gratuitous example\nof an escaped newline
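
Splitting a line such as those above into its tag and value might look like the following sketch. The function name split_tag_value is hypothetical; the error messages are borrowed from the Line parsing errors list under Parser Requirements. Escape sequences are left undecoded here, so that later parsing stages can still tell escaped characters from meaningful ones.

```python
def split_tag_value(line):
    """Split 'tag: value' at the first unescaped colon.

    Returns (tag, value) with escapes left undecoded.
    """
    i = 0
    while i < len(line):
        if line[i] == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if line[i] == ":":
            tag = line[:i]
            # Unescaped spaces after the separator colon are discarded.
            value = line[i + 1:].lstrip(" ")
            if not value:
                raise ValueError("Tag has no value")
            return tag, value
        i += 1
    raise ValueError("Cannot find key-terminating colon")
```

Only the first unescaped colon separates tag from value, so split_tag_value("id: GO:0019383") returns the pair ("id", "GO:0019383") with the colon inside the id left alone.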

Document Header Tags

The document header consists of a series of tag-value pairs. The format-version tag must be the first tag in the header. The other tags may appear in any order.

Required tags

format-version
Gives the version of the encoding of this flat file format. This tag allows parsers to handle a variety of different flat file formats, even if the basic structure of the flat files changes.
typeref
A URL pointing to a type description document. A type description is a document in this flat file format containing relationship type definition information. Any document that contains tags requiring non-built-in relationship types (such as the "relationship" tag) must contain a typeref line. Please see "Relationship Types" below for more information.

Optional tags

version
The version of this particular file
date
The current date in dd:mm:yyyy hh:mm format
saved-by
The username of the person to last save this file
auto-generated-by
The program that generated this file
default-namespace
The namespace to which terms will be assigned, unless a different namespace is specified (using the namespace tag in a term stanza).
remark
General comments for this file
subsetdef
A description of a term subset. The value for this tag should contain a subset name, a space, and a quote enclosed subset description.
Example:

subsetdef: goslim_generic "Generic GO Slim"
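
A subsetdef value could be decoded along these lines. This is a sketch only: parse_subsetdef is a hypothetical name, and the sketch assumes the subset name contains no spaces and the description contains no escaped quotes.

```python
def parse_subsetdef(value):
    """Split a subsetdef value into (subset_name, description)."""
    name, sep, rest = value.partition(" ")
    rest = rest.strip()
    quoted = len(rest) >= 2 and rest.startswith('"') and rest.endswith('"')
    if not sep or not quoted:
        raise ValueError("Expected quoted string")
    return name, rest[1:-1]
```

Applied to the example above, parse_subsetdef('goslim_generic "Generic GO Slim"') gives the name goslim_generic and the description Generic GO Slim.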

Some optional tags may not survive round-tripping through a data adapter. See Parser Requirements below.

Stanza Types

Currently, the flat file format supports two stanza types: [Term] and [Typedef]. Unrecognized stanza types must survive round-tripping.

Tags in a [Term] stanza

The term description section is a collection of tag-value pairs. Each term description begins with an id tag. The value of the id tag announces the term to which all the following tags in the term description refer.

A term description does not have to be complete. A term may contain multiple descriptions in a single file (or multiple descriptions in multiple files). Each description may provide additional information about a term. This makes it very simple for parsers to read in specialized or optional information (such as InterPro mappings, for example).

This means that parsers must wait until the end of the parse batch to check whether required information is missing. Multiple descriptions may produce parse errors if:

  • A description contradicts a previous description (i.e., one term description gives a different term name than another description).
  • A parser has processed all the files in a batch, but a term is still missing some required value (such as a name).
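
The two error conditions above can be illustrated with a small merge routine. This is a sketch under simplifying assumptions: merge_descriptions is a hypothetical name, each description is reduced to a dictionary of single-valued tags, and name is treated as the only required value.

```python
def merge_descriptions(descriptions):
    """Merge (term_id, {tag: value}) pairs collected from a parse batch."""
    terms = {}
    for term_id, tags in descriptions:
        merged = terms.setdefault(term_id, {})
        for tag, value in tags.items():
            # A later description may add information but not contradict it.
            if tag in merged and merged[tag] != value:
                raise ValueError(f"{term_id}: conflicting values for {tag!r}")
            merged[tag] = value
    # Required values can only be checked once the whole batch has been read.
    missing = sorted(t for t, tags in terms.items() if "name" not in tags)
    if missing:
        raise ValueError("terms missing a name: " + ", ".join(missing))
    return terms
```

The check for missing names runs only after the loop, since a later file in the batch may still supply the name for a term first seen without one.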

Required tags

id
The unique id of the current term. This can be any string. This tag must always be the first tag in any term description.
Example:

id: CAR:0000001

name
The term name. Any term may only have one name defined. If multiple term names are defined, it is a parse error.
Example:

name: Volkswagen Beetle

Optional tags

alt_id
Defines an alternate id for this term. A term may have any number of alternate ids.
Example:

alt_id: CAR:0000666

namespace
The namespace in which the term belongs. If this tag is not present, the term will be assigned to the default-namespace specified in the file header stanza.
Example:

namespace: car_ontology

def
The definition of the current term. There must be zero or one instances of this tag per term description. More than one definition for a term generates a parse error. The value of this tag should be the quote enclosed definition text, followed by a dbxref list containing dbxrefs that describe the origin of this definition (see Dbxref Formatting for information on how dbxref lists are encoded).
Example:

def: "The Volkswagen Beetle or Bug is a small family car, the best known car of Volkswagen, of Germany, and almost certainly the world. Thanks to its distinctive shape and sound, its reliability, and presumably other factors, it now enjoys a cult status." [http://en.wikipedia.org/ "Wikipedia", VW:0283 ""]

comment
A comment for this term. There must be zero or one instances of this tag per term description. More than one comment for a term generates a parse error.
Example:

comment: Note that this term refers to both the old and new (post-1998) Beetles.

subset
This tag indicates a term subset to which this term belongs. The value of this tag must be a subset name as defined in a subsetdef tag in the file header. If the value of this tag is not mentioned in a subsetdef tag, a parse error will be generated. A term may belong to any number of subsets.
Example:

subset: classic_cars

synonym
This tag gives a synonym for the term; whether the synonym is exact, broad, narrow, or otherwise related to the term is not specified. The value of this tag should be the quote enclosed synonym text, followed by an optional dbxref list containing dbxrefs that describe the origin of this synonym (see Dbxref Formatting for information on how dbxref lists are encoded). A term may have any number of synonyms.
Example:

synonym: "The Bug" [VEH:391840]

related_synonym
exact_synonym
broad_synonym
narrow_synonym
These tags give a synonym for the term of the specified type; see the documentation on synonyms for information on synonym types. The value of the tag should be the quote enclosed synonym text, followed by an optional dbxref list containing dbxrefs that describe the origin of this synonym (see Dbxref Formatting for information on how dbxref lists are encoded). A term may have any number of synonyms of each type.
Example:

exact_synonym: "VW Bug" [VW:0283, TPT:938VWB]
related_synonym: "Type 1" []

xref_analog
A dbxref that describes an analogous object in another vocabulary (see Dbxref Formatting for information about how the value of this tag must be formatted). A term may have any number of analogous xrefs.
Example:

xref_analog: VW:0283

xref_unknown
A dbxref with an unknown type (see Dbxref Formatting for information about how the value of this tag must be formatted). A term may have any number of unknown typed xrefs. This tag should be avoided where possible (see Parser Requirements for information about how parsers may handle this tag).
is_a
This tag describes a subclassing relationship between one term and another. A term may have any number of is_a relationships. Terms with no is_a relationships are roots. A term with no is_a relationships may not specify any relationship tags; doing so is a parse error.
Example:

is_a: CAR:0009478

relationship
This tag describes a typed relationship between this term and another term. The value of this tag should be the relationship type id, followed by the id of the target term. The relationship type must be defined in a [Typedef] stanza. The typedef must occur either in a document in the current parse batch or in a file imported via a typeref header tag. If the relationship type is undefined, a parse error will be generated. If the id of the target term cannot be resolved by the end of parsing the current batch of files, this tag describes a "dangling reference". See Parser Requirements for information about how a parser may handle dangling references. If a relationship is specified for a term with an is_obsolete value of true, a parse error will be generated. If a relationship target is a term which is obsolete, a parse error will be generated.
Example:

relationship: part_of CAR:0009478

is_obsolete
This tag indicates whether or not the term is obsolete. Allowable values are "true" and "false" (false is assumed if this tag is not present). Obsolete terms must have no relationship or is_a tags.
Example:

is_obsolete: true

use_term
This tag indicates which term to use instead of an obsolete term. The value of this tag is the id of another term. If the tag value refers to a term that is not specified in the current load batch, it is a "dangling reference" (see Parser Requirements). If this tag is specified and the is_obsolete value for the current term is not true, a parse error will be generated. This tag is not required for terms that specify the is_obsolete tag, but it is recommended (some parsers may choose to issue warnings about obsolete terms that do not specify a replacement term). An obsolete term may have any number of use_term tags.
domain
This tag determines the children that can be assigned to relationships with this type. If the domain is set, term relationships with this type may only have children that are the same as, or subclasses of, the domain term.
Note: This tag is not used in GO at present, although it is available for use in OBO-formatted files for any ontology.
range
This tag specifies the parents that can be assigned to relationships with this type. If the range is set, term relationships with this type may only have parents that are the same as, or subclasses of, the range term.
Note: This tag is not used in GO at present, although it is available for use in OBO-formatted files for any ontology.
is_cyclic
This tag indicates that it is legal to create cycles out of this relationship.
Note: This tag is not used in GO at present, although it is available for use in OBO-formatted files for any ontology.
is_transitive
This tag indicates that the relationship is marked as transitive. This information is very useful to reasoners and other automatic traversals of the graph.
Note: This tag is not used in GO at present, although it is available for use in OBO-formatted files for any ontology.
is_symmetric
This tag indicates that the relationship is marked as symmetric (meaning that if the relationship holds from the child to parent, it also holds from parent to child). This information is very useful to reasoners and other automatic traversals of the graph.
Note: This tag is not used in GO at present, although it is available for use in OBO-formatted files for any ontology.
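
Several of the constraints stated in the tag descriptions above can be checked once all descriptions of a term have been merged. The following sketch is illustrative only: check_term is a hypothetical name, the term is represented as a list of (tag, value) pairs in file order, and only a handful of the rules are covered.

```python
from collections import Counter

# Tags that may appear at most once per term.
SINGLE_VALUED = ("name", "def", "comment")


def check_term(pairs):
    """Return a list of error messages for one term's (tag, value) pairs."""
    errors = []
    counts = Counter(tag for tag, _ in pairs)
    if not pairs or pairs[0][0] != "id":
        errors.append("id must be the first tag")
    for tag in SINGLE_VALUED:
        if counts[tag] > 1:
            errors.append(f"more than one {tag} tag")
    obsolete = ("is_obsolete", "true") in pairs
    if obsolete and (counts["is_a"] or counts["relationship"]):
        errors.append("obsolete term has is_a or relationship tags")
    if counts["use_term"] and not obsolete:
        errors.append("use_term on a term that is not obsolete")
    if counts["relationship"] and not counts["is_a"]:
        errors.append("relationship tag on a term with no is_a tags")
    return errors
```

An empty list means none of the checked rules were violated. Rules that depend on other terms, such as whether a relationship target is obsolete, need the whole batch and are not attempted here.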

Dbxref Formatting

Dbxref definitions take the following form:

<dbxref name>

or

<dbxref name> "<dbxref description>"

By convention, the dbxref name is a colon separated key-value pair, but this is not a requirement. If provided, the dbxref description is a string of zero or more characters describing the dbxref. An example of a dbxref would be:

GO:ma "Term written by Michael\, without reference to other sources"

Dbxref lists are used when a tag value must contain several dbxrefs. Dbxref lists take the following form:

[<dbxref definition>, <dbxref definition>, ...]

The brackets may contain zero or more comma separated dbxref definitions. An example of a dbxref list would be:

[GO:ma, GO:mah "Midori created this term"]
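
A dbxref list in this form could be parsed roughly as follows. This is a sketch, not the OBO-Edit parser: parse_dbxref_list is a hypothetical name, escaped characters are decoded literally, any text after the closing bracket is ignored, and a dbxref without a description is reported with None. The error messages are taken from the Line parsing errors list under Parser Requirements.

```python
def parse_dbxref_list(text):
    """Parse '[name "desc", name2]' into a list of (name, description)."""
    text = text.strip()
    if not text.startswith("["):
        raise ValueError("Expected dbxref list")
    items = []
    name, desc = [], None
    in_quotes = closed = False
    i = 1
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            # An escaped character is taken literally.
            (desc if in_quotes else name).append(text[i + 1])
            i += 2
            continue
        if in_quotes:
            if ch == '"':
                in_quotes = False
            else:
                desc.append(ch)
        elif ch == '"':
            in_quotes, desc = True, []
        elif ch in ",]":
            stripped = "".join(name).strip()
            if stripped:
                items.append((stripped, None if desc is None else "".join(desc)))
            elif desc is not None or ch == ",":
                raise ValueError("Malformed dbxref list")
            name, desc = [], None
            if ch == "]":
                closed = True
                break
        else:
            name.append(ch)
        i += 1
    if in_quotes:
        raise ValueError("Unclosed quoted string")
    if not closed:
        raise ValueError("Unclosed dbxref list")
    return items
```

For the example list above, the result is one entry for GO:ma with no description and one entry for GO:mah with the description "Midori created this term".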

Tags in [Typedef] Stanza

[Typedef] stanzas support all the same tags as a [Term] stanza. They just describe different classes of objects.

Parser Requirements

This section of the document attempts to establish guidelines for parser behavior. By establishing guidelines, we can ensure some consistency in operation between parsers.

General behavior

All parsers should be capable of failing gracefully and generating errors explaining the failure. Parsers may optionally be capable of generating warnings, if the file being read contains non-fatal errors.

Line parsing errors

All parsers should be able to detect various types of line parsing errors:

  • "Cannot find key-terminating colon": A line does not contain a colon to indicate the end of the tag name.
  • "Tag has no value": There is no value to the right of the key tag colon.
  • "Unrecognized escape character": An unrecognized escape sequence has been used.
  • "Unexpected end of line": More data was expected, but the input line ended.
  • "Expected quoted string": A quoted string was expected, but something else was found.
  • "Unclosed quoted string": A quoted string was found in an appropriate place, but no closing quote was encountered.
  • "Expected dbxref list": A dbxref list was expected, but something else was found.
  • "Malformed dbxref list": A dbxref list was found in an appropriate place, but it was not well formed.
  • "Unclosed dbxref list": A dbxref list was found in an appropriate place, but no closing bracket was found.

! Comments

A file may contain any number of lines beginning with !, at any point in the file. These lines may be ignored by a parser, or may be read and stored. Parsers are not required to support the round-tripping of ! comments; they may be entirely discarded.

Unrecognized tags

A parser may do one of three things when an unrecognized tag is found:

  • FAIL: Report a fatal error and terminate parsing.
  • WARN: Report a warning, but continue parsing and ignore the unrecognized tag (recommended).
  • IGNORE: Silently ignore the unrecognized tag.
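
The three behaviors might be exposed as a parser option, as in the sketch below. The names Policy, KNOWN_TERM_TAGS and filter_tags are hypothetical; the set of known tags is simply the list of [Term] tags described in this document.

```python
import warnings
from enum import Enum


class Policy(Enum):
    FAIL = "fail"
    WARN = "warn"
    IGNORE = "ignore"


# Tags described in this document; anything else counts as unrecognized.
KNOWN_TERM_TAGS = {
    "id", "name", "alt_id", "namespace", "def", "comment", "subset",
    "synonym", "related_synonym", "exact_synonym", "narrow_synonym",
    "broad_synonym", "xref_analog", "xref_unknown", "is_a", "relationship",
    "is_obsolete", "use_term", "domain", "range", "is_cyclic",
    "is_transitive", "is_symmetric",
}


def filter_tags(pairs, policy=Policy.WARN):
    """Return the recognized (tag, value) pairs, applying the given policy."""
    kept = []
    for tag, value in pairs:
        if tag in KNOWN_TERM_TAGS:
            kept.append((tag, value))
        elif policy is Policy.FAIL:
            raise ValueError(f"Unrecognized tag: {tag}")
        elif policy is Policy.WARN:
            warnings.warn(f"Unrecognized tag ignored: {tag}")
        # Policy.IGNORE: drop the tag without comment.
    return kept
```

WARN is the default in this sketch because it is the recommended behavior above.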

Optional header tags

Optional header tags need not survive round-tripping. Parsers may choose to preserve the values of these tags, or may choose to ignore these values when reading, but generate their own values when writing a file.

Dbxrefs with unknown types

Unknown dbxrefs can be handled in one of three ways:

  • FAIL: Report a fatal error and terminate parsing.
  • WARN: Report a warning, but continue parsing and read the untyped dbxref (recommended).
  • IGNORE: Silently read the unknown dbxref.

Dangling references

There are several options when a dangling reference is encountered:

  • FAIL: Report a fatal error and terminate parsing.
  • WARN_AND_IGNORE: Report a warning and ignore the dangling reference.
  • WARN_AND_READ: Report a warning and read in the dangling reference, storing it in a form suitable for round-tripping.
  • READ: Silently read the dangling reference (recommended).
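
Before any of these options can be applied, the parser has to detect the dangling references, which is only possible at the end of the batch. A minimal sketch, assuming a hypothetical find_dangling helper and a mapping from each term id to the ids it references through is_a, relationship or use_term tags:

```python
def find_dangling(references):
    """Map each term id to the referenced ids that were never defined."""
    dangling = {}
    for term_id, targets in references.items():
        missing = [t for t in targets if t not in references]
        if missing:
            dangling[term_id] = missing
    return dangling


# The is_a links of the three terms in the example file of this document.
refs = {
    "GO:0003674": [],
    "GO:0016209": ["GO:0003674"],
    "GO:0045174": ["GO:0009055", "GO:0015038", "GO:0016672"],
}
dangling = find_dangling(refs)
```

Read on its own, the example file contains three dangling references, all from GO:0045174, because its parents are not defined in the file.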

Graph Structure

Unlike the deprecated GO flat file format, this flat file format does not describe a rooted, directed acyclic graph. This flat file format describes unrooted, possibly cyclic, directed graphs.

Parsers that read this flat file format must be capable of handling the possibility of a cyclic structure and the existence of multiple roots (or no roots at all, in some cyclic graphs).

Some tools require a single root for any graph they manipulate. Such tools may connect multiple roots to a single dummy root node. However, these tools must not output this dummy root when reserializing a graph.
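
The dummy-root approach can be sketched as below. All names are hypothetical (find_roots, with_dummy_root, without_dummy_root and the DUMMY_ROOT id), and the graph is reduced to a mapping from each term id to the list of its is_a parents.

```python
DUMMY_ROOT = "__dummy_root__"  # hypothetical id, never written to a file


def find_roots(is_a):
    """Terms with no is_a parents are roots; a cyclic graph may have none."""
    return sorted(t for t, parents in is_a.items() if not parents)


def with_dummy_root(is_a):
    """Attach every root to one dummy root if there are several roots."""
    if len(find_roots(is_a)) <= 1:
        return {t: list(p) for t, p in is_a.items()}
    linked = {t: (list(p) if p else [DUMMY_ROOT]) for t, p in is_a.items()}
    linked[DUMMY_ROOT] = []
    return linked


def without_dummy_root(is_a):
    """Undo with_dummy_root before the graph is serialized."""
    return {
        t: [p for p in parents if p != DUMMY_ROOT]
        for t, parents in is_a.items()
        if t != DUMMY_ROOT
    }
```

Calling without_dummy_root on the output of with_dummy_root restores the original mapping, which is what keeps the dummy node out of reserialized files.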

Dbxref descriptions

The same dbxref may occur several times with different descriptions. It is up to the parser to determine the semantics of this situation. The parser may treat these as separate dbxrefs, with different descriptions in different contexts. Alternatively, the parser may treat these as references to a single dbxref. In that case, it is up to the parser to decide how to treat multiple descriptions (it is recommended that the parser issue a warning).

Serializer Conventions

Any parser should be able to read correctly formatted files in any layout. However, it is suggested that serializers obey the following conventions to ensure consistency.

General conventions

  1. Within a single file, all tags relating to a single entity should appear in the same stanza (thereby minimizing the total number of stanzas and keeping all tags regarding a single entity in the same place)
  2. All tags not specified in this document should appear after the tags that are described above, and should be ordered alphabetically, first on the tag name, then on the tag value. If there are two or more tags with the same name (whether described in this document or not), they should be ordered alphabetically on the tag value.

Stanza conventions

All new stanza declarations should be preceded by a blank line. All stanzas of any type should be ordered alphabetically by id (this is to ensure CVS compatibility).

Header tags

Header tags should appear in the following order:

  1. format-version
  2. version
  3. date
  4. saved-by
  5. auto-generated-by
  6. typeref
  7. subsetdef
  8. remark

Ordering Term and Typedef stanzas

Term and Typedef stanzas should be serialized in alphabetical order on the value of their id tag.

Term and Typedef tags

Term and typedef tags should appear in the following order:

  1. id
  2. name
  3. alt_id
  4. namespace
  5. def
  6. comment
  7. subset
  8. synonym
  9. related_synonym
  10. exact_synonym
  11. narrow_synonym
  12. broad_synonym
  13. xref_analog
  14. xref_unknown
  15. is_a
  16. relationship
  17. is_obsolete
  18. use_term
  19. domain
  20. range
  21. is_cyclic
  22. is_transitive
  23. is_symmetric
  24. <any unknown properties>

Relationship tags should be ordered alphabetically on type name, then target id.
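
A serializer could implement this ordering with a single sort key, as in the sketch below. The names TAG_ORDER, RANK and sort_tags are hypothetical. Sorting relationship values as plain strings is assumed to be close enough to "type name, then target id", since the type name comes first in the value.

```python
TAG_ORDER = [
    "id", "name", "alt_id", "namespace", "def", "comment", "subset",
    "synonym", "related_synonym", "exact_synonym", "narrow_synonym",
    "broad_synonym", "xref_analog", "xref_unknown", "is_a", "relationship",
    "is_obsolete", "use_term", "domain", "range", "is_cyclic",
    "is_transitive", "is_symmetric",
]
RANK = {tag: i for i, tag in enumerate(TAG_ORDER)}


def sort_tags(pairs):
    """Order (tag, value) pairs for serialization."""
    def key(pair):
        tag, value = pair
        if tag in RANK:
            # Known tags: listed position first, then value.
            return (RANK[tag], "", value)
        # Unknown tags go last, alphabetically by name, then value.
        return (len(TAG_ORDER), tag, value)
    return sorted(pairs, key=key)
```

Repeated tags such as is_a end up ordered by value, and tags not described in this document are placed after all known tags.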

Dbxref lists

Values in dbxref lists should be ordered alphabetically on the dbxref name.

An example file

The following is a simple example file showing a very small subset of the function ontology.

format-version: GO_1.0
!any comment here
typeref: relationship.types
subsetdef: goslim "Generic GO Slim"
version: $Revision: 1.18 $
date: April 18th, 2003
saved-by: jrichter
remark: Example file

[Term]
id: GO:0003674
name: molecular_function
def: "The action characteristic of a gene product." [GO:curators]
subset: goslim

[Term]
id: GO:0016209
name: antioxidant activity
is_a: GO:0003674
def: "Inhibition of the reactions brought about by dioxygen or peroxides. Usually the antioxidant is effective because it can itself be more easily oxidized than the substance protected. The term is often applied to components that can trap free radicals, thereby breaking the chain reaction that normally leads to extensive biological damage." [ISBN:0198506732]

[Term]
id: GO:0045174
name: glutathione dehydrogenase (ascorbate) activity
xref_analog: EC:1.8.5.1 ""
def: "Catalysis of the reaction: 2 glutathione + dehydroascorbate = glutathione disulfide + ascorbate." [EC:1.8.5.1]
synonym: "dehydroascorbate reductase" []
is_a: GO:0009055
is_a: GO:0015038
is_a: GO:0016672
