CoNLL-U Format

This module reads and parses data in the standard CoNLL-U format as provided by Universal Dependencies. CoNLL-U is a standard format for annotating data at the sentence level and at the word/token level. Annotations in CoNLL-U format follow these rules:

  1. Word lines contain the annotation of a word/token in 10 fields separated by single tab characters
  2. Blank lines mark sentence boundaries
  3. Comment lines start with hash (#)
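
The three line types above can be told apart with a few lines of plain Python. The following is a minimal sketch and not the urduhack implementation: the function name `split_sentences` and the sample sentence are assumptions made for illustration only.

```python
from typing import List, Tuple

# Hypothetical sample: two comment lines, four word lines, one blank line.
SAMPLE = (
    "# sent_id = 1\n"
    "# text = They buy books.\n"
    "1\tThey\tthey\tPRON\tPRP\tCase=Nom|Number=Plur\t2\tnsubj\t_\t_\n"
    "2\tbuy\tbuy\tVERB\tVBP\tNumber=Plur|Person=3|Tense=Pres\t0\troot\t_\t_\n"
    "3\tbooks\tbook\tNOUN\tNNS\tNumber=Plur\t2\tobj\t_\tSpaceAfter=No\n"
    "4\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_\n"
    "\n"
)


def split_sentences(text: str) -> List[Tuple[List[str], List[List[str]]]]:
    """Return one (comments, word_lines) pair per sentence (illustrative helper)."""
    sentences = []
    comments: List[str] = []
    words: List[List[str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence, if there is one.
            if comments or words:
                sentences.append((comments, words))
                comments, words = [], []
        elif line.startswith("#"):
            # Comment line: keep the text after the hash.
            comments.append(line[1:].strip())
        else:
            # Word line: fields are separated by single tab characters.
            words.append(line.split("\t"))
    # Keep a final sentence that is not followed by a blank line.
    if comments or words:
        sentences.append((comments, words))
    return sentences
```

Calling `split_sentences(SAMPLE)` gives a single sentence with two comments and four word lines of 10 fields each.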

Each word/token has 10 fields defined in the CoNLL-U format. Each field represents a different attribute of the token, as described below:

Fields

1. ID:
ID represents the word/token index in the sentence
2. FORM:
Word/token form or punctuation symbol used in the sentence
3. LEMMA:
Lemma or stem of the word form
4. UPOS:
Universal Part-of-Speech tag
5. XPOS:
Language-specific part-of-speech tag; underscore if not available
6. FEATS:
List of morphological features from the universal feature inventory or from a defined language-specific extension
7. HEAD:
Head of the current word, which is either a value of ID or zero (0)
8. DEPREL:
Universal dependency relation to the HEAD (root if HEAD = 0) or a defined language-specific subtype of one
9. DEPS:
Enhanced dependency graph in the form of a list of head-deprel pairs
10. MISC:
Any other annotation apart from the above mentioned fields
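
To make the field list concrete, here is a hedged sketch that maps one word line onto the 10 field names. The helper `parse_word_line` and the `FIELDS` constant are hypothetical names, not part of urduhack. Unpacking FEATS into a dictionary and converting HEAD to an integer are illustrative choices rather than documented behaviour of this module.

```python
from typing import Any, Dict

# The 10 CoNLL-U field names, in column order.
FIELDS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
          "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]


def parse_word_line(line: str) -> Dict[str, Any]:
    """Map a tab-separated word line to a field dictionary (illustrative helper)."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    token: Dict[str, Any] = dict(zip(FIELDS, values))

    # ID stays a string: it can also be a range such as "1-2"
    # for multiword tokens, so int() would not always work.

    # FEATS is "_" when empty, otherwise "Name=Value" pairs joined by "|".
    feats = token["FEATS"]
    token["FEATS"] = (
        {} if feats == "_"
        else dict(pair.split("=", 1) for pair in feats.split("|"))
    )

    # HEAD is an ID value or 0 for the root; "_" means unspecified.
    head = token["HEAD"]
    token["HEAD"] = None if head == "_" else int(head)
    return token
```

For the word line of "buy" in a sentence such as "They buy books.", this yields `HEAD == 0` and `DEPREL == "root"`, matching the rule that the relation is root when HEAD is zero.
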
class urduhack.conll.CoNLL

A class for loading data in CoNLL-U format. It can load a file or iterate over a string, and it is the main entry point to the conll module’s functionality.

static get_fields() → List[str]

Get the list of CoNLL-U fields.

Returns: List of CoNLL-U fields
Return type: List[str]
static iter_file(file_name: str) → Iterator[Tuple]

Iterate over a CoNLL-U file’s sentences.

Parameters:

file_name (str) – The name of the file whose sentences should be iterated over.

Yields:

Iterator[Tuple] – The sentences that make up the CoNLL-U file.

Raises:
  • IOError – If there is an error opening the file.
  • ParseError – If there is an error parsing the input into a Conll object.
static iter_string(text: str) → Iterator[Tuple]

Iterate over a CoNLL-U string’s sentences.

Use this method if you only need to iterate over the CoNLL-U string once and do not need to create or store the Conll object.

Parameters: text (str) – The CoNLL-U string.
Yields: Iterator[Tuple] – The sentences that make up the CoNLL-U string.
Raises: ParseError – If there is an error parsing the input into a Conll object.
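
The "iterate once" advice follows from how Python iterators behave: a generator produces each sentence on demand and is exhausted after a single pass. The sketch below illustrates that general behaviour with a hypothetical generator, `iter_sentence_blocks`. It is not the urduhack parser, and the two-sentence sample is made up for the example.

```python
from typing import Iterator, List


def iter_sentence_blocks(text: str) -> Iterator[List[str]]:
    """Lazily yield the non-blank lines of each sentence (illustrative helper)."""
    block: List[str] = []
    for line in text.splitlines():
        if line.strip():
            block.append(line)
        elif block:
            # A blank line ends the sentence collected so far.
            yield block
            block = []
    if block:
        yield block


# Hypothetical input: two one-word sentences, each with one comment line.
TWO_SENTENCES = (
    "# sent_id = 1\n"
    "1\tHi\thi\tINTJ\t_\t_\t0\troot\t_\t_\n"
    "\n"
    "# sent_id = 2\n"
    "1\tBye\tbye\tINTJ\t_\t_\t0\troot\t_\t_\n"
    "\n"
)

blocks = iter_sentence_blocks(TWO_SENTENCES)
first_pass = [len(block) for block in blocks]   # consumes the generator
second_pass = [len(block) for block in blocks]  # nothing left to yield

# Materialising the sentences in a list allows repeated access.
stored = list(iter_sentence_blocks(TWO_SENTENCES))
```

Here `first_pass` holds one entry per sentence while `second_pass` is empty. When the sentences are needed more than once, collecting them into a list, as in `stored`, is the usual alternative.
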
static load_file(file_name: str) → List[Tuple]

Load a CoNLL-U file given its location.

Parameters:

file_name (str) – The location of the file.

Returns:

A Conll object equivalent to the provided file.

Return type:

List[Tuple]

Raises:
  • IOError – If there is an error opening the given filename.
  • ValueError – If there is an error parsing the input into a Conll object.