CoNLL-U Format
This module reads and parses data in the standard CoNLL-U format used by Universal Dependencies. CoNLL-U is a standard format for annotating data at the sentence level and at the word/token level. Annotations in CoNLL-U format follow these rules:
- Word lines contain the annotations of a word/token in 10 fields separated by single tab characters
- Blank lines mark sentence boundaries
- Comment lines start with hash (#)
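The three rules above can be sketched with the standard library alone. This is a minimal illustration, not the urduhack implementation: `SAMPLE` and `split_sentences` are hypothetical names introduced here, and the sample sentences are made up for the example.

```python
from typing import Iterator, List, Tuple

# A tiny, made-up CoNLL-U string with two sentences.
SAMPLE = (
    "# sent_id = 1\n"
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
    "# sent_id = 2\n"
    "1\tHello\thello\tINTJ\t_\t_\t0\troot\t_\t_\n"
    "\n"
)


def split_sentences(text: str) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (comment_lines, word_lines) for each sentence in a CoNLL-U string."""
    comments: List[str] = []
    words: List[str] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence, if one is open.
            if comments or words:
                yield comments, words
                comments, words = [], []
        elif line.startswith("#"):
            # Comment lines start with a hash.
            comments.append(line)
        else:
            # Everything else is a word line with tab-separated fields.
            words.append(line)
    # Tolerate input that lacks the final blank line.
    if comments or words:
        yield comments, words


sentences = list(split_sentences(SAMPLE))
for comments, words in sentences:
    print(len(comments), "comment line(s),", len(words), "word line(s)")
```

Returning comments and word lines separately keeps sentence-level metadata apart from token-level annotations, which matches the two annotation levels described above.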
Each word/token has 10 fields defined by the CoNLL-U format. Each field represents a different attribute of the token, as detailed below:
Fields
1. ID:
- ID represents the word/token index in the sentence
2. FORM:
- Word/token form or punctuation symbol used in the sentence
3. LEMMA:
- Root/stem of the word
4. UPOS:
- Universal Part-of-Speech tag
5. XPOS:
- Language-specific part-of-speech tag; underscore if not available
6. FEATS:
- List of morphological features from the universal feature inventory or from a defined language-specific extension
7. HEAD:
- Head of the current word, which is either a value of ID or zero (0)
8. DEPREL:
- Universal dependency relation to the HEAD (root if HEAD=0) or a defined language-specific subtype of one
9. DEPS:
- Enhanced dependency graph in the form of a list of head-deprel pairs
10. MISC:
- Any other annotation apart from the above mentioned fields
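As a rough sketch of how the ten fields map onto a single word line, the snippet below splits a line on tabs and pairs each value with its field name. The names `FIELDS`, `parse_word_line`, and `parse_feats` are hypothetical helpers written for this example and are not part of urduhack. It assumes the usual Universal Dependencies encoding of FEATS as `Name=Value` pairs joined by `|`, with `_` standing for an empty field.

```python
from typing import Dict, List

# The ten CoNLL-U field names, in column order.
FIELDS: List[str] = [
    "ID", "FORM", "LEMMA", "UPOS", "XPOS",
    "FEATS", "HEAD", "DEPREL", "DEPS", "MISC",
]


def parse_word_line(line: str) -> Dict[str, str]:
    """Map one tab-separated word line onto the ten field names."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


def parse_feats(feats: str) -> Dict[str, str]:
    """Split a FEATS value such as 'Case=Nom|Number=Sing' into a dict."""
    if feats == "_":
        # An underscore marks an unspecified field.
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


token = parse_word_line("1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_")
print(token["FORM"], token["UPOS"], parse_feats(token["FEATS"]))
```

All values are kept as strings here. ID and HEAD are left unconverted on purpose, since converting them to integers is a separate decision that depends on how the data is used.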
class urduhack.conll.CoNLL
A CoNLL class to easily load data in CoNLL-U format. It can also load resources by iterating over a string. This class is the main entry point to the conll module’s functionality.
static get_fields() → List[str]
Get the list of CoNLL-U fields.
Returns: The list of CoNLL-U fields.
Return type: List[str]
static iter_file(file_name: str) → Iterator[Tuple]
Iterate over a CoNLL-U file’s sentences.
Parameters: file_name (str) – The name of the file whose sentences should be iterated over.
Yields: Iterator[Tuple] – The sentences that make up the CoNLL-U file.
Raises:
- IOError – If there is an error opening the file.
- ParseError – If there is an error parsing the input into a Conll object.
static iter_string(text: str) → Iterator[Tuple]
Iterate over a CoNLL-U string’s sentences.
Use this method if you only need to iterate over the CoNLL-U data once and do not need to create or store the Conll object.
Parameters: text (str) – The CoNLL-U string.
Yields: Iterator[Tuple] – The sentences that make up the CoNLL-U string.
Raises: ParseError – If there is an error parsing the input into a Conll object.
static load_file(file_name: str) → List[Tuple]
Load a CoNLL-U file given its location.
Parameters: file_name (str) – The location of the file.
Returns: A Conll object equivalent to the provided file.
Return type: List[Tuple]
Raises:
- IOError – If there is an error opening the given filename.
- ValueError – If there is an error parsing the input into a Conll object.
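The difference between the iterating and loading styles documented above (`Iterator[Tuple]` versus `List[Tuple]`) can be illustrated with a standard-library sketch. This is not how urduhack implements `iter_file` or `load_file`. The functions `iter_sentence_blocks` and `load_sentence_blocks` are hypothetical stand-ins, and the temporary file exists only to make the example self-contained.

```python
import os
import tempfile
from typing import Iterator, List


def iter_sentence_blocks(path: str) -> Iterator[List[str]]:
    """Lazily yield each sentence as a list of its non-blank lines."""
    block: List[str] = []
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            line = raw.rstrip("\n")
            if line.strip():
                block.append(line)
            elif block:
                # A blank line ends the current sentence.
                yield block
                block = []
    if block:
        yield block


def load_sentence_blocks(path: str) -> List[List[str]]:
    """Eagerly read every sentence into memory."""
    return list(iter_sentence_blocks(path))


# Write a tiny two-sentence file so the example runs anywhere.
content = (
    "1\tHello\thello\tINTJ\t_\t_\t0\troot\t_\t_\n"
    "\n"
    "1\tBye\tbye\tINTJ\t_\t_\t0\troot\t_\t_\n"
    "\n"
)
with tempfile.NamedTemporaryFile(
    "w", suffix=".conllu", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(content)
    tmp_path = tmp.name

try:
    gen = iter_sentence_blocks(tmp_path)
    first = next(gen)   # reads only as far as the first sentence
    gen.close()         # release the file handle before cleanup
    everything = load_sentence_blocks(tmp_path)  # reads the whole file
finally:
    os.remove(tmp_path)

print(len(first), len(everything))
```

Iterating suits a single pass over a large file, while loading is convenient when the sentences need to be indexed or traversed more than once.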
-
static