Tokenization

This module is another crucial part of Urduhack. It performs sentence tokenization: it separates the sentences in a piece of text from one another and converts each one into a sentence token. Note that a sentence token and a word token are two different things and should not be confused.

The library also provides a state-of-the-art word tokenizer for the Urdu language. It handles spacing issues and decides where two Urdu characters should be joined and where they should not.

Tokenization of Urdu text is necessary to make it useful for NLP tasks. This module provides the following functionality:

  • Sentence Tokenization
  • Word Tokenization

The tokenization module addresses both the sentence and word tokenization problems, as shown in the sketch below.
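
A minimal sketch of how the two tokenizers fit together is given below. The sample string is taken from the examples further down, and the loop and variable names are purely illustrative:

>>> from urduhack.tokenization import sentence_tokenizer, word_tokenizer
>>> text = "عراق اور شام نے اعلان کیا ہے دونوں ممالک جلد اپنے اپنے سفیروں کو واپس بغداد اور دمشق بھیج دیں گے؟"
>>> sentences = sentence_tokenizer(text)   # split the raw text into sentence tokens
>>> for sentence in sentences:             # split each sentence into word tokens
...     tokens = word_tokenizer(sentence)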

urduhack.tokenization.sentence_tokenizer(text: str) → List[str]

Convert Urdu text into possible sentences. If successful, this function returns a list containing the Urdu sentences as strings.

Parameters: text (str) – Urdu text
Returns: A list containing the Urdu sentences as str.
Return type: list
Raises: TypeError – If text is not of type str

Examples

>>> from urduhack.tokenization import sentence_tokenizer
>>> text = "عراق اور شام نے اعلان کیا ہے دونوں ممالک جلد اپنے اپنے سفیروں کو واپس بغداد اور دمشق بھیج دیں گے؟"
>>> sentences = sentence_tokenizer(text)
>>> sentences
["دونوں ممالک جلد اپنے اپنے سفیروں کو واپس بغداد اور دمشق بھیج دیں گے؟" ,"عراق اور شام نے اعلان کیا ہے۔"]

urduhack.tokenization.word_tokenizer(sentence: str, max_len: int = 256) → List[str]

To convert raw Urdu text into word tokens, use the word_tokenizer() function. Before doing so, the sentence should also be normalized; use the urduhack.normalization.normalize() function for that. If word_tokenizer() runs successfully, it returns a list containing the Urdu word tokens as strings.

Parameters:
  • sentence (str) – Urdu text or a list of texts
  • max_len (int) – Maximum text length supported by the model
Returns: A list containing the Urdu word tokens as str.
Return type: list

Examples

>>> sent = 'عراق اور شام نے اعلان کیا ہے دونوں ممالک جلد اپنے اپنے سفیروں کو واپس بغداد اور دمشق بھیج دیں گے؟'
>>> from urduhack.tokenization import word_tokenizer
>>> word_tokenizer(sent)
Tokens:  ['عراق', 'اور', 'شام', 'نے', 'اعلان', 'کیا', 'ہے', 'دونوں', 'ممالک', 'جلد', 'اپنے', 'اپنے', 'سفیروں', 'کو', 'واپس', 'بغداد', 'اور', 'دمشق', 'بھیج', 'دیں', 'گے؟']
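
Because the documentation recommends normalizing text before word tokenization, a hedged end-to-end sketch is given below; normalize() comes from urduhack.normalization as mentioned above, and passing max_len=256 simply restates the documented default:

>>> from urduhack.normalization import normalize
>>> from urduhack.tokenization import word_tokenizer
>>> raw = 'عراق اور شام نے اعلان کیا ہے دونوں ممالک جلد اپنے اپنے سفیروں کو واپس بغداد اور دمشق بھیج دیں گے؟'
>>> normalized = normalize(raw)                        # clean up characters and spacing first
>>> tokens = word_tokenizer(normalized, max_len=256)   # then tokenize the normalized sentence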