Tokenization
This module is another crucial part of Urduhack. It performs sentence tokenization: it separates the sentences in a text from one another and converts each into a complete sentence token. Do not confuse a sentence token with a word token; they are two completely different things.
The library also provides a state-of-the-art word tokenizer for the Urdu language. It handles spacing, deciding where two Urdu characters should be joined and where they should not.
Tokenizing Urdu text is necessary to make it useful for NLP and machine learning tasks. This module addresses both parts of the problem:
- Sentence Tokenization
- Word Tokenization
- urduhack.tokenization.sentence_tokenizer(text: str) → List[str]
  Convert Urdu text into possible sentences. If successful, this function returns a List object containing multiple Urdu str sentences.
  Parameters: text (str) – Urdu text
  Returns: Returns a list object containing multiple Urdu sentences of type str.
  Return type: list
  Raises: TypeError – If text is not a str type
  Examples
  >>> from urduhack.tokenization import sentence_tokenizer
  >>> text = "عراق اور شام نے اعلان کیا ہے دونوں ممالک جلد اپنے اپنے سفیروں کو واپس بغداد اور دمشق بھیج دیں گے؟"
  >>> sentences = sentence_tokenizer(text)
  >>> sentences
  ["دونوں ممالک جلد اپنے اپنے سفیروں کو واپس بغداد اور دمشق بھیج دیں گے؟" ,"عراق اور شام نے اعلان کیا ہے۔"]
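To make the idea of a sentence token concrete, here is a minimal sketch of punctuation-based sentence splitting. It is not Urduhack's implementation, which may apply additional rules. The function name naive_sentence_split is hypothetical, and the sketch assumes that only the Urdu full stop (۔) and the Arabic question mark (؟) end a sentence. It mirrors the documented TypeError behaviour for non-str input.

```python
import re
from typing import List

# Assumption: a sentence ends only at the Urdu full stop or the Arabic question mark.
# Each match is a run of non-terminator characters plus an optional terminator.
_SENTENCE = re.compile(r"[^۔؟]+[۔؟]?")


def naive_sentence_split(text: str) -> List[str]:
    """Hypothetical stand-in for a sentence tokenizer; not the library's algorithm."""
    if not isinstance(text, str):
        raise TypeError("text must be a str")
    # Strip surrounding whitespace and drop chunks that are empty after stripping.
    chunks = (match.strip() for match in _SENTENCE.findall(text))
    return [chunk for chunk in chunks if chunk]


sample = "عراق اور شام نے اعلان کیا ہے۔ دونوں ممالک جلد اپنے سفیروں کو واپس بھیج دیں گے؟"
sentences = naive_sentence_split(sample)
```

A rule this simple fails on text that lacks terminal punctuation, which is one reason to prefer the library function for real input.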
- urduhack.tokenization.word_tokenizer(sentence: str, max_len: int = 256) → List[str]
  To convert raw Urdu text into tokens, use the word_tokenizer() function. Normalize the sentence first with the urduhack.normalization.normalize() function. If word_tokenizer runs successfully, it returns a List object containing Urdu str word tokens.
  Parameters: sentence (str), max_len (int, default 256)
  Returns: Returns a List[str] containing Urdu tokens
  Return type: List[str]
  Examples
  >>> sent = 'عراق اور شام نے اعلان کیا ہے دونوں ممالک جلد اپنے اپنے سفیروں کو واپس بغداد اور دمشق بھیج دیں گے؟'
  >>> from urduhack.tokenization import word_tokenizer
  >>> word_tokenizer(sent)
  Tokens: ['عراق', 'اور', 'شام', 'نے', 'اعلان', 'کیا', 'ہے', 'دونوں', 'ممالک', 'جلد', 'اپنے', 'اپنے', 'سفیروں', 'کو', 'واپس', 'بغداد', 'اور', 'دمشق', 'بھیج', 'دیں', 'گے؟']
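The description above asks for normalization before word tokenization. One way to arrange the steps is sketched below: split into sentences, normalize each sentence, then tokenize it into words. The helper tokenize_document is hypothetical and not part of Urduhack. It takes the three steps as callables so that the order of operations is explicit and the sketch runs without the library installed.

```python
from typing import Callable, List


def tokenize_document(
    text: str,
    split_sentences: Callable[[str], List[str]],
    normalize: Callable[[str], str],
    tokenize_words: Callable[[str], List[str]],
) -> List[List[str]]:
    """Hypothetical helper: return one list of word tokens per sentence."""
    # Normalize each sentence before handing it to the word tokenizer.
    return [tokenize_words(normalize(sentence)) for sentence in split_sentences(text)]


# Stand-in callables so the sketch is self-contained; they only split on ASCII
# punctuation and whitespace and do not reflect how Urduhack tokenizes Urdu.
demo = tokenize_document("a b. c d", lambda t: t.split(". "), str.strip, str.split)

# With Urduhack installed, the callables named in this section would be passed instead:
#   from urduhack.normalization import normalize
#   from urduhack.tokenization import sentence_tokenizer, word_tokenizer
#   tokens = tokenize_document(text, sentence_tokenizer, normalize, word_tokenizer)
```

Passing the steps as arguments is a design choice for illustration only; in ordinary use the three library functions can simply be called in sequence.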