Urduhack¶
Urduhack is an NLP library for the Urdu language. It comes with many batteries-included features to help you process Urdu data in the easiest way possible.
Urduhack has different modules, each serving a specific purpose. You can load any of them and check their results on your own input. Normalization, Tokenization and Preprocess are the main modules of Urduhack.
Our Goal¶
- Academic users: Easier experimentation to prove their hypotheses without coding from scratch.
- NLP beginners: Learn how to build an NLP project with production-level code quality.
- NLP developers: Build a production-level application within minutes.
Urduhack is maintained by Ikram Ali and Contributors.
Installation¶
Notes¶
Note
Urduhack is supported on the following Python versions:
Python | 3.8 | 3.7 | 3.6 | 3.4 | 2.7
Urduhack | Yes | Yes | Yes | No | No
Basic Installation¶
Note
Urduhack is developed using TensorFlow. It needs the TensorFlow CPU build for prediction; for development and training of the models it uses TensorFlow GPU. The following instructions will install TensorFlow.
The easiest way to install urduhack is with pip.
Installing with the TensorFlow CPU version:
$ pip install Urduhack[tf]
Installing with the TensorFlow GPU version:
$ pip install Urduhack[tf-gpu]
Package Dependencies¶
Offering so much functionality, urduhack depends on a number of other packages. To avoid conflicts, it is preferred that you create a virtual environment and install urduhack in that environment.
- Tensorflow > 2.0.0: Used for training, evaluating and testing deep neural network models.
- transformers: Used for the BERT implementation for training and evaluation.
- tensorflow-datasets: Used to download and prepare datasets and read them into a model using the tf.data.Dataset API.
- Click: The Urduhack command-line application was developed with the help of this library.
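For example, to set up an isolated environment before installing (a minimal sketch using the standard venv workflow; the environment name is arbitrary):
$ python -m venv urduhack-env
$ source urduhack-env/bin/activate
$ pip install Urduhack[tf]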
Quickstart¶
Every Python package needs an import statement, so let's do that first:
>>> import urduhack
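From here you can call the top-level helpers directly. For example, the normalize() function (documented in the reference below) cleans a piece of Urdu text in one call; this sketch reuses the example from the reference:
>>> from urduhack import normalize
>>> text = "اَباُوگل پاکستان ﻤﯿﮟ20سال ﺳﮯ ، وسائل کی کوئی کمی نہیں ﮨﮯ۔"
>>> normalize(text)
اباوگل پاکستان ﻤﯿﮟ 20 سال ﺳﮯ ، وسائل کی کوئی کمی نہیں ﮨﮯ ۔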
Overview¶
Normalization¶
The normalization of Urdu text is necessary to make it useful for machine learning tasks. In the normalization module, the very basic problems faced when working with Urdu data are handled with ease and efficiency. All the problems, and how the normalization module handles them, are listed below.
This module fixes the problem of correct encodings for Urdu characters and replaces Arabic characters with the correct Urdu characters. It brings all characters into the Unicode range (0600-06FF) specified for the Urdu language.
It also fixes the problem of the joining of different Urdu words: when the space between two Urdu words is removed, they must not merge into a new word. Their rendering must not change; even after the removal of the space they should look the same.
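For instance, character normalization (covered in the reference section below) maps Arabic code points onto their Urdu equivalents; this sketch reuses the reference example:
>>> from urduhack.normalization import normalize_characters
>>> text = "مجھ کو جو توڑا ﮔیا تھا"
>>> normalize_characters(text)
مجھ کو جو توڑا گیا تھا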
Tokenization¶
This module is another crucial part of Urduhack. It performs tokenization on text: it separates sentences from each other and converts each string into a complete sentence token. Note that sentence tokens and word tokens are two completely different things.
This library provides a state-of-the-art word tokenizer for the Urdu language. It takes care of spaces and of where to connect two Urdu characters and where not to.
The tokenization of Urdu text is necessary to make it useful for NLP tasks. This module provides the following functionality:
- Sentence Tokenization
- Word Tokenization
In the tokenization module, we solved the problems related to sentence and word tokenization, as the sketch below shows.
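A minimal sketch of both tokenizers, reusing the sample sentence from the reference section below:
>>> from urduhack.tokenization import sentence_tokenizer, word_tokenizer
>>> text = "عراق اور شام نے اعلان کیا ہے دونوں ممالک جلد اپنے اپنے سفیروں کو واپس بغداد اور دمشق بھیج دیں گے؟"
>>> sentences = sentence_tokenizer(text)  # list of sentence strings
>>> tokens = word_tokenizer(sentences[0])  # list of word tokens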
Tutorial¶
CoNLL-U Format¶
We aspire to maintain data for all the tasks in CoNLL-U format. The CoNLL-U format holds sentence-level and token-level data along with their attributes. Below we will show how to use urduhack's CoNLL module.
>>> from urduhack import CoNLL
To iterate over sentences in CoNLL-U format we will use the iter_string() function.
>>> from urduhack.conll.tests.test_parser import CONLL_SENTENCE
It will yield a sentence in proper CoNLL-U format from which we can extract sentence level and token level attributes.
>>> for sentence in CoNLL.iter_string(CONLL_SENTENCE):
...     sent_meta, tokens = sentence
...     print(f"Sentence ID: {sent_meta['sent_id']}")
...     print(f"Sentence Text: {sent_meta['text']}")
...     for token in tokens:
...         print(token)
Sentence ID: test-s13
Sentence Text: والدین معمولی زخمی ہوئے ہےں۔
{'id': '1', 'text': 'والدین', 'lemma': 'والدین', 'upos': 'NOUN', 'xpos': 'NN', 'feats': 'Case=Acc|Gender=Masc|Number=Sing|Person=3', 'head': '4', 'deprel': 'nsubj', 'deps': '_', 'misc': 'Vib=0|Tam=0|ChunkId=NP|ChunkType=head'}
{'id': '2', 'text': 'معمولی', 'lemma': 'معمولی', 'upos': 'ADJ', 'xpos': 'JJ', 'feats': 'Case=Nom', 'head': '3', 'deprel': 'advmod', 'deps': '_', 'misc': 'ChunkId=JJP|ChunkType=head'}
{'id': '3', 'text': 'زخمی', 'lemma': 'زخمی', 'upos': 'ADJ', 'xpos': 'JJ', 'feats': 'Case=Nom|Gender=Masc|Number=Sing|Person=3', 'head': '4', 'deprel': 'compound', 'deps': '_', 'misc': 'Vib=0|Tam=0|ChunkId=JJP2|ChunkType=head'}
{'id': '4', 'text': 'ہوئے', 'lemma': 'ہو', 'upos': 'VERB', 'xpos': 'VM', 'feats': 'Aspect=Perf|Number=Plur|Person=2|Polite=Form|VerbForm=Part|Voice=Act', 'head': '0', 'deprel': 'root', 'deps': '_', 'misc': 'Vib=یا|Tam=yA|ChunkId=VGF|ChunkType=head|Stype=declarative'}
{'id': '5', 'text': 'ہےں', 'lemma': 'ہے', 'upos': 'AUX', 'xpos': 'VAUX', 'feats': 'Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin', 'head': '4', 'deprel': 'aux', 'deps': '_', 'misc': 'SpaceAfter=No|Vib=ہے|Tam=hE|ChunkId=VGF|ChunkType=child'}
{'id': '6', 'text': '۔', 'lemma': '۔', 'upos': 'PUNCT', 'xpos': 'SYM', 'feats': '_', 'head': '4', 'deprel': 'punct', 'deps': '_', 'misc': 'ChunkId=VGF|ChunkType=child'}
To load a file in CoNLL-U format, we will use the urduhack.CoNLL.load_file() function.
>>> sentences = CoNLL.load_file("urdu_text.conll")
>>> for sentence in sentences:
...     sent_meta, tokens = sentence
...     print(f"Sentence ID: {sent_meta['sent_id']}")
...     print(f"Sentence Text: {sent_meta['text']}")
...     for token in tokens:
...         print(token)
Sentence ID: test-s13
Sentence Text: والدین معمولی زخمی ہوئے ہےں۔
{'id': '1', 'text': 'والدین', 'lemma': 'والدین', 'upos': 'NOUN', 'xpos': 'NN', 'feats': 'Case=Acc|Gender=Masc|Number=Sing|Person=3', 'head': '4', 'deprel': 'nsubj', 'deps': '_', 'misc': 'Vib=0|Tam=0|ChunkId=NP|ChunkType=head'}
{'id': '2', 'text': 'معمولی', 'lemma': 'معمولی', 'upos': 'ADJ', 'xpos': 'JJ', 'feats': 'Case=Nom', 'head': '3', 'deprel': 'advmod', 'deps': '_', 'misc': 'ChunkId=JJP|ChunkType=head'}
{'id': '3', 'text': 'زخمی', 'lemma': 'زخمی', 'upos': 'ADJ', 'xpos': 'JJ', 'feats': 'Case=Nom|Gender=Masc|Number=Sing|Person=3', 'head': '4', 'deprel': 'compound', 'deps': '_', 'misc': 'Vib=0|Tam=0|ChunkId=JJP2|ChunkType=head'}
{'id': '4', 'text': 'ہوئے', 'lemma': 'ہو', 'upos': 'VERB', 'xpos': 'VM', 'feats': 'Aspect=Perf|Number=Plur|Person=2|Polite=Form|VerbForm=Part|Voice=Act', 'head': '0', 'deprel': 'root', 'deps': '_', 'misc': 'Vib=یا|Tam=yA|ChunkId=VGF|ChunkType=head|Stype=declarative'}
{'id': '5', 'text': 'ہےں', 'lemma': 'ہے', 'upos': 'AUX', 'xpos': 'VAUX', 'feats': 'Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin', 'head': '4', 'deprel': 'aux', 'deps': '_', 'misc': 'SpaceAfter=No|Vib=ہے|Tam=hE|ChunkId=VGF|ChunkType=child'}
{'id': '6', 'text': '۔', 'lemma': '۔', 'upos': 'PUNCT', 'xpos': 'SYM', 'feats': '_', 'head': '4', 'deprel': 'punct', 'deps': '_', 'misc': 'ChunkId=VGF|ChunkType=child'}
Pipeline Module¶
Pipeline is a special module in urduhack. Its importance can be seen from the fact that it performs operations at the Document, Sentence and Token levels. We can convert a document into sentences and a sentence into tokens in one go using the pipeline module. After that we can run models or any other operation at the document, sentence and token levels. We will now go through these steps one by one.
Document¶
We can get the document using the Pipeline module.
>>> from urduhack import Pipeline
>>> nlp = Pipeline()
>>> text = """
گزشتہ ایک روز کے دوران کورونا کے سبب 118 اموات ہوئیں جس کے بعد اموات کا مجموعہ 3 ہزار 93 ہوگیا ہے۔
سب سے زیادہ اموات بھی پنجاب میں ہوئی ہیں جہاں ایک ہزار 202 افراد جان کی بازی ہار چکے ہیں۔
سندھ میں 916، خیبر پختونخوا میں 755، اسلام آباد میں 94، گلگت بلتستان میں 18، بلوچستان میں 93 اور آزاد کشمیر میں 15 افراد کورونا وائرس سے جاں بحق ہو چکے ہیں۔
"""
>>> doc = nlp(text)
>>> print(doc.text)
گزشتہ ایک روز کے دوران کورونا کے سبب 118 اموات ہوئیں جس کے بعد اموات کا مجموعہ 3 ہزار 93 ہوگیا ہے۔
سب سے زیادہ اموات بھی پنجاب میں ہوئی ہیں جہاں ایک ہزار 202 افراد جان کی بازی ہار چکے ہیں۔
سندھ میں 916، خیبر پختونخوا میں 755، اسلام آباد میں 94، گلگت بلتستان میں 18، بلوچستان میں 93 اور آزاد کشمیر میں 15 افراد کورونا وائرس سے جاں بحق ہو چکے ہیں۔
Sentence¶
Now, to get the sentences from the Document:
>>> for sentence in doc.sentences:
...     print(sentence.text)
گزشتہ ایک روز کے دوران کورونا کے سبب 118 اموات ہوئیں جس کے بعد اموات کا مجموعہ 3 ہزار 93 ہوگیا ہے
سب سے زیادہ اموات بھی پنجاب میں ہوئی ہیں
جہاں ایک ہزار 202 افراد جان کی بازی ہار چکے ہیں۔
سندھ میں 916، خیبر پختونخوا میں 755، اسلام آباد میں 94، گلگت بلتستان میں 18، بلوچستان میں 93 اور آزاد کشمیر میں 15 افراد کورونا وائرس سے جاں بحق ہو چکے ہی
گزشتہ ایک روز کے دوران کورونا کے سبب 118 اموات ہوئیں جس کے بعد اموات کا مجموعہ 3 ہزار 93 ہوگیا ہے
Word¶
To get the words from a sentence:
>>> for word in sentence.words:
...     print(word.text)
گزشتہ
ایک
روز
کے
دوران
کورونا
کے
سبب
118
اموات
ہوئیںجس
کے
بعد
اموات
کا
مجموعہ
3
ہزار
93
ہو
گیا
ہے۔
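Putting the pieces together, a minimal sketch (using only the Pipeline API demonstrated above) that walks a document down to its words in a single pass:
>>> from urduhack import Pipeline
>>> nlp = Pipeline()
>>> doc = nlp(text)
>>> for sentence in doc.sentences:
...     print(sentence.text)
...     for word in sentence.words:
...         print(word.text)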
Reference¶
CoNLL-U Format¶
This module reads and parses data in the standard CoNLL-U format as provided in Universal Dependencies. CoNLL-U is a standard format for annotating data at the sentence level and at the word/token level. Annotations in CoNLL-U format fulfil the following points:
- Word lines contain the annotations of a word/token in 10 fields separated by single tab characters
- Blank lines mark sentence boundaries
- Comment lines start with a hash (#)
Each word/token has 10 fields defined in the CoNLL-U format. Each field represents a different attribute of the token, as detailed below:
Fields¶
1. ID:
- ID represents the word/token index in the sentence
2. FORM:
- Word/token form or punctuation symbol used in the sentence
3. LEMMA:
- Root/stem of the word
4. UPOS:
- Universal Part-of-Speech tag
5. XPOS:
- Language-specific part-of-speech tag; underscore if not available
6. FEATS:
- List of morphological features from the universal features inventory or from a defined language specific extension
7. HEAD:
- Head of the current word, which is either a value of ID or zero.
8. DEPREL:
- Universal dependencies relation to the HEAD (root if HEAD=0) or a defined language specific subtype of one.
9. DEPS:
- Enhanced dependency graph in the form of a list of head-deprel pairs
10. MISC:
- Any other annotation apart from the above mentioned fields
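For illustration, here is what the first two word lines of the tutorial sentence above look like as raw CoNLL-U text (reconstructed from the token dicts shown earlier; fields are tab-separated):
# sent_id = test-s13
# text = والدین معمولی زخمی ہوئے ہےں۔
1	والدین	والدین	NOUN	NN	Case=Acc|Gender=Masc|Number=Sing|Person=3	4	nsubj	_	Vib=0|Tam=0|ChunkId=NP|ChunkType=head
2	معمولی	معمولی	ADJ	JJ	Case=Nom	3	advmod	_	ChunkId=JJP|ChunkType=head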
class urduhack.conll.CoNLL [source]¶
A Conll class to easily load CoNLL-U formats. This class can also load resources by iterating over a string. It is the main entrance to the CoNLL functionality.
static get_fields() → List[str] [source]¶
Get the list of CoNLL fields.
Returns: list of CoNLL fields
Return type: List[str]
static iter_file(file_name: str) → Iterator[Tuple] [source]¶
Iterate over a CoNLL-U file's sentences.
Parameters: file_name (str) – The name of the file whose sentences should be iterated over.
Yields: Iterator[Tuple] – The sentences that make up the CoNLL-U file.
Raises: IOError – If there is an error opening the file. ParseError – If there is an error parsing the input into a Conll object.
static iter_string(text: str) → Iterator[Tuple] [source]¶
Iterate over a CoNLL-U string's sentences.
Use this method if you only need to iterate over the CoNLL-U file once and do not need to create or store the Conll object.
Parameters: text (str) – The CoNLL-U string.
Yields: Iterator[Tuple] – The sentences that make up the CoNLL-U file.
Raises: ParseError – If there is an error parsing the input into a Conll object.
static load_file(file_name: str) → List[Tuple] [source]¶
Load a CoNLL-U file given its location.
Parameters: file_name (str) – The location of the file.
Returns: A Conll object equivalent to the provided file.
Return type: List[Tuple]
Raises: IOError – If there is an error opening the given filename. ValueError – If there is an error parsing the input into a Conll object.
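A minimal usage sketch for iter_file(), mirroring the load_file() tutorial above and reusing the same sample file name:
>>> from urduhack import CoNLL
>>> for sent_meta, tokens in CoNLL.iter_file("urdu_text.conll"):
...     print(sent_meta['sent_id'], len(tokens))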
Normalization¶
The normalization of Urdu text is necessary to make it useful for machine learning tasks. In the normalize module, the very basic problems faced when working with Urdu data are handled with ease and efficiency. All the problems, and how the normalize module handles them, are listed below.
You can use the library to normalize Urdu text to the correct Unicode characters. By normalization we mean ending the confusion between Urdu and Arabic characters, replacing two characters with one while keeping in mind the context they are used in. For example, the characters 'ﺁ' and 'ﺂ' are replaced by 'آ'. All this is done using regular expressions.
The normalization of Urdu text is necessary to make it useful for the machine learning tasks. This module provides the following functionality:
- Normalizing Single Characters
- Normalizing Combine Characters
- Put Spaces Before & After Digits
- Put Spaces After Urdu Punctuations
- Put Spaces Before & After English Words
- Removal of Diacritics from Urdu Text
urduhack.normalization.normalize_characters(text: str) → str [source]¶
The most important module in Urduhack is the character module, defined in the module with the same name. You can use it separately to normalize a piece of text to the proper specified Urdu range (0600-06FF). To understand how this module works, one needs to understand Unicode: every character has a unique code point, the code point of any character from any language can be looked up, and no two characters share the same one. This module works with reference to these code points. Since the Urdu language has its roots in Arabic, Persian and Turkish, we have to deal with characters from all those languages and convert them to the normal Urdu character. To make the above explanation concrete:
>>> all_fes = ['ﻑ', 'ﻒ', 'ﻓ', 'ﻔ']
>>> urdu_fe = 'ف'
All the characters in all_fes look the same but come from different languages, and each has a different code point. Since computers deal with numbers, the same character appearing in a different language carries a different code point, which creates confusion and makes it harder to understand the context of the data. The character module eliminates this problem by replacing every character in all_fes with urdu_fe. It provides the functionality to replace wrong Arabic characters with correct Urdu characters and fixes the combine/join characters issue.
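You can inspect the distinct code points yourself; a minimal sketch (the hex values are the standard Unicode assignments for the Arabic Presentation Forms of FEH and the base FEH letter):
>>> [hex(ord(ch)) for ch in all_fes]
['0xfed1', '0xfed2', '0xfed3', '0xfed4']
>>> hex(ord(urdu_fe))
'0x641'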
Replaces Urdu text characters with correct Unicode characters.
Parameters: text (str) – Urdu text
Returns: Returns a str object containing normalized text.
Return type: str
Examples:
>>> from urduhack.normalization import normalize_characters
>>> # Text containing characters from the Arabic Unicode block
>>> text = "مجھ کو جو توڑا ﮔیا تھا"
>>> normalized_text = normalize_characters(text)
>>> # Normalized text - Arabic characters are now replaced with Urdu characters
>>> normalized_text
مجھ کو جو توڑا گیا تھا
urduhack.normalization.normalize_combine_characters(text: str) → str [source]¶
To normalize combined characters with single-character Unicode text, use the normalize_combine_characters() function in the character module. Replaces combine/join Urdu characters with a single Unicode character.
Parameters: text (str) – Urdu text
Returns: Returns a str object containing normalized text.
Return type: str
Examples:
>>> from urduhack.normalization import normalize_combine_characters
>>> # In the following string, Alif ('ا') and Hamza ('ٔ ') are separate characters
>>> text = "جرأت"
>>> normalized_text = normalize_combine_characters(text)
>>> # Now Alif and Hamza are replaced by a single Urdu Unicode character!
>>> normalized_text
جرأت
urduhack.normalization.english_characters_space(text: str) → str [source]¶
Adds spaces before and after English words in the given Urdu text. It is an important step in the normalization of Urdu data. This function returns a str object containing the original text with spaces before and after English words.
Parameters: text (str) – Urdu text
Returns: Returns a str object containing normalized text.
Return type: str
Examples:
>>> from urduhack.normalization import english_characters_space
>>> text = "خاتون Aliyaنے بچوںUzma and Aliyaکے قتل کا اعترافConfession کیا ہے۔"
>>> normalized_text = english_characters_space(text)
>>> normalized_text
خاتون Aliya نے بچوں Uzma and Aliya کے قتل کا اعتراف Confession کیا ہے۔
urduhack.normalization.punctuations_space(text: str) → str [source]¶
Adds spaces after punctuation marks used in Urdu writing.
Parameters: text (str) – Urdu text
Returns: Returns a str object containing normalized text.
Return type: str
Examples:
>>> from urduhack.normalization import punctuations_space
>>> text = "ہوتا ہے ۔ ٹائپ"
>>> normalized_text = punctuations_space(text)
>>> normalized_text
ہوتا ہے۔ ٹائپ
urduhack.normalization.digits_space(text: str) → str [source]¶
Adds spaces before and after numeric and Urdu digits.
Parameters: text (str) – Urdu text
Returns: Returns a str object containing normalized text.
Return type: str
Examples:
>>> from urduhack.normalization import digits_space
>>> text = "20فیصد"
>>> normalized_text = digits_space(text)
>>> normalized_text
20 فیصد
urduhack.normalization.remove_diacritics(text: str) → str [source]¶
Removes Urdu diacritics from text. It is an important step in the pre-processing of Urdu data. This function returns a str object which contains the original text minus the Urdu diacritics.
Parameters: text (str) – Urdu text
Returns: Returns a str object containing normalized text.
Return type: str
Examples:
>>> from urduhack.normalization import remove_diacritics
>>> text = "شیرِ پنجاب"
>>> normalized_text = remove_diacritics(text)
>>> normalized_text
شیر پنجاب
urduhack.normalization.normalize(text: str) → str [source]¶
To normalize some text, all you need to do is pass Unicode text. It will return a str with normalized characters, both single and combined, proper spaces after digits and punctuation, and diacritics removed.
Parameters: text (str) – Urdu text
Returns: Normalized Urdu text
Return type: str
Raises: TypeError – If the text param is not of str type.
Examples:
>>> from urduhack import normalize
>>> text = "اَباُوگل پاکستان ﻤﯿﮟ20سال ﺳﮯ ، وسائل کی کوئی کمی نہیں ﮨﮯ۔"
>>> normalized_text = normalize(text)
>>> # The text now contains proper spaces after digits and punctuation,
>>> # normalized characters and no diacritics!
>>> normalized_text
اباوگل پاکستان ﻤﯿﮟ 20 سال ﺳﮯ ، وسائل کی کوئی کمی نہیں ﮨﮯ ۔
Tokenization¶
urduhack.tokenization.sentence_tokenizer(text: str) → List[str] [source]¶
Converts Urdu text into possible sentences. If successful, this function returns a List object containing multiple Urdu str sentences.
Parameters: text (str) – Urdu text
Returns: Returns a list object containing multiple Urdu sentences of type str.
Return type: list
Raises: TypeError – If text is not of str type.
Examples:
>>> from urduhack.tokenization import sentence_tokenizer
>>> text = "عراق اور شام نے اعلان کیا ہے دونوں ممالک جلد اپنے اپنے سفیروں کو واپس بغداد اور دمشق بھیج دیں گے؟"
>>> sentences = sentence_tokenizer(text)
>>> sentences
["دونوں ممالک جلد اپنے اپنے سفیروں کو واپس بغداد اور دمشق بھیج دیں گے؟" ,"عراق اور شام نے اعلان کیا ہے۔"]
urduhack.tokenization.word_tokenizer(sentence: str, max_len: int = 256) → List[str] [source]¶
To convert raw Urdu text into tokens, we need to use the word_tokenizer() function. Before doing this we need to normalize our sentence as well; to normalize the Urdu sentence, use the urduhack.normalization.normalize() function. If word_tokenizer runs successfully, it returns a List object containing Urdu str word tokens.
Parameters: sentence (str), max_len (int)
Returns: Returns a List[str] containing Urdu tokens.
Return type: List[str]
Examples:
>>> sent = 'عراق اور شام نے اعلان کیا ہے دونوں ممالک جلد اپنے اپنے سفیروں کو واپس بغداد اور دمشق بھیج دیں گے؟'
>>> from urduhack.tokenization import word_tokenizer
>>> word_tokenizer(sent)
Tokens: ['عراق', 'اور', 'شام', 'نے', 'اعلان', 'کیا', 'ہے', 'دونوں', 'ممالک' , 'جلد', 'اپنے', 'اپنے', 'سفیروں', 'کو', 'واپس', 'بغداد', 'اور', 'دمشق', 'بھیج', 'دیں', 'گے؟']
Text PreProcessing¶
The pre-processing of Urdu text is necessary to make it useful for machine learning tasks. This module provides the following functionality:
- Normalize whitespace
- Replace URLs
- Replace emails
- Replace numbers
- Replace phone numbers
- Replace currency symbols
You can look up all the different functions that come with the preprocessing module in the reference below.
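Since each helper returns a plain str, the steps compose. A minimal sketch chaining two of the functions documented below (output assumed from the individual examples):
>>> from urduhack.preprocessing import replace_urls, normalize_whitespace
>>> text = "20 www.gmail.com فیصد"
>>> normalize_whitespace(replace_urls(text))
'20 فیصد'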
urduhack.preprocessing.normalize_whitespace(text: str) [source]¶
Given a text str, replaces one or more spacings with a single space, and one or more line breaks with a single newline. Also strips leading/trailing whitespace.
Parameters: text (str) – Urdu text
Returns: Returns a str object containing normalized text.
Return type: str
Examples:
>>> from urduhack.preprocessing import normalize_whitespace
>>> text = "عراق اور شام اعلان کیا ہے دونوں جلد اپنے گے؟"
>>> normalized_text = normalize_whitespace(text)
>>> normalized_text
عراق اور شام اعلان کیا ہے دونوں جلد اپنے گے؟
urduhack.preprocessing.remove_punctuation(text: str, marks=None) → str [source]¶
Removes punctuation from text by removing all instances of marks.
Parameters: text (str), marks (optional)
Returns: Returns a str object containing normalized text.
Return type: str
Note: When marks=None, Python's built-in str.translate() is used to remove punctuation; otherwise, a regular expression is used instead. The former's performance is about 5-10x faster.
Examples:
>>> from urduhack.preprocessing import remove_punctuation
>>> output = remove_punctuation("کر ؟ سکتی ہے۔")
کر سکتی ہے
urduhack.preprocessing.remove_accents(text: str) → str [source]¶
Removes accents from any accented Unicode characters in text, either by transforming them into ASCII equivalents or removing them entirely.
Parameters: text (str) – Urdu text
Returns: str
Examples:
>>> from urduhack.preprocessing import remove_accents
>>> text = "دالتِ عظمیٰ درخواست"
>>> remove_accents(text)
'دالت عظمی درخواست'
urduhack.preprocessing.replace_urls(text: str, replace_with='') [source]¶
Replaces all URLs in a text str with the replace_with str.
Parameters: text (str), replace_with (str)
Returns: Returns a str object with URLs replaced by the replace_with text.
Return type: str
Examples:
>>> from urduhack.preprocessing import replace_urls
>>> text = "20 www.gmail.com فیصد"
>>> replace_urls(text)
'20 فیصد'
urduhack.preprocessing.replace_emails(text: str, replace_with='') [source]¶
Replaces all emails in a text str with the replace_with str.
Parameters: text (str), replace_with (str)
Returns: Returns a str object with emails replaced by the replace_with text.
Return type: str
Examples:
>>> text = "20 gunner@gmail.com فیصد"
>>> from urduhack.preprocessing import replace_emails
>>> replace_emails(text)
urduhack.preprocessing.replace_numbers(text: str, replace_with='') [source]¶
Replaces all numbers in a text str with the replace_with str.
Parameters: text (str), replace_with (str)
Returns: Returns a str object with numbers replaced by the replace_with text.
Return type: str
Examples:
>>> from urduhack.preprocessing import replace_numbers
>>> text = "20 فیصد"
>>> replace_numbers(text)
' فیصد'
urduhack.preprocessing.replace_phone_numbers(text: str, replace_with='') [source]¶
Replaces all phone numbers in a text str with the replace_with str.
Parameters: text (str), replace_with (str)
Returns: Returns a str object with phone numbers replaced by the replace_with text.
Return type: str
Examples:
>>> from urduhack.preprocessing import replace_phone_numbers
>>> text = "یعنی لائن آف کنٹرول پر فائربندی کا معاہدہ 555-123-4567 میں ہوا تھا"
>>> replace_phone_numbers(text)
'یعنی لائن آف کنٹرول پر فائربندی کا معاہدہ میں ہوا تھا'
urduhack.preprocessing.replace_currency_symbols(text: str, replace_with=None) [source]¶
Replaces all currency symbols in a text str with the string specified by replace_with.
Parameters: text (str), replace_with (str)
Returns: Returns a str object containing normalized text.
Return type: str
Examples:
>>> from urduhack.preprocessing import replace_currency_symbols
>>> text = "یعنی لائن آف کنٹرول پر فائربندی کا معاہدہ 2003 میں ہوا 33$ تھا۔"
>>> replace_currency_symbols(text)
'یعنی لائن آف کنٹرول پر فائربندی کا معاہدہ 2003 میں ہوا 33USD تھا۔'
Utils¶
Utils module¶
Collection of helper functions.
urduhack.utils.pickle_load(file_name: str) → Any [source]¶
Loads a pickle file.
Parameters: file_name (str) – file name
Returns: Python object
Return type: Any
urduhack.utils.pickle_dump(file_name: str, data: Any) [source]¶
Saves a Python object in pickle format.
Parameters:
- file_name (str) – file name
- data (Any) – any data type
urduhack.utils.download_from_url(file_name: str, url: str, download_dir: str, cache_dir: Optional[str] = None) [source]¶
Downloads anything from an HTTP URL.
Parameters: file_name (str), url (str), download_dir (str), cache_dir (Optional[str])
Raises: TypeError – If any of url, file_path and file_name is not of str type.
urduhack.utils.remove_file(file_name: str) [source]¶
Deletes a local file.
Parameters: file_name (str) – File to be deleted
Raises: TypeError – If file_name is not of str type. FileNotFoundError – If file_name does not exist.
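A minimal usage sketch for the pickle helpers and remove_file(); the file name and data here are hypothetical:
>>> from urduhack.utils import pickle_dump, pickle_load, remove_file
>>> pickle_dump("tokens.pkl", ["عراق", "اور", "شام"])
>>> pickle_load("tokens.pkl")
['عراق', 'اور', 'شام']
>>> remove_file("tokens.pkl")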
About¶
Authors¶
Ikram Ali (Core contributor)¶
A machine learning practitioner and an avid learner with professional experience in managing Python, PHP and JavaScript projects, and excellent machine learning / deep learning skills.
- Personal Web: https://akkefa.com/
- Github: https://github.com/akkefa
- LinkedIn: https://www.linkedin.com/in/akkefa/
Drop me a line at mrikram1989@gmail.com or call me at 92 3320 453648.
Goals¶
The author's goal is to foster and support active development of the urduhack library through:
- Continuous integration testing via Travis CI
- Publicized development activity on GitHub
- Regular releases to the Python Package Index
License¶
Urduhack is licensed under MIT License.
If you want to support the urduhack library, please report issues here.
Release Notes¶
Note
Contributors, please include release notes as needed or appropriate with your bug fixes, feature additions and tests.
0.2.2¶
Changes:
- Word tokenizer
- Urdu word tokenization functionality added. To convert a normalized Urdu sentence into possible word tokens, we need to use the urduhack.tokenization.word_tokenizer function.
0.1.0¶
Changes:
- Normalize function
- Single function added to do all the normalization work. To normalize some text, all you need to do is import the urduhack.normalize function; it will return a string with normalized characters, both single and combined, proper spaces after digits and punctuation, and diacritics removed.
- Sentence Tokenizer
- Urdu sentence tokenization functionality added. To convert raw Urdu text into possible sentences, we need to use the urduhack.tokenization.sentence_tokenizer function.
Bug fixes:
- Fixed bugs in remove_diacritics()
0.0.2¶
Changes:
- Character Level Normalization
- The urduhack.normalization.character module provides the functionality to replace wrong Arabic characters with correct Urdu characters.
- Space Normalization
- The urduhack.normalization.space.util module provides functionality to put proper spaces before and after numeric digits, Urdu digits and punctuation in Urdu text.
- Diacritics Removal
- The urduhack.utils.text.remove_diacritics module provides the functionality to remove Urdu diacritics from text. It is an important step in the pre-processing of Urdu data.
0.0.1¶
Changes:
- Urdu character normalization API added.
- Urdu space normalization utility functions added.
- Correct Unicode ranges for Urdu characters added.
Deprecations and removals¶
This page lists urduhack features that are deprecated, or have been removed in past major releases, and gives the alternatives to use instead.