Urduhack

Urduhack is an NLP library for the Urdu language. It comes with many batteries-included features to help you process Urdu data as easily as possible.


Urduhack is organized into modules, each serving a specific purpose. You can load any of them and inspect the results on your own inputs. Normalization, Tokenization, and Preprocess are the main modules of Urduhack.

Our Goal

  • Academic users: easier experimentation to test their hypotheses without coding from scratch.
  • NLP beginners: learn how to build an NLP project with production-level code quality.
  • NLP developers: build a production-level application within minutes.

Urduhack is maintained by Ikram Ali and Contributors.

Installation

Notes

Note

Urduhack is supported on the following Python versions:

Python    3.8   3.7   3.6   3.4   2.7
Urduhack  Yes   Yes   Yes   No    No

Basic Installation

Note

Urduhack is developed using Tensorflow. It needs the Tensorflow CPU build for prediction; for development and for training the models it uses Tensorflow-gpu. The following instructions will install Tensorflow along with Urduhack.

The easiest way to install urduhack is with pip.

Installing with the Tensorflow CPU version:

$ pip install urduhack[tf]

Installing with the Tensorflow GPU version:

$ pip install urduhack[tf-gpu]

Package Dependencies

Because it provides so much functionality, urduhack depends on a number of other packages. To avoid conflicts of any kind, it is preferable to create a virtual environment and install urduhack inside it.

  • Tensorflow > 2.0.0: used for training, evaluating, and testing deep neural network models.
  • transformers: used for the BERT implementation for training and evaluation.
  • tensorflow-datasets: used to download and prepare datasets and read them into a model using the tf.data.Dataset API.
  • Click: the Urduhack command-line application is built with this library.

Downloading Models

Pythonic Way

You can download the models from within Python:

import urduhack
urduhack.download()
Command line

To download the models, all you have to do is run this simple command on the command line:

$ urduhack download

This command will download the models which will be used by urduhack.

Quickstart

Every Python package needs an import statement, so let's do that first:

>>> import urduhack

Overview

Normalization

The normalization of Urdu text is necessary to make it useful for machine learning tasks. The normalization module handles the most basic problems encountered when working with Urdu data with ease and efficiency. The problems, and how the normalization module handles them, are listed below.

This module fixes incorrect encodings of Urdu characters and replaces Arabic characters with the correct Urdu ones. It brings all characters into the Unicode range specified for the Urdu language (0600-06FF).

It also fixes the problem of joining separate Urdu words. By joining we mean that when the space between two Urdu words is removed, they must not merge into a new word. Their rendering must not change; even after the removal of the space they should look the same.
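The character replacement described above can be sketched with a Unicode translation table. This is an illustrative sketch only; the mapping below is a small hand-picked sample, not Urduhack's actual table.

```python
# Illustrative sketch of character normalization via a Unicode mapping
# (this sample mapping is an assumption, not Urduhack's real table).
ARABIC_TO_URDU = {
    "\uFED1": "\u0641",  # ﻑ ARABIC LETTER FEH ISOLATED FORM -> ف
    "\uFED2": "\u0641",  # ﻒ final form
    "\uFED3": "\u0641",  # ﻓ initial form
    "\uFED4": "\u0641",  # ﻔ medial form
    "\u064A": "\u06CC",  # Arabic YEH -> Urdu YEH
    "\u0643": "\u06A9",  # Arabic KAF -> Urdu KEHEH
}
TABLE = str.maketrans(ARABIC_TO_URDU)

def normalize_chars(text: str) -> str:
    """Map presentation-form/Arabic characters onto the Urdu block (0600-06FF)."""
    return text.translate(TABLE)
```

Because `str.translate` works on code points, the visual rendering of the text is unchanged while every character ends up in the expected Urdu range.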

Tokenization

This module is another crucial part of Urduhack. It performs tokenization on text: the sentence tokenizer separates sentences from each other, converting each string into a complete sentence token. Do not confuse sentence tokens with word tokens; they are two completely different things.

This library provides a state-of-the-art word tokenizer for the Urdu language. It takes care of spaces and of where two Urdu characters should and should not be connected.

The tokenization of Urdu text is necessary to make it useful for NLP tasks. This module provides the following functionality:

  • Sentence Tokenization
  • Word Tokenization

In the tokenization module, we solved the problems related to sentence and word tokenization.
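To make the two levels of tokenization concrete, here is a deliberately naive sketch. Urduhack's real tokenizers are far more capable (the word tokenizer is model-based and repairs missing spaces); this sketch just splits on the Urdu full stop '۔' (U+06D4) and question mark '؟' (U+061F), and on whitespace.

```python
import re

# Naive sketch only -- not Urduhack's implementation. Splits sentences on
# the Urdu full stop '۔' (U+06D4) and question mark '؟' (U+061F).
def naive_sentence_tokenizer(text: str) -> list:
    parts = re.split(r"(?<=[\u06D4\u061F])\s*", text.strip())
    return [p for p in parts if p]

def naive_word_tokenizer(sentence: str) -> list:
    # Whitespace split; real word tokenization must also decide where two
    # Urdu characters should and should not be connected.
    return sentence.split()
```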

Tutorial

CoNLL-U Format

We aspire to maintain data for all tasks in the CoNLL-U format. CoNLL-U holds sentence-level and token-level data along with their attributes. Below we show how to use urduhack's CoNLL module.

>>> from urduhack import CoNLL

To iterate over sentences in CoNLL-U format we will use iter_string() function.

>>> from urduhack.conll.tests.test_parser import CONLL_SENTENCE

It will yield a sentence in proper CoNLL-U format from which we can extract sentence level and token level attributes.

>>> for sentence in CoNLL.iter_string(CONLL_SENTENCE):
        sent_meta, tokens = sentence
        print(f"Sentence ID: {sent_meta['sent_id']}")
        print(f"Sentence Text: {sent_meta['text']}")
        for token in tokens:
            print(token)
        Sentence ID: test-s13
        Sentence Text: والدین معمولی زخمی ہوئے ہےں۔
        {'id': '1', 'text': 'والدین', 'lemma': 'والدین', 'upos': 'NOUN', 'xpos': 'NN', 'feats': 'Case=Acc|Gender=Masc|Number=Sing|Person=3', 'head': '4', 'deprel': 'nsubj', 'deps': '_', 'misc': 'Vib=0|Tam=0|ChunkId=NP|ChunkType=head'}
        {'id': '2', 'text': 'معمولی', 'lemma': 'معمولی', 'upos': 'ADJ', 'xpos': 'JJ', 'feats': 'Case=Nom', 'head': '3', 'deprel': 'advmod', 'deps': '_', 'misc': 'ChunkId=JJP|ChunkType=head'}
        {'id': '3', 'text': 'زخمی', 'lemma': 'زخمی', 'upos': 'ADJ', 'xpos': 'JJ', 'feats': 'Case=Nom|Gender=Masc|Number=Sing|Person=3', 'head': '4', 'deprel': 'compound', 'deps': '_', 'misc': 'Vib=0|Tam=0|ChunkId=JJP2|ChunkType=head'}
        {'id': '4', 'text': 'ہوئے', 'lemma': 'ہو', 'upos': 'VERB', 'xpos': 'VM', 'feats': 'Aspect=Perf|Number=Plur|Person=2|Polite=Form|VerbForm=Part|Voice=Act', 'head': '0', 'deprel': 'root', 'deps': '_', 'misc': 'Vib=یا|Tam=yA|ChunkId=VGF|ChunkType=head|Stype=declarative'}
        {'id': '5', 'text': 'ہےں', 'lemma': 'ہے', 'upos': 'AUX', 'xpos': 'VAUX', 'feats': 'Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin', 'head': '4', 'deprel': 'aux', 'deps': '_', 'misc': 'SpaceAfter=No|Vib=ہے|Tam=hE|ChunkId=VGF|ChunkType=child'}
        {'id': '6', 'text': '۔', 'lemma': '۔', 'upos': 'PUNCT', 'xpos': 'SYM', 'feats': '_', 'head': '4', 'deprel': 'punct', 'deps': '_', 'misc': 'ChunkId=VGF|ChunkType=child'}

To load a file in CoNLL-U format, we use the urduhack.CoNLL.load_file() function.

>>> sentences = CoNLL.load_file("urdu_text.conll")
>>> for sentence in sentences:
        sent_meta, tokens = sentence
        print(f"Sentence ID: {sent_meta['sent_id']}")
        print(f"Sentence Text: {sent_meta['text']}")
        for token in tokens:
            print(token)
        Sentence ID: test-s13
        Sentence Text: والدین معمولی زخمی ہوئے ہےں۔
        {'id': '1', 'text': 'والدین', 'lemma': 'والدین', 'upos': 'NOUN', 'xpos': 'NN', 'feats': 'Case=Acc|Gender=Masc|Number=Sing|Person=3', 'head': '4', 'deprel': 'nsubj', 'deps': '_', 'misc': 'Vib=0|Tam=0|ChunkId=NP|ChunkType=head'}
        {'id': '2', 'text': 'معمولی', 'lemma': 'معمولی', 'upos': 'ADJ', 'xpos': 'JJ', 'feats': 'Case=Nom', 'head': '3', 'deprel': 'advmod', 'deps': '_', 'misc': 'ChunkId=JJP|ChunkType=head'}
        {'id': '3', 'text': 'زخمی', 'lemma': 'زخمی', 'upos': 'ADJ', 'xpos': 'JJ', 'feats': 'Case=Nom|Gender=Masc|Number=Sing|Person=3', 'head': '4', 'deprel': 'compound', 'deps': '_', 'misc': 'Vib=0|Tam=0|ChunkId=JJP2|ChunkType=head'}
        {'id': '4', 'text': 'ہوئے', 'lemma': 'ہو', 'upos': 'VERB', 'xpos': 'VM', 'feats': 'Aspect=Perf|Number=Plur|Person=2|Polite=Form|VerbForm=Part|Voice=Act', 'head': '0', 'deprel': 'root', 'deps': '_', 'misc': 'Vib=یا|Tam=yA|ChunkId=VGF|ChunkType=head|Stype=declarative'}
        {'id': '5', 'text': 'ہےں', 'lemma': 'ہے', 'upos': 'AUX', 'xpos': 'VAUX', 'feats': 'Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin', 'head': '4', 'deprel': 'aux', 'deps': '_', 'misc': 'SpaceAfter=No|Vib=ہے|Tam=hE|ChunkId=VGF|ChunkType=child'}
        {'id': '6', 'text': '۔', 'lemma': '۔', 'upos': 'PUNCT', 'xpos': 'SYM', 'feats': '_', 'head': '4', 'deprel': 'punct', 'deps': '_', 'misc': 'ChunkId=VGF|ChunkType=child'}
Pipeline Module

Pipeline is a special module in urduhack. Its importance lies in the fact that it performs operations at the document, sentence, and token levels. Using the pipeline module, we can split a document into sentences and a sentence into tokens in one go, and then run models or any other operation at each of those levels. We will now go through these steps one by one.
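The Document → Sentence → Word hierarchy the pipeline exposes can be pictured with a minimal data model. The class definitions below are hypothetical; only the attribute names (doc.sentences, sentence.words, .text) mirror what the tutorial uses.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical sketch of the Document -> Sentence -> Word hierarchy; the
# attribute names mirror the pipeline's (doc.sentences, sentence.words).
@dataclass
class Word:
    text: str

@dataclass
class Sentence:
    text: str
    words: List[Word] = field(default_factory=list)

@dataclass
class Document:
    text: str
    sentences: List[Sentence] = field(default_factory=list)

def build_document(text: str) -> Document:
    # Crude sentence split on the Urdu full stop, whitespace word split.
    sentences = []
    for sent in filter(None, (s.strip() for s in text.split("\u06D4"))):
        sentences.append(Sentence(text=sent, words=[Word(w) for w in sent.split()]))
    return Document(text=text, sentences=sentences)
```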

Document

We can get the document using pipeline module.

>>> from urduhack import Pipeline
>>> nlp = Pipeline()
>>> text = """
گزشتہ ایک روز کے دوران کورونا کے سبب 118 اموات ہوئیں جس کے بعد اموات کا مجموعہ 3 ہزار 93 ہوگیا ہے۔
سب سے زیادہ اموات بھی پنجاب میں ہوئی ہیں جہاں ایک ہزار 202 افراد جان کی بازی ہار چکے ہیں۔
سندھ میں 916، خیبر پختونخوا میں 755، اسلام آباد میں 94، گلگت بلتستان میں 18، بلوچستان میں 93 اور ا?زاد کشمیر میں 15 افراد کورونا وائرس سے جاں بحق ہو چکے ہیں۔
"""
>>> doc = nlp(text)
>>> print(doc.text)
گزشتہ ایک روز کے دوران کورونا کے سبب 118 اموات ہوئیں جس کے بعد اموات کا مجموعہ 3 ہزار 93 ہوگیا ہے۔
سب سے زیادہ اموات بھی پنجاب میں ہوئی ہیں جہاں ایک ہزار 202 افراد جان کی بازی ہار چکے ہیں۔
سندھ میں 916، خیبر پختونخوا میں 755، اسلام آباد میں 94، گلگت بلتستان میں 18، بلوچستان میں 93 اور ا?زاد کشمیر میں 15 افراد کورونا وائرس سے جاں بحق ہو چکے ہیں۔
Sentence

Now, to get the sentences from the document:

>>> for sentence in doc.sentences:
        print(sentence.text)
گزشتہ ایک روز کے دوران کورونا کے سبب 118 اموات ہوئیں جس کے بعد اموات کا مجموعہ 3 ہزار 93 ہوگیا ہے
سب سے زیادہ اموات بھی پنجاب میں ہوئی ہیں
جہاں ایک ہزار 202 افراد جان کی بازی ہار چکے ہیں۔
سندھ میں 916، خیبر پختونخوا میں 755، اسلام آباد میں 94، گلگت بلتستان میں 18، بلوچستان میں 93 اور ا?زاد کشمیر میں 15 افراد کورونا وائرس سے جاں بحق ہو چکے ہی
گزشتہ ایک روز کے دوران کورونا کے سبب 118 اموات ہوئیں جس کے بعد اموات کا مجموعہ 3 ہزار 93 ہوگیا ہے
Word

To get the words from a sentence:

>>> for word in sentence.words:
        print(word.text)
 گزشتہ
 ایک
 روز
 کے
 دوران
 کورونا
 کے
 سبب
 118
 اموات
 ہوئیںجس
 کے
 بعد
 اموات
 کا
 مجموعہ
 3
 ہزار
 93
 ہو
 گیا
 ہے۔

Reference

CoNLL-U Format

This module reads and parses data in the standard CoNLL-U format as defined by Universal Dependencies. CoNLL-U is a standard format for annotating data at the sentence level and at the word/token level. Annotations in the CoNLL-U format follow these rules:

  1. Word lines contain the annotations of a word/token in 10 fields separated by single tab characters
  2. Blank lines mark sentence boundaries
  3. Comment lines start with a hash (#)
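The three rules above are enough to sketch a minimal parser. This is a simplified illustration, not Urduhack's implementation; the field names follow the token dicts shown in the tutorial.

```python
# Minimal sketch of the three CoNLL-U rules: tab-separated 10-field word
# lines, blank lines as sentence boundaries, '#' comment lines.
FIELDS = ["id", "text", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

def parse_conllu(text: str):
    """Yield (metadata, tokens) pairs for each sentence in a CoNLL-U string."""
    meta, tokens = {}, []
    for line in text.splitlines():
        line = line.strip()
        if not line:                      # blank line: sentence boundary
            if tokens:
                yield meta, tokens
            meta, tokens = {}, []
        elif line.startswith("#"):        # comment line: sentence metadata
            if "=" in line:
                key, _, value = line[1:].partition("=")
                meta[key.strip()] = value.strip()
        else:                             # word line: 10 tab-separated fields
            tokens.append(dict(zip(FIELDS, line.split("\t"))))
    if tokens:
        yield meta, tokens
```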

Each word/token has 10 fields defined in the CoNLL-U format. Each field represents a different attribute of the token, as detailed below:

Fields
1. ID:
ID represents the word/token index in the sentence
2. FORM:
Word/token form or punctuation symbol used in the sentence
3. LEMMA:
Root/stem of the word
4. UPOS:
Universal Part-of-Speech tag
5. XPOS:
Language-specific part-of-speech tag; underscore if not available
6. FEATS:
List of morphological features from the universal features inventory or from a defined language specific extension
7. HEAD:
Head of the current word, which is either a value of ID or zero.
8. DEPREL:
Universal dependencies relation to the HEAD (root if HEAD=0) or a defined language specific subtype of one.
9. DEPS:
Enhanced dependency graph in the form of a list of head-deprel pairs
10. MISC:
Any other annotation apart from the above mentioned fields
class urduhack.conll.CoNLL[source]

A Conll class to easily load the CoNLL-U format. This module can also load resources by iterating over a string. It is the main entrance to conll's functionalities.

static get_fields() → List[str][source]

Get the list of conll fields

Returns:Return list of conll fields
Return type:List[str]
static iter_file(file_name: str) → Iterator[Tuple][source]

Iterate over a CoNLL-U file’s sentences.

Parameters:

file_name (str) – The name of the file whose sentences should be iterated over.

Yields:

Iterator[Tuple] – The sentences that make up the CoNLL-U file.

Raises:
  • IOError – If there is an error opening the file.
  • ParseError – If there is an error parsing the input into a Conll object.
static iter_string(text: str) → Iterator[Tuple][source]

Iterate over a CoNLL-U string’s sentences.

Use this method if you only need to iterate over the CoNLL-U file once and do not need to create or store the Conll object.

Parameters:text (str) – The CoNLL-U string.
Yields:Iterator[Tuple] – The sentences that make up the CoNLL-U file.
Raises:ParseError – If there is an error parsing the input into a Conll object.
static load_file(file_name: str) → List[Tuple][source]

Load a CoNLL-U file given its location.

Parameters:

file_name (str) – The location of the file.

Returns:

A Conll object equivalent to the provided file.

Return type:

List[Tuple]

Raises:
  • IOError – If there is an error opening the given filename.
  • ValueError – If there is an error parsing the input into a Conll object.

Normalization

The normalization of Urdu text is necessary to make it useful for machine learning tasks. The normalize module handles the most basic problems encountered when working with Urdu data with ease and efficiency. The problems, and how the normalize module handles them, are listed below.

This module fixes incorrect encodings of Urdu characters and replaces Arabic characters with the correct Urdu ones. It brings all characters into the Unicode range specified for the Urdu language (0600-06FF).

It also fixes the problem of joining separate Urdu words. By joining we mean that when the space between two Urdu words is removed, they must not merge into a new word. Their rendering must not change; even after the removal of the space they should look the same.

You can use the library to normalize Urdu text to the correct Unicode characters. By normalization we mean removing the confusion between Urdu and Arabic characters, replacing two characters with one while keeping in mind the context in which they are used. For example, the characters 'ﺁ' and 'ﺂ' are replaced by 'آ'. All this is done using regular expressions.

The normalization of Urdu text is necessary to make it useful for the machine learning tasks. This module provides the following functionality:

  • Normalizing Single Characters
  • Normalizing Combine Characters
  • Put Spaces Before & After Digits
  • Put Spaces After Urdu Punctuations
  • Put Spaces Before & After English Words
  • Removal of Diacritics from Urdu Text
urduhack.normalization.normalize_characters(text: str) → str[source]

The most important module in Urduhack is the character module, defined in the module of the same name. You can use it separately to normalize a piece of text into the specified Urdu Unicode range (0600-06FF). To understand how this module works, one needs to understand Unicode: every character has a unique code point, and no two characters share the same one. Because the Urdu language has roots in Arabic, Persian, and Turkish, the same letter can appear with different code points from those languages, and this module converts all of them to the normal Urdu character. To make the above explanation concrete:

>>> all_fes = ['ﻑ', 'ﻒ', 'ﻓ', 'ﻔ', ]
>>> urdu_fe = 'ف'

All the characters in all_fes look the same, but they come from different languages and have different code points. Since computers deal with numbers, the same letter appearing in different places with different code points creates confusion and makes it harder to understand the context of the data. The character module eliminates this problem by replacing every character in all_fes with urdu_fe.

This provides the functionality to replace wrong Arabic characters with correct Urdu characters and fixes the combined/joined characters issue.

Replace Urdu text characters with correct Unicode characters.

Parameters:text (str) – Urdu text
Returns:Returns a str object containing normalized text.
Return type:str

Examples

>>> from urduhack.normalization import normalize_characters
>>> # Text containing characters from Arabic Unicode block
>>> text = "مجھ کو جو توڑا ﮔیا تھا"
>>> normalized_text = normalize_characters(text)
>>> # Normalized text - Arabic characters are now replaced with Urdu characters
>>> normalized_text
مجھ کو جو توڑا گیا تھا
urduhack.normalization.normalize_combine_characters(text: str) → str[source]

To normalize combine characters with single character unicode text, use the normalize_combine_characters() function in the character module.

Replace combined/joined Urdu characters with a single Unicode character

Parameters:text (str) – Urdu text
Returns:Returns a str object containing normalized text.
Return type:str

Examples

>>> from urduhack.normalization import normalize_combine_characters
>>> # In the following string, Alif ('ا') and Hamza ('ٔ ') are separate characters
>>> text = "جرأت"
>>> normalized_text = normalize_combine_characters(text)
>>> # Now Alif and Hamza are replaced by a Single Urdu Unicode Character!
>>> normalized_text
جرأت
urduhack.normalization.english_characters_space(text: str) → str[source]

Functionality to add spaces before and after English words in the given Urdu text. It is an important step in the normalization of Urdu data.

This function returns a str object which contains the original text with spaces before and after English words.

Parameters:text (str) – Urdu text
Returns:Returns a str object containing normalized text.
Return type:str

Examples

>>> from urduhack.normalization import english_characters_space
>>> text = "خاتون Aliyaنے بچوںUzma and Aliyaکے قتل کا اعترافConfession کیا ہے۔"
>>> normalized_text = english_characters_space(text)
>>> normalized_text
خاتون Aliya نے بچوں Uzma and Aliya کے قتل کا اعتراف Confession کیا ہے۔
urduhack.normalization.punctuations_space(text: str) → str[source]

Add spaces after punctuation marks used in Urdu writing

Parameters:text (str) – Urdu text
Returns:Returns a str object containing normalized text.
Return type:str

Examples

>>> from urduhack.normalization import punctuations_space
>>> text = "ہوتا ہے   ۔  ٹائپ"
>>> normalized_text = punctuations_space(text)
>>> normalized_text
ہوتا ہے۔ ٹائپ
urduhack.normalization.digits_space(text: str) → str[source]

Add spaces before and after numeric and Urdu digits

Parameters:text (str) – Urdu text
Returns:Returns a str object containing normalized text.
Return type:str

Examples

>>> from urduhack.normalization import digits_space
>>> text = "20فیصد"
>>> normalized_text = digits_space(text)
>>> normalized_text
20 فیصد
urduhack.normalization.remove_diacritics(text: str) → str[source]

Remove Urdu diacritics from text. It is an important step in pre-processing of Urdu data. This function returns a str object which contains the original text minus Urdu diacritics.

Parameters:text (str) – Urdu text
Returns:Returns a str object containing normalized text.
Return type:str

Examples

>>> from urduhack.normalization import remove_diacritics
>>> text = "شیرِ پنجاب"
>>> normalized_text = remove_diacritics(text)
>>> normalized_text
شیر پنجاب
urduhack.normalization.normalize(text: str) → str[source]

To normalize some text, all you need to do is pass in Unicode text. It will return a str with single and combined characters normalized, proper spaces after digits and punctuation, and diacritics removed.

Parameters:text (str) – Urdu text
Returns:Normalized urdu text
Return type:str
Raises:TypeError – If text param is not str Type.

Examples

>>> from urduhack import normalize
>>> text = "اَباُوگل پاکستان ﻤﯿﮟ20سال ﺳﮯ ، وسائل کی کوئی کمی نہیں ﮨﮯ۔"
>>> normalized_text = normalize(text)
>>> # The text now contains proper spaces after digits and punctuations,
>>> # normalized characters and no diacritics!
>>> normalized_text
اباوگل پاکستان ﻤﯿﮟ 20 سال ﺳﮯ ، وسائل کی کوئی کمی نہیں ﮨﮯ ۔

Tokenization

This module is another crucial part of Urduhack. It performs tokenization on text: the sentence tokenizer separates sentences from each other, converting each string into a complete sentence token. Do not confuse sentence tokens with word tokens; they are two completely different things.

This library provides a state-of-the-art word tokenizer for the Urdu language. It takes care of spaces and of where two Urdu characters should and should not be connected.

The tokenization of Urdu text is necessary to make it useful for NLP tasks. This module provides the following functionality:

  • Sentence Tokenization
  • Word Tokenization

In the tokenization module, we solved the problems related to sentence and word tokenization.

urduhack.tokenization.sentence_tokenizer(text: str) → List[str][source]

Convert Urdu text into possible sentences. If successful, this function returns a List object containing multiple Urdu sentence strings.

Parameters:text (str) – Urdu text
Returns:Returns a list object containing multiple urdu sentences type str.
Return type:list
Raises:TypeError – If text is not a str Type

Examples

>>> from urduhack.tokenization import sentence_tokenizer
>>> text = "عراق اور شام نے اعلان کیا ہے دونوں ممالک جلد اپنے اپنے سفیروں کو واپس بغداد اور دمشق بھیج دیں گے؟"
>>> sentences = sentence_tokenizer(text)
>>> sentences
["دونوں ممالک جلد اپنے اپنے سفیروں کو واپس بغداد اور دمشق بھیج دیں گے؟" ,"عراق اور شام نے اعلان کیا ہے۔"]
urduhack.tokenization.word_tokenizer(sentence: str, max_len: int = 256) → List[str][source]

To convert raw Urdu text into tokens, use the word_tokenizer() function. Before doing so, normalize the sentence with the urduhack.normalization.normalize() function. If word_tokenizer runs successfully, it returns a List object containing Urdu word tokens as strings.

Parameters:
  • sentence (str) – urdu text or list of text
  • max_len (int) – Maximum text length supported by model
Returns:

Returns a List[str] containing urdu tokens

Return type:

list

Examples

>>> sent = 'عراق اور شام نے اعلان کیا ہے دونوں ممالک جلد اپنے اپنے سفیروں کو واپس بغداد اور دمشق بھیج دیں گے؟'
>>> from urduhack.tokenization import word_tokenizer
>>> word_tokenizer(sent)
Tokens:  ['عراق', 'اور', 'شام', 'نے', 'اعلان', 'کیا', 'ہے', 'دونوں', 'ممالک', 'جلد', 'اپنے', 'اپنے', 'سفیروں', 'کو', 'واپس', 'بغداد', 'اور', 'دمشق', 'بھیج', 'دیں', 'گے؟']

Text PreProcessing

The pre-processing of Urdu text is necessary to make it useful for the machine learning tasks. This module provides the following functionality:

  • Normalize whitespace
  • Replace urls
  • Replace emails
  • Replace number
  • Replace phone_number
  • Replace currency_symbols

You can look for all the different functions that come with pre-process module in the reference here preprocess.

urduhack.preprocessing.normalize_whitespace(text: str)[source]

Given text str, replace one or more spaces with a single space and one or more line breaks with a single newline. Also strip leading/trailing whitespace.

Parameters:text (str) – Urdu text
Returns:Returns a str object containing normalized text.
Return type:str

Examples

>>> from urduhack.preprocessing import normalize_whitespace
>>> text = "عراق اور شام     اعلان کیا ہے دونوں         جلد اپنے     گے؟"
>>> normalized_text = normalize_whitespace(text)
>>> normalized_text
عراق اور شام اعلان کیا ہے دونوں جلد اپنے گے؟
urduhack.preprocessing.remove_punctuation(text: str, marks=None) → str[source]

Remove punctuation from text by removing all instances of marks.

Parameters:
  • text (str) – Urdu text
  • marks (str) – If specified, remove only the characters in this string, e.g. marks=',;:' removes commas, semi-colons, and colons. Otherwise, all punctuation marks are removed.
Returns:

returns a str object containing normalized text.

Return type:

str

Note

When marks=None, Python’s built-in str.translate() is used to remove punctuation; otherwise, a regular expression is used instead. The former’s performance is about 5-10x faster.

Examples

>>> from urduhack.preprocessing import remove_punctuation
>>> output = remove_punctuation("کر ؟ سکتی ہے۔")
>>> output
'کر سکتی ہے'
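The str.translate path mentioned in the note above can be sketched as follows. This is an illustration of the technique, not Urduhack's code, and the punctuation set (ASCII punctuation plus a few Urdu marks) is an assumption.

```python
import string

# Sketch of removing punctuation via str.translate; the mark set below is
# illustrative (ASCII punctuation plus the Urdu marks ۔ ؟ ،).
URDU_MARKS = "\u06D4\u061F\u060C"
DELETE_TABLE = str.maketrans("", "", string.punctuation + URDU_MARKS)

def strip_punct(text: str) -> str:
    return text.translate(DELETE_TABLE)
```

A deletion table built once with str.maketrans makes repeated calls cheap, which is why the translate path is the faster of the two approaches.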
urduhack.preprocessing.remove_accents(text: str) → str[source]

Remove accents from any accented unicode characters in text str, either by transforming them into ascii equivalents or removing them entirely.

Parameters:text (str) – Urdu text
Returns:str

Examples

>>> from urduhack.preprocessing import remove_accents
>>> text = "دالتِ عظمیٰ درخواست"
>>> remove_accents(text)
'دالت عظمی درخواست'

urduhack.preprocessing.replace_urls(text: str, replace_with='')[source]

Replace all URLs in text str with replace_with str.

Parameters:
  • text (str) – Urdu text
  • replace_with (str) – Replace string
Returns:

Returns a str object with URLs replaced by the replace_with text.

Return type:

str

Examples

>>> from urduhack.preprocessing import replace_urls
>>> text = "20 www.gmail.com  فیصد"
>>> replace_urls(text)
'20  فیصد'
urduhack.preprocessing.replace_emails(text: str, replace_with='')[source]

Replace all emails in text str with replace_with str.

Parameters:
  • text (str) – Urdu text
  • replace_with (str) – Replace string
Returns:

Returns a str object with emails replaced by the replace_with text.

Return type:

str

Examples

>>> text = "20 gunner@gmail.com  فیصد"
>>> from urduhack.preprocessing import replace_emails
>>> replace_emails(text)
urduhack.preprocessing.replace_numbers(text: str, replace_with='')[source]

Replace all numbers in text str with replace_with str.

Parameters:
  • text (str) – Urdu text
  • replace_with (str) – Replace string
Returns:

Returns a str object with numbers replaced by the replace_with text.

Return type:

str

Examples

>>> from urduhack.preprocessing import replace_numbers
>>> text = "20  فیصد"
>>> replace_numbers(text)
' فیصد'
urduhack.preprocessing.replace_phone_numbers(text: str, replace_with='')[source]

Replace all phone numbers in text str with replace_with str.

Parameters:
  • text (str) – Urdu text
  • replace_with (str) – Replace string
Returns:

Returns a str object with phone numbers replaced by the replace_with text.

Return type:

str

Examples

>>> from urduhack.preprocessing import replace_phone_numbers
>>> text = "یعنی لائن آف کنٹرول پر فائربندی کا معاہدہ 555-123-4567 میں ہوا تھا"
>>> replace_phone_numbers(text)
'یعنی لائن آف کنٹرول پر فائربندی کا معاہدہ میں ہوا تھا'
urduhack.preprocessing.replace_currency_symbols(text: str, replace_with=None)[source]

Replace all currency symbols in text str with string specified by replace_with str.

Parameters:
  • text (str) – Raw text
  • replace_with (str) – if None (default), replace symbols with their standard 3-letter abbreviations (e.g. ‘$’ with ‘USD’, ‘£’ with ‘GBP’); otherwise, pass in a string with which to replace all symbols (e.g. “CURRENCY”)
Returns:

Returns a str object containing normalized text.

Return type:

str

Examples

>>> from urduhack.preprocessing import replace_currency_symbols
>>> text = "یعنی لائن آف کنٹرول پر فائربندی کا معاہدہ 2003 میں ہوا 33$ تھا۔"
>>> replace_currency_symbols(text)
'یعنی لائن آف کنٹرول پر فائربندی کا معاہدہ 2003 میں ہوا 33USD تھا۔'

urduhack.preprocessing.remove_english_alphabets(text: str)[source]

Removes English words and digits from text.

Parameters:text (str) – Urdu text
Returns:str object with English alphabets removed
Return type:str
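A rough regex-based sketch of what such a function does (this is not Urduhack's implementation): strip ASCII letters and digits, then collapse the leftover whitespace runs.

```python
import re

# Hypothetical sketch, not Urduhack's code: remove ASCII letters/digits,
# then collapse the whitespace runs they leave behind.
def remove_english_alphabets_sketch(text: str) -> str:
    text = re.sub(r"[A-Za-z0-9]+", "", text)
    return re.sub(r"\s+", " ", text).strip()
```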

Utils

Utils module

Collection of helper functions.

urduhack.utils.pickle_load(file_name: str) → Any[source]

Load the pickle file

Parameters:file_name (str) – file name
Returns:python object type
Return type:Any
urduhack.utils.pickle_dump(file_name: str, data: Any)[source]

Save the python object in pickle format

Parameters:
  • file_name (str) – file name
  • data (Any) – Any data type
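These two helpers wrap the standard library's pickle module; a self-contained equivalent sketch (written under that assumption, not copied from Urduhack's source) looks like this:

```python
import pickle
from typing import Any

# Stdlib sketch of what pickle_dump/pickle_load wrap: binary-mode file
# handles passed straight to pickle.
def pickle_dump(file_name: str, data: Any) -> None:
    with open(file_name, "wb") as fh:
        pickle.dump(data, fh)

def pickle_load(file_name: str) -> Any:
    with open(file_name, "rb") as fh:
        return pickle.load(fh)
```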
urduhack.utils.download_from_url(file_name: str, url: str, download_dir: str, cache_dir: Optional[str] = None)[source]

Download anything from HTTP url

Parameters:
  • file_name (str) – Save file as provided file name
  • url (str) – HTTP url
  • download_dir (str) – location to store file
  • cache_dir (str) – Main download dir
Raises:

TypeError – If any of the url, file_path and file_name are not str Type.

urduhack.utils.remove_file(file_name: str)[source]

Delete the local file

Parameters:

file_name (str) – File to be deleted

Raises:

About

Authors

Ikram Ali (Core contributor)

A machine learning practitioner and avid learner with professional experience managing Python, PHP, and Javascript projects, and excellent machine learning / deep learning skills.

Drop me a line at mrikram1989@gmail.com or call me at 92 3320 453648.

Goals

The author's goal is to foster and support active development of the urduhack library through:

License

Urduhack is licensed under MIT License.

If you want to support the urduhack library, please report issues here.

Release Notes

Note

Contributors please include release notes as needed or appropriate with your bug fixes, feature additions and tests.

0.2.2

Changes:

  • Word tokenizer
Urdu word tokenization functionality added. To convert a normalized Urdu sentence into possible word tokens, use the urduhack.tokenization.word_tokenizer function.

0.1.0

Changes:

  • Normalize function
Single function added to do all the normalization work. To normalize some text, import the urduhack.normalize function; it returns a string with single and combined characters normalized, proper spaces after digits and punctuation, and diacritics removed.
  • Sentence Tokenizer
Urdu sentence tokenization functionality added. To convert raw Urdu text into possible sentences, use the urduhack.tokenization.sentence_tokenizer function.

Bug fixes:

  • Fixed bugs in remove_diacritics()

0.0.2

Changes:

  • Character Level Normalization
The urduhack.normalization.character module provides the functionality to replace wrong Arabic characters with correct Urdu characters.
  • Space Normalization
The urduhack.normalization.space.util module provides functionality to put proper spaces before and after numeric digits, Urdu digits, and punctuation marks in Urdu text.
  • Diacritics Removal
    The urduhack.utils.text.remove_diacritics module in the UrduHack provides the functionality to remove Urdu diacritics from text. It is an important step in pre-processing of the Urdu data.

0.0.1

Changes:

  • Urdu character normalization API added.
  • Urdu space normalization utility functions added.
  • Correct Unicode ranges for Urdu characters added.

Deprecations and removals

This page lists urduhack features that are deprecated, or have been removed in past major releases, and gives the alternatives to use instead.
