Text PreProcessing

Urdu text must be pre-processed before it is useful for machine learning tasks. This module provides the following functionality:

Normalize whitespace

Put spaces before and after digits

Put spaces before and after English words

Put spaces before and after Urdu punctuation

Replace URLs

Replace emails

Replace numbers

Replace phone numbers

Replace currency symbols

All functions provided by the pre-processing module are listed in the preprocess reference.

urduhack.preprocessing.digits_space(text: str) → str

Add spaces before and after numeric and Urdu digits.

Parameters:	text (str) – `Urdu` text
Returns:	Returns a `str` object containing normalized text.
Return type:	str

Examples

>>> from urduhack.preprocessing import digits_space
>>> text = "20فیصد"
>>> normalized_text = digits_space(text)
>>> normalized_text
20 فیصد
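The spacing rule described above can be approximated with two zero-width regular expressions. This is a minimal standard-library sketch, not urduhack's implementation: the helper name `space_around_digits` and the digit ranges (ASCII plus the Extended Arabic-Indic digits U+06F0 to U+06F9) are assumptions made for illustration.

```python
import re

# Assumed digit set: ASCII 0-9 plus Extended Arabic-Indic digits (U+06F0-U+06F9).
_DIGITS = "0-9\u06F0-\u06F9"

# Zero-width position between a digit and a following non-space, non-digit character.
_AFTER_DIGITS = re.compile("(?<=[" + _DIGITS + "])(?=[^\\s" + _DIGITS + "])")
# Zero-width position between a non-space, non-digit character and a following digit.
_BEFORE_DIGITS = re.compile("(?<=[^\\s" + _DIGITS + "])(?=[" + _DIGITS + "])")


def space_around_digits(text: str) -> str:
    """Hypothetical helper: separate digit runs from adjacent text with a space."""
    text = _AFTER_DIGITS.sub(" ", text)
    return _BEFORE_DIGITS.sub(" ", text)


# Same input as the documented example, written with escapes for clarity.
print(space_around_digits("20\u0641\u06CC\u0635\u062F"))
```

Note that this simple rule also splits decimals such as `3.5`, so a production version would need to treat number-internal punctuation separately.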

urduhack.preprocessing.english_characters_space(text: str) → str

Add spaces before and after English words in the given Urdu text. This is an important step in normalizing Urdu data.

The function returns a str containing the original text with spaces before and after English words.

Parameters:	text (str) – `Urdu` text
Returns:	Returns a `str` object containing normalized text.
Return type:	str

Examples

>>> from urduhack.preprocessing import english_characters_space
>>> text = "خاتون Aliyaنے بچوںUzma and Aliyaکے قتل کا اعترافConfession کیا ہے۔"
>>> normalized_text = english_characters_space(text)
>>> normalized_text
خاتون Aliya نے بچوں Uzma and Aliya کے قتل کا اعتراف Confession کیا ہے۔

urduhack.preprocessing.all_punctuations_space(text: str) → str

Add spaces after punctuation marks used in Urdu writing.

Parameters:	text (str) – `Urdu` text
Returns:	Returns a `str` object containing normalized text.
Return type:	str
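No example is given for this function, so the following is a rough stand-in showing what "add a space after punctuation" might look like with the standard library alone. The helper name `space_after_punctuation` and the punctuation set are assumptions; the set urduhack actually uses may differ.

```python
import re

# Assumed punctuation set: Arabic comma (U+060C), Arabic semicolon (U+061B),
# Arabic question mark (U+061F), Urdu full stop (U+06D4), plus ASCII "!" and ":".
URDU_PUNCTUATION = "\u060C\u061B\u061F\u06D4!:"

# A punctuation mark immediately followed by a non-space character.
_PUNCT_THEN_TEXT = re.compile("([" + re.escape(URDU_PUNCTUATION) + r"])(?=\S)")


def space_after_punctuation(text: str) -> str:
    """Hypothetical helper: insert one space after each mark that lacks one."""
    return _PUNCT_THEN_TEXT.sub(r"\1 ", text)


# A question mark and a comma that are glued to the following word.
print(space_after_punctuation("\u06C1\u06D2\u061F\u0645\u06CC\u06BA\u060C\u06C1\u0627\u06BA"))
```

A mark at the very end of the text is left alone, because the lookahead requires a following non-space character.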

urduhack.preprocessing.preprocess(text: str) → str

To preprocess text, pass it in as a Unicode str. The function returns a str with proper spaces after digits and punctuation.

Parameters:	text (str) – `Urdu` text
Returns:	Urdu text
Return type:	str
Raises:	`TypeError` – If text is not of type str.

Examples

>>> from urduhack.preprocessing import preprocess
>>> text = "اَباُوگل پاکستان ﻤﯿﮟ 20 سال ﺳﮯ ، وسائل کی کوئی کمی نہیں ﮨﮯ۔"
>>> normalized_text = preprocess(text)
>>> # The text now contains proper spaces after digits and punctuations,
>>> # normalized characters and no diacritics!
>>> normalized_text
اباوگل پاکستان ﻤﯿﮟ 20 سال ﺳﮯ ، وسائل کی کوئی کمی نہیں ﮨﮯ ۔

urduhack.preprocessing.normalize_whitespace(text: str)

Replace each run of one or more spaces in text with a single space, and each run of one or more line breaks with a single newline. Leading and trailing whitespace is also stripped.

Parameters:	text (str) – `Urdu` text
Returns:	Returns a `str` object containing normalized text.
Return type:	str

Examples

>>> from urduhack.preprocessing import normalize_whitespace
>>> text = "عراق اور شام     اعلان کیا ہے دونوں         جلد اپنے     گے؟"
>>> normalized_text = normalize_whitespace(text)
>>> normalized_text
عراق اور شام اعلان کیا ہے دونوں جلد اپنے گے؟
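The behaviour described for this function (collapse space runs, collapse line-break runs, strip the ends) can be sketched with two regular expressions. This is an illustrative approximation under that description, not the library's code; `collapse_whitespace` is a hypothetical name.

```python
import re

# Any run of whitespace characters other than a newline.
_SPACE_RUN = re.compile(r"[^\S\n]+")
# Any run of newlines.
_NEWLINE_RUN = re.compile(r"\n+")


def collapse_whitespace(text: str) -> str:
    """Hypothetical helper mirroring the documented whitespace rules."""
    text = text.replace("\r\n", "\n")      # treat Windows line endings as newlines
    text = _NEWLINE_RUN.sub("\n", text)    # many line breaks -> one newline
    text = _SPACE_RUN.sub(" ", text)       # many spaces/tabs -> one space
    return text.strip()                    # drop leading/trailing whitespace


print(repr(collapse_whitespace("  a    b \n\n\n c  ")))
```

Line breaks are handled before spaces so that a newline is never turned into a space.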

urduhack.preprocessing.remove_punctuation(text: str, marks=None) → str

Remove punctuation from text by removing all instances of marks.

Parameters:	text (str) – Urdu text
	marks (str) – If specified, remove only the characters in this string, e.g. `marks=',;:'` removes commas, semi-colons, and colons. Otherwise, all punctuation marks are removed.
Returns:	Returns a `str` object containing normalized text.
Return type:	str

Note

When marks=None, Python's built-in str.translate() is used to remove punctuation; otherwise, a regular expression is used instead. The former is about 5-10x faster.

Examples

>>> from urduhack.preprocessing import remove_punctuation
>>> output = remove_punctuation("کر ؟ سکتی ہے۔")
>>> output
کر سکتی ہے
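The two code paths mentioned in the note can be sketched as follows. This is an assumption-laden illustration rather than urduhack's source: the helper name `strip_punctuation` is hypothetical, "punctuation" is taken to mean every code point whose Unicode category starts with `P`, and the sketch deletes marks outright.

```python
import re
import sys
import unicodedata

# Translation table mapping every Unicode punctuation code point to None (delete).
# Assumption: "punctuation" means Unicode general categories starting with "P".
_PUNCT_TABLE = {
    cp: None
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)).startswith("P")
}


def strip_punctuation(text: str, marks=None) -> str:
    """Hypothetical helper: remove all punctuation, or only the given marks."""
    if marks is None:
        # Fast path: one table lookup per character.
        return text.translate(_PUNCT_TABLE)
    if not marks:
        # Empty string: nothing to remove.
        return text
    # Slower path: build a character class from the requested marks.
    return re.sub("[" + re.escape(marks) + "]+", "", text)


print(strip_punctuation("a, b; c."))
print(strip_punctuation("a, b; c.", marks=","))
```

The table is built once at import time, which is what makes the `str.translate()` path cheap on repeated calls.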

urduhack.preprocessing.remove_accents(text: str) → str

Remove accents from any accented Unicode characters in text, either by transforming them into ASCII equivalents or by removing them entirely.

Parameters:	text (str) – Urdu text
Returns:	str

Examples

>>> from urduhack.preprocessing import remove_accents
>>> text = "دالتِ عظمیٰ درخواست"
>>> remove_accents(text)
'دالت عظمی درخواست'
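One common way to drop accents and diacritics is to decompose the text and discard combining marks. The sketch below uses only `unicodedata`; it is an assumption about how such a step could work, not a statement of how urduhack implements it, and `strip_combining_marks` is a hypothetical name.

```python
import unicodedata


def strip_combining_marks(text: str) -> str:
    """Hypothetical helper: decompose characters, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Latin example: U+00E9 decomposes into "e" plus a combining acute accent.
print(strip_combining_marks("caf\u00e9"))
# Arabic-script example: ain (U+0639) followed by kasra (U+0650).
print(strip_combining_marks("\u0639\u0650"))
```

A caveat for Urdu: letters with a canonical decomposition, such as U+0622 (alef with madda above), also lose their mark under this approach, which may not be desirable.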

urduhack.preprocessing.replace_urls(text: str, replace_with='')

Replace all URLs in text str with replace_with str.

Parameters:	text (str) – `Urdu` text
	replace_with (str) – Replacement string
Returns:	Returns a `str` object with URLs replaced by the `replace_with` text.
Return type:	str

Examples

>>> from urduhack.preprocessing import replace_urls
>>> text = "20 www.gmail.com  فیصد"
>>> replace_urls(text)
'20  فیصد'

urduhack.preprocessing.replace_emails(text: str, replace_with='')

Replace all emails in text str with replace_with str.

Parameters:	text (str) – `Urdu` text
	replace_with (str) – Replacement string
Returns:	Returns a `str` object with emails replaced by the `replace_with` text.
Return type:	str

Examples

>>> from urduhack.preprocessing import replace_emails
>>> text = "20 gunner@gmail.com  فیصد"
>>> replace_emails(text)
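Since the example above shows no output, here is a small standard-library sketch of the same idea. The pattern is a deliberately simple approximation of an email address and the helper name `mask_emails` is hypothetical; the pattern urduhack uses is not shown in this document and may match differently.

```python
import re

# Simplified email pattern (assumption): local part, "@", domain, dot, TLD.
_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_emails(text: str, replace_with: str = "") -> str:
    """Hypothetical helper: replace every email-like token with replace_with."""
    # A function replacement avoids backslash interpretation in replace_with.
    return _EMAIL.sub(lambda match: replace_with, text)


print(repr(mask_emails("a gunner@gmail.com b")))
print(repr(mask_emails("a gunner@gmail.com b", replace_with="<EMAIL>")))
```

With the default empty replacement, the spaces that surrounded the address are left in place, so a whitespace-normalizing step afterwards is usually worthwhile.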

urduhack.preprocessing.replace_numbers(text: str, replace_with='')

Replace all numbers in text str with replace_with str.

Parameters:	text (str) – `Urdu` text
	replace_with (str) – Replacement string
Returns:	Returns a `str` object with numbers replaced by the `replace_with` text.
Return type:	str

Examples

>>> from urduhack.preprocessing import replace_numbers
>>> text = "20  فیصد"
>>> replace_numbers(text)
' فیصد'

urduhack.preprocessing.replace_phone_numbers(text: str, replace_with='')

Replace all phone numbers in text str with replace_with str.

Parameters:	text (str) – `Urdu` text
	replace_with (str) – Replacement string
Returns:	Returns a `str` object with phone numbers replaced by the `replace_with` text.
Return type:	str

Examples

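>>> from urduhack.preprocessing import replace_phone_numbers
>>> text = "یعنی لائن آف کنٹرول پر فائربندی کا معاہدہ 555-123-4567 میں ہوا تھا"
>>> replace_phone_numbers(text)
'یعنی لائن آف کنٹرول پر فائربندی کا معاہدہ میں ہوا تھا'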

urduhack.preprocessing.replace_currency_symbols(text: str, replace_with=None)

Replace all currency symbols in text with the string specified by replace_with.

Parameters:	text (str) – Raw text
	replace_with (str) – If None (default), replace symbols with their standard 3-letter abbreviations (e.g. '$' with 'USD', '£' with 'GBP'); otherwise, pass in a string with which to replace all symbols (e.g. "CURRENCY")
Returns:	Returns a `str` object containing normalized text.
Return type:	str

Examples

>>> from urduhack.preprocessing import replace_currency_symbols
>>> text = "یعنی لائن آف کنٹرول پر فائربندی کا معاہدہ 2003 میں ہوا 33$ تھا۔"
>>> replace_currency_symbols(text)
'یعنی لائن آف کنٹرول پر فائربندی کا معاہدہ 2003 میں ہوا 33USD تھا۔'
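The two modes of `replace_with` (per-symbol abbreviation versus one fixed string) can be illustrated with a small lookup table. This is a sketch under assumptions: the table below covers only three symbols, and the helper name `swap_currency_symbols` is hypothetical; the library's own table is not reproduced here.

```python
import re

# Illustrative subset of symbols and their ISO 4217 codes (assumed, not exhaustive).
CURRENCY_CODES = {"$": "USD", "£": "GBP", "€": "EUR"}

# Character class matching any symbol in the table.
_CURRENCY = re.compile("[" + re.escape("".join(CURRENCY_CODES)) + "]")


def swap_currency_symbols(text: str, replace_with=None) -> str:
    """Hypothetical helper mirroring the documented replace_with behaviour."""
    if replace_with is None:
        # Default mode: each symbol becomes its own 3-letter code.
        return _CURRENCY.sub(lambda m: CURRENCY_CODES[m.group(0)], text)
    # Fixed mode: every symbol becomes the same string.
    return _CURRENCY.sub(lambda m: replace_with, text)


print(swap_currency_symbols("33$"))
print(swap_currency_symbols("33$", replace_with="CURRENCY"))
```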

urduhack.preprocessing.remove_english_alphabets(text: str)

Remove English words and digits from text.

Parameters:	text (str) – Urdu text
Returns:	`str` object with English letters and digits removed
Return type:	str
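No example accompanies this function, so the following sketch shows one plausible reading of "remove English words and digits" using only the standard library. The helper name `drop_english` is hypothetical, and the clean-up of leftover spaces is an added assumption; the library may leave spacing untouched.

```python
import re

# Runs of ASCII letters and ASCII digits.
_ASCII_ALNUM = re.compile(r"[A-Za-z0-9]+")
# Two or more consecutive spaces left behind after removal.
_EXTRA_SPACES = re.compile(r" {2,}")


def drop_english(text: str) -> str:
    """Hypothetical helper: delete ASCII letters/digits, then tidy spaces."""
    without_english = _ASCII_ALNUM.sub("", text)
    return _EXTRA_SPACES.sub(" ", without_english).strip()


# "abc 123" followed by two Urdu letters (alef, beh).
print(drop_english("abc 123 \u0627\u0628"))
```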