Text PreProcessing

Urdu text must be pre-processed before it is useful for machine learning tasks. This module provides the following functionality:

  • Normalize whitespace
  • Put spaces before and after digits
  • Put spaces before and after English words
  • Put spaces before and after Urdu punctuation
  • Replace URLs
  • Replace emails
  • Replace numbers
  • Replace phone numbers
  • Replace currency symbols

For the full list of functions that come with the preprocessing module, see the preprocess reference.

urduhack.preprocessing.digits_space(text: str) → str[source]

Add spaces before and after numeric and Urdu digits.

Parameters:text (str) – Urdu text
Returns:Returns a str object containing normalized text.
Return type:str

Examples

>>> from urduhack.preprocessing import digits_space
>>> text = "20فیصد"
>>> normalized_text = digits_space(text)
>>> normalized_text
20 فیصد
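
The idea behind digit spacing can be illustrated with two regular-expression substitutions. This is a minimal sketch and not urduhack's actual implementation: the function name digits_space_sketch and the character ranges chosen for digits and letters are assumptions made for illustration.

```python
import re

# Assumption: ASCII digits plus Extended Arabic-Indic (Urdu) digits U+06F0-U+06F9.
DIGITS = r"0-9\u06F0-\u06F9"
# Assumption: these two ranges approximate the letters used in Urdu text.
URDU_LETTERS = r"\u0621-\u064A\u0671-\u06D3"


def digits_space_sketch(text: str) -> str:
    """Insert a space wherever a digit touches an Urdu letter."""
    # Digit immediately followed by a letter: "20X" becomes "20 X".
    text = re.sub(rf"([{DIGITS}])([{URDU_LETTERS}])", r"\1 \2", text)
    # Letter immediately followed by a digit: "X20" becomes "X 20".
    text = re.sub(rf"([{URDU_LETTERS}])([{DIGITS}])", r"\1 \2", text)
    return text
```

Text that is already spaced is left unchanged, because each pattern only matches a digit and a letter that are directly adjacent.
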
urduhack.preprocessing.english_characters_space(text: str) → str[source]

Add spaces before and after English words in the given Urdu text. This is an important step in normalizing Urdu data.

The function returns a str containing the original text with spaces before and after English words.

Parameters:text (str) – Urdu text
Returns:Returns a str object containing normalized text.
Return type:str

Examples

>>> from urduhack.preprocessing import english_characters_space
>>> text = "خاتون Aliyaنے بچوںUzma and Aliyaکے قتل کا اعترافConfession کیا ہے۔"
>>> normalized_text = english_characters_space(text)
>>> normalized_text
خاتون Aliya نے بچوں Uzma and Aliya کے قتل کا اعتراف Confession کیا ہے۔
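
The same boundary-based approach works for English words. The sketch below is an illustration only, not the library's code; english_space_sketch is a hypothetical name, and treating the whole Arabic block (U+0600-U+06FF) as "Urdu text" is an assumption.

```python
import re

# Assumption: the basic Arabic Unicode block stands in for Urdu characters.
URDU = r"\u0600-\u06FF"


def english_space_sketch(text: str) -> str:
    """Insert a space at every boundary between Urdu and ASCII letters."""
    # Urdu character directly followed by an English letter.
    text = re.sub(rf"([{URDU}])([A-Za-z])", r"\1 \2", text)
    # English letter directly followed by an Urdu character.
    text = re.sub(rf"([A-Za-z])([{URDU}])", r"\1 \2", text)
    return text
```

Because the Arabic block also contains Urdu punctuation, this sketch separates an English word from a following Urdu full stop as well.
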
urduhack.preprocessing.all_punctuations_space(text: str) → str[source]

Add spaces after the punctuation marks used in Urdu writing.

Parameters:text (str) – Urdu text
Returns:Returns a str object containing normalized text.
Return type:str
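
No example is given for this function, so here is a rough sketch of what adding a space after punctuation could look like. It is not the library's implementation: punctuation_space_sketch is a hypothetical name and the four marks listed are an assumed, illustrative subset.

```python
import re

# Assumption: a small illustrative set of marks (full stop, question mark,
# comma, semicolon); the library's own list may differ.
URDU_PUNCTUATION = "\u06D4\u061F\u060C\u061B"


def punctuation_space_sketch(text: str) -> str:
    """Add a space after a mark when the next character is not whitespace."""
    # The lookahead (?=\S) skips marks already followed by a space
    # and marks at the very end of the string.
    return re.sub(rf"([{URDU_PUNCTUATION}])(?=\S)", r"\1 ", text)
```
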
urduhack.preprocessing.preprocess(text: str) → str[source]

To preprocess text, pass it in as a Unicode str. The function returns a str with proper spaces after digits and punctuation.

Parameters:text (str) – Urdu text
Returns:Urdu text
Return type:str
Raises:TypeError – If text is not of type str.

Examples

>>> from urduhack.preprocessing import preprocess
>>> text = "اَباُوگل پاکستان ﻤﯿﮟ 20 سال ﺳﮯ ، وسائل کی کوئی کمی نہیں ﮨﮯ۔"
>>> normalized_text = preprocess(text)
>>> # The text now contains proper spaces after digits and punctuations,
>>> # normalized characters and no diacritics!
>>> normalized_text
اباوگل پاکستان ﻤﯿﮟ 20 سال ﺳﮯ ، وسائل کی کوئی کمی نہیں ﮨﮯ ۔
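
A function like this can be thought of as a type check followed by a chain of smaller steps. The sketch below shows that shape only. The names run_pipeline, space_after_digits and collapse_spaces are hypothetical, and the two steps are stand-ins rather than the steps urduhack actually applies.

```python
import re
from typing import Callable, Iterable


def run_pipeline(text: str, steps: Iterable[Callable[[str], str]]) -> str:
    """Validate the input type, then apply each step in order."""
    if not isinstance(text, str):
        raise TypeError("text must be str, got " + type(text).__name__)
    for step in steps:
        text = step(text)
    return text


def space_after_digits(text: str) -> str:
    # Hypothetical step: put a space between a digit and a following letter.
    return re.sub(r"(\d)([^\W\d_])", r"\1 \2", text)


def collapse_spaces(text: str) -> str:
    # Hypothetical step: squeeze runs of spaces and trim both ends.
    return re.sub(r" +", " ", text).strip()
```

Calling run_pipeline(text, [space_after_digits, collapse_spaces]) applies the steps from left to right, so their order matters.
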
urduhack.preprocessing.normalize_whitespace(text: str)[source]

Given a text str, replace each run of spaces with a single space and each run of line breaks with a single newline. Also strip leading and trailing whitespace.

Parameters:text (str) – Urdu text
Returns:Returns a str object containing normalized text.
Return type:str

Examples

>>> from urduhack.preprocessing import normalize_whitespace
>>> text = "عراق اور شام     اعلان کیا ہے دونوں         جلد اپنے     گے؟"
>>> normalized_text = normalize_whitespace(text)
>>> normalized_text
عراق اور شام اعلان کیا ہے دونوں جلد اپنے گے؟
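
The behaviour described above can be approximated with two substitutions and a strip. This is a sketch under assumptions, not the library's code; normalize_whitespace_sketch is a hypothetical name.

```python
import re


def normalize_whitespace_sketch(text: str) -> str:
    """Collapse line-break runs and space runs, then trim the ends."""
    # One or more line breaks (any common style) become a single newline.
    text = re.sub(r"(?:\r\n|\r|\n)+", "\n", text)
    # One or more whitespace characters other than newline become one space.
    text = re.sub(r"[^\S\n]+", " ", text)
    return text.strip()
```

Line breaks are handled first so that the second pattern never has to deal with carriage returns.
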
urduhack.preprocessing.remove_punctuation(text: str, marks=None) → str[source]

Remove punctuation from text by removing all instances of marks.

Parameters:
  • text (str) – Urdu text
  • marks (str) – If specified, remove only the characters in this string, e.g. marks=',;:' removes commas, semi-colons, and colons. Otherwise, all punctuation marks are removed.
Returns:Returns a str object containing normalized text.
Return type:str

Note

When marks=None, Python’s built-in str.translate() is used to remove punctuation; otherwise, a regular expression is used instead. The former is about 5-10x faster.

Examples

>>> from urduhack.preprocessing import remove_punctuation
>>> output = remove_punctuation("کر ؟ سکتی ہے۔")
>>> output
کر سکتی ہے
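
The two code paths mentioned in the note can be sketched as follows. This illustrates the technique (a translation table versus a regular-expression character class) and is not the library's source; remove_punctuation_sketch and PUNCT_TABLE are hypothetical names, and defining "punctuation" as every Unicode category starting with "P" is an assumption.

```python
import re
import sys
import unicodedata
from typing import Optional

# Translation table mapping every Unicode punctuation code point to None,
# which tells str.translate() to delete that character.
PUNCT_TABLE = dict.fromkeys(
    cp for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)).startswith("P")
)


def remove_punctuation_sketch(text: str, marks: Optional[str] = None) -> str:
    if marks:
        # Only the given characters: build an escaped character class.
        return re.sub("[" + re.escape(marks) + "]+", "", text)
    # All punctuation: a single pass with the prebuilt table.
    return text.translate(PUNCT_TABLE)
```

The table is built once at import time, so each later call with marks=None only pays for the translate pass.
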
urduhack.preprocessing.remove_accents(text: str) → str[source]

Remove accents from any accented unicode characters in text str, either by transforming them into ascii equivalents or removing them entirely.

Parameters:text (str) – Urdu text
Returns:Returns a str object with accents removed.
Return type:str

Examples

>>> from urduhack.preprocessing import remove_accents
>>> text = "دالتِ عظمیٰ درخواست"
>>> remove_accents(text)
'دالت عظمی درخواست'
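
One common way to strip diacritics is to drop every character whose Unicode category is "Mn" (nonspacing mark). The sketch below does exactly that and nothing more. It is an assumption about a workable approach, not the library's implementation, and remove_diacritics_sketch is a hypothetical name.

```python
import unicodedata


def remove_diacritics_sketch(text: str) -> str:
    """Drop nonspacing marks such as zer, zabar, pesh and superscript alef."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
```

The sketch deliberately skips Unicode decomposition (NFD/NFKD). Decomposing first would also split alef with madda (U+0622) into a plain alef plus a combining mark, and the letter would then lose its madda. The trade-off is that precomposed Latin letters such as "é" are left untouched.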

urduhack.preprocessing.replace_urls(text: str, replace_with='')[source]

Replace all URLs in text str with replace_with str.

Parameters:
  • text (str) – Urdu text
  • replace_with (str) – Replace string
Returns:Returns a str object in which every URL is replaced with the replace_with text.
Return type:str

Examples

>>> from urduhack.preprocessing import replace_urls
>>> text = "20 www.gmail.com  فیصد"
>>> replace_urls(text)
'20  فیصد'
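
URL replacement is typically a single regular-expression substitution. The pattern below is a deliberately simple assumption that only recognizes tokens starting with http://, https:// or www.; it is not the pattern urduhack uses, and replace_urls_sketch is a hypothetical name.

```python
import re

# Assumption: a URL starts with a scheme or "www." and runs to the next space.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def replace_urls_sketch(text: str, replace_with: str = "") -> str:
    # A function replacement keeps replace_with literal (no backslash escapes).
    return URL_RE.sub(lambda match: replace_with, text)
```

With the default empty replacement, the whitespace on both sides of the URL is kept, so a whitespace normalization step is usually applied afterwards.
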
urduhack.preprocessing.replace_emails(text: str, replace_with='')[source]

Replace all emails in text str with replace_with str.

Parameters:
  • text (str) – Urdu text
  • replace_with (str) – Replace string
Returns:Returns a str object in which every email address is replaced with the replace_with text.
Return type:str

Examples

>>> from urduhack.preprocessing import replace_emails
>>> text = "20 gunner@gmail.com  فیصد"
>>> replace_emails(text)
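
Email replacement follows the same pattern as URL replacement. The expression below is a simplified assumption that covers common addresses, not a full RFC 5322 matcher and not the library's own pattern; replace_emails_sketch is a hypothetical name.

```python
import re

# Assumption: local part, "@", domain labels, and a TLD of two or more letters.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def replace_emails_sketch(text: str, replace_with: str = "") -> str:
    return EMAIL_RE.sub(lambda match: replace_with, text)
```
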
urduhack.preprocessing.replace_numbers(text: str, replace_with='')[source]

Replace all numbers in text str with replace_with str.

Parameters:
  • text (str) – Urdu text
  • replace_with (str) – Replace string
Returns:Returns a str object in which every number is replaced with the replace_with text.
Return type:str

Examples

>>> from urduhack.preprocessing import replace_numbers
>>> text = "20  فیصد"
>>> replace_numbers(text)
' فیصد'
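
Number replacement can be sketched with one pattern. In Python 3, \d in a str pattern matches any Unicode decimal digit, so Urdu digits are covered as well as ASCII ones. The pattern and the name replace_numbers_sketch are assumptions for illustration, not the library's code.

```python
import re

# A run of digits, optionally continued by "," or "." groups (e.g. 1,234.5).
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")


def replace_numbers_sketch(text: str, replace_with: str = "") -> str:
    return NUMBER_RE.sub(lambda match: replace_with, text)
```
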
urduhack.preprocessing.replace_phone_numbers(text: str, replace_with='')[source]

Replace all phone numbers in text str with replace_with str.

Parameters:
  • text (str) – Urdu text
  • replace_with (str) – Replace string
Returns:Returns a str object in which every phone number is replaced with the replace_with text.
Return type:str

Examples

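>>> from urduhack.preprocessing import replace_phone_numbers
>>> text = "یعنی لائن آف کنٹرول پر فائربندی کا معاہدہ 555-123-4567 میں ہوا تھا"
>>> replace_phone_numbers(text)
'یعنی لائن آف کنٹرول پر فائربندی کا معاہدہ میں ہوا تھا'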
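
Phone numbers need a stricter pattern than plain numbers so that years and counts are left alone. The sketch below assumes a 3-3-4 digit layout with an optional country prefix; real phone formats vary widely, so this is an illustration rather than the library's pattern. replace_phone_numbers_sketch is a hypothetical name.

```python
import re

# Assumption: optional country code, then 3-3-4 digits separated by "-", "." or space.
PHONE_RE = re.compile(
    r"(?<!\d)(?:\+?\d{1,3}[-.\s])?\d{3}[-.\s]\d{3}[-.\s]\d{4}(?!\d)"
)


def replace_phone_numbers_sketch(text: str, replace_with: str = "") -> str:
    return PHONE_RE.sub(lambda match: replace_with, text)
```

The lookbehind and lookahead stop the pattern from matching a slice of a longer digit run.
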
urduhack.preprocessing.replace_currency_symbols(text: str, replace_with=None)[source]

Replace all currency symbols in text str with string specified by replace_with str.

Parameters:
  • text (str) – Raw text
  • replace_with (str) – If None (default), replace symbols with their standard 3-letter abbreviations (e.g. ‘$’ with ‘USD’, ‘£’ with ‘GBP’); otherwise, pass in a string with which to replace all symbols (e.g. “CURRENCY”).
Returns:Returns a str object containing normalized text.
Return type:str

Examples

>>> from urduhack.preprocessing import replace_currency_symbols
>>> text = "یعنی لائن آف کنٹرول پر فائربندی کا معاہدہ 2003 میں ہوا 33$ تھا۔"
>>> replace_currency_symbols(text)
'یعنی لائن آف کنٹرول پر فائربندی کا معاہدہ 2003 میں ہوا 33USD تھا۔'
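
The two modes of replace_with (per-symbol abbreviation versus one fixed string) can be sketched with a lookup table. The table below is an assumption: the '$' and '£' entries come from the parameter description above, the '€' entry is added for illustration, and the library's real table is likely larger. replace_currency_symbols_sketch is a hypothetical name.

```python
import re
from typing import Optional

# Assumption: a tiny illustrative symbol-to-code table.
CURRENCY_CODES = {"$": "USD", "£": "GBP", "€": "EUR"}
CURRENCY_RE = re.compile("[" + re.escape("".join(CURRENCY_CODES)) + "]")


def replace_currency_symbols_sketch(text: str, replace_with: Optional[str] = None) -> str:
    if replace_with is None:
        # Default mode: each symbol becomes its own 3-letter code.
        return CURRENCY_RE.sub(lambda m: CURRENCY_CODES[m.group()], text)
    # Fixed mode: every symbol becomes the same replacement string.
    return CURRENCY_RE.sub(lambda m: replace_with, text)
```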

urduhack.preprocessing.remove_english_alphabets(text: str)[source]

Remove English words and digits from the text.

Parameters:text (str) – Urdu text
Returns:Returns a str object with English letters and digits removed.
Return type:str
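
No example is given for this function. A minimal sketch of the idea, assuming "English words and digits" means runs of ASCII letters and ASCII digits, is shown below; remove_english_sketch is a hypothetical name and this is not the library's implementation.

```python
import re


def remove_english_sketch(text: str) -> str:
    """Delete every run of ASCII letters and ASCII digits."""
    return re.sub(r"[A-Za-z0-9]+", "", text)
```

The spaces that surrounded the removed tokens stay in place, so the result usually needs a whitespace normalization pass afterwards.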