Text PreProcessing
Urdu text must be pre-processed before it is useful for machine learning tasks. This module provides the following functionality:
- Normalize whitespace
- Put Spaces Before & After Digits
- Put Spaces Before & After English Words
- Put Spaces Before & After Urdu Punctuations
- Replace urls
- Replace emails
- Replace number
- Replace phone_number
- Replace currency_symbols
All the functions that come with the pre-processing module are listed in the preprocess reference.
-
urduhack.preprocessing.digits_space(text: str) → str
Add spaces before and after numeric and Urdu digits.
Parameters: text (str) – Urdu text
Returns: A str object containing normalized text.
Return type: str
Examples
>>> from urduhack.preprocessing import digits_space
>>> text = "20فیصد"
>>> normalized_text = digits_space(text)
>>> normalized_text
20 فیصد
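The digit-spacing idea can be illustrated with a plain regular expression. This is a minimal sketch, not urduhack's implementation: the helper name add_space_around_digits and the choice of digit ranges are assumptions made for the example.

```python
import re

# Hypothetical helper for illustration only (not part of urduhack).
# Assumed digit set: ASCII 0-9, Arabic-Indic (U+0660-U+0669) and
# Extended Arabic-Indic (U+06F0-U+06F9) digits.
_DIGIT_RUN = re.compile(r"([0-9\u0660-\u0669\u06F0-\u06F9]+)")


def add_space_around_digits(text: str) -> str:
    """Put one space before and after every run of digits."""
    # Pad each digit run with spaces on both sides.
    spaced = _DIGIT_RUN.sub(r" \1 ", text)
    # Collapse the doubled spaces the padding may have created.
    return re.sub(r" {2,}", " ", spaced).strip()


print(add_space_around_digits("20فیصد"))
```

Because the pattern only looks at digit runs, a decimal such as 3.5 would be split around the dot; a fuller version would need to treat separators explicitly.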
-
urduhack.preprocessing.english_characters_space(text: str) → str
Add spaces before and after English words in the given Urdu text. This is an important step in normalizing Urdu data.
The function returns a str object which contains the original text with spaces before and after English words.
Parameters: text (str) – Urdu text
Returns: A str object containing normalized text.
Return type: str
Examples
>>> from urduhack.preprocessing import english_characters_space
>>> text = "خاتون Aliyaنے بچوںUzma and Aliyaکے قتل کا اعترافConfession کیا ہے۔"
>>> normalized_text = english_characters_space(text)
>>> normalized_text
خاتون Aliya نے بچوں Uzma and Aliya کے قتل کا اعتراف Confession کیا ہے۔
-
urduhack.preprocessing.all_punctuations_space(text: str) → str
Add spaces after punctuation marks used in Urdu writing.
Parameters: text (str) – Urdu text
Returns: A str object containing normalized text.
Return type: str
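This entry has no example, so here is a small standalone sketch of the general technique: insert a space after an Urdu punctuation mark when the next character is not already whitespace. The helper name space_after_punctuation and the four marks it covers are assumptions for illustration; the library may handle a different set and may also add spaces before marks.

```python
import re

# Hypothetical helper for illustration only (not part of urduhack).
# Assumed marks: Arabic full stop (U+06D4), Arabic comma (U+060C),
# Arabic question mark (U+061F) and Arabic semicolon (U+061B).
_PUNCT_THEN_TEXT = re.compile(r"([\u06D4\u060C\u061F\u061B])(?=\S)")


def space_after_punctuation(text: str) -> str:
    """Insert a space after a mark that is directly followed by text."""
    # The lookahead (?=\S) skips marks already followed by whitespace
    # and marks at the very end of the string.
    return _PUNCT_THEN_TEXT.sub(r"\1 ", text)


print(space_after_punctuation("ہے؟ہاں۔"))
```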
-
urduhack.preprocessing.preprocess(text: str) → str
To preprocess some text, all you need to do is pass unicode text. It will return a str with proper spaces after digits and punctuations.
Parameters: text (str) – Urdu text
Returns: Urdu text
Return type: str
Raises: TypeError – If the text parameter is not of str type.
Examples
>>> from urduhack.preprocessing import preprocess
>>> text = "اَباُوگل پاکستان ﻤﯿﮟ 20 سال ﺳﮯ ، وسائل کی کوئی کمی نہیں ﮨﮯ۔"
>>> normalized_text = preprocess(text)
>>> # The text now contains proper spaces after digits and punctuations,
>>> # normalized characters and no diacritics!
>>> normalized_text
اباوگل پاکستان ﻤﯿﮟ 20 سال ﺳﮯ ، وسائل کی کوئی کمی نہیں ﮨﮯ ۔
-
urduhack.preprocessing.normalize_whitespace(text: str)
Given text str, replace one or more spacings with a single space, and one or more linebreaks with a single newline. Also strip leading/trailing whitespace.
Parameters: text (str) – Urdu text
Returns: A str object containing normalized text.
Return type: str
Examples
>>> from urduhack.preprocessing import normalize_whitespace
>>> text = "عراق اور شام اعلان کیا ہے دونوں جلد اپنے گے؟"
>>> normalized_text = normalize_whitespace(text)
>>> normalized_text
عراق اور شام اعلان کیا ہے دونوں جلد اپنے گے؟
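The three steps described above (collapse linebreaks, collapse spacing, strip) can be sketched with two regular expressions. The helper name collapse_whitespace and the exact patterns are assumptions for this example, not the library's code.

```python
import re

# Hypothetical helper for illustration only (not part of urduhack).
_LINEBREAKS = re.compile(r"(?:\r\n|\r|\n)+")
# Any whitespace character except carriage return and newline.
_SPACING = re.compile(r"[^\S\r\n]+")


def collapse_whitespace(text: str) -> str:
    """Collapse linebreak runs to one newline and spacing runs to one space."""
    text = _LINEBREAKS.sub("\n", text)   # step 1: linebreaks
    text = _SPACING.sub(" ", text)       # step 2: spaces and tabs
    return text.strip()                  # step 3: leading/trailing whitespace


print(repr(collapse_whitespace("  عراق   اور\n\n\nشام\t ")))
```

Handling linebreaks first keeps the two classes separate, so a newline is never turned into a space.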
-
urduhack.preprocessing.remove_punctuation(text: str, marks=None) → str
Remove punctuation from text by removing all instances of marks.
Parameters: text (str), marks (default None)
Returns: A str object containing normalized text.
Return type: str
Note
When marks=None, Python's built-in str.translate() is used to remove punctuation; otherwise, a regular expression is used instead. The former is about 5-10x faster.
Examples
>>> from urduhack.preprocessing import remove_punctuation
>>> output = remove_punctuation("کر ؟ سکتی ہے۔")
>>> output
کر سکتی ہے
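The note above contrasts two strategies. The following sketch shows how such a split might look: a precomputed translation table when no marks are given, and a character-class regular expression otherwise. The helper name strip_punctuation and the use of Unicode category "P" to define punctuation are assumptions, not the library's implementation.

```python
import re
import sys
import unicodedata

# Hypothetical helper for illustration only (not part of urduhack).
# Table mapping every code point in a Unicode punctuation category
# (Pc, Pd, Ps, Pe, Pi, Pf, Po) to None, i.e. "delete this character".
_PUNCT_TABLE = dict.fromkeys(
    cp for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)).startswith("P")
)


def strip_punctuation(text: str, marks=None) -> str:
    """Delete punctuation; restrict to the characters in marks if given."""
    if marks:
        # Slower path: build a character class from the requested marks.
        return re.sub("[{}]+".format(re.escape(marks)), "", text)
    # Fast path: a single pass with str.translate().
    return text.translate(_PUNCT_TABLE)


print(strip_punctuation("کر؟ سکتی ہے۔"))
print(strip_punctuation("کر؟ سکتی ہے۔", marks="؟"))
```

The table is built once at import time, which is what makes the default path cheap on repeated calls.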
-
urduhack.preprocessing.remove_accents(text: str) → str
Remove accents from any accented unicode characters in text str, either by transforming them into ascii equivalents or removing them entirely.
Parameters: text (str) – Urdu text
Returns: str
Examples
>>> from urduhack.preprocessing import remove_accents
>>> text = "دالتِ عظمیٰ درخواست"
>>> remove_accents(text)
'دالت عظمی درخواست'
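One common way to drop diacritics is to filter out characters whose Unicode category is "Mn" (nonspacing mark). The sketch below takes that approach; the helper name strip_diacritics is an assumption, and the library's own function may work differently (for example by also mapping characters to ASCII).

```python
import unicodedata

# Hypothetical helper for illustration only (not part of urduhack).


def strip_diacritics(text: str) -> str:
    """Drop nonspacing marks such as zer (U+0650) or superscript alef (U+0670)."""
    # No normalization is applied first, so precomposed letters such as
    # alef with madda (U+0622) are left untouched.
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


print(strip_diacritics("عِظمیٰ"))
```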
-
urduhack.preprocessing.
replace_urls(text: str, replace_with='')
Replace all URLs in text str with replace_with str.
Parameters: text (str), replace_with (str)
Returns: A str object with URLs replaced by replace_with text.
Return type: str
Examples
>>> from urduhack.preprocessing import replace_urls
>>> text = "20 www.gmail.com فیصد"
>>> replace_urls(text)
'20 فیصد'
-
urduhack.preprocessing.replace_emails(text: str, replace_with='')
Replace all emails in text str with replace_with str.
Parameters: text (str), replace_with (str)
Returns: A str object with emails replaced by replace_with text.
Return type: str
Examples
>>> from urduhack.preprocessing import replace_emails
>>> text = "20 gunner@gmail.com فیصد"
>>> replace_emails(text)
-
urduhack.preprocessing.replace_numbers(text: str, replace_with='')
Replace all numbers in text str with replace_with str.
Parameters: text (str), replace_with (str)
Returns: A str object with numbers replaced by replace_with text.
Return type: str
Examples
>>> from urduhack.preprocessing import replace_numbers
>>> text = "20 فیصد"
>>> replace_numbers(text)
' فیصد'
-
urduhack.preprocessing.replace_phone_numbers(text: str, replace_with='')
Replace all phone numbers in text str with replace_with str.
Parameters: text (str), replace_with (str)
Returns: A str object with phone numbers replaced by replace_with text.
Return type: str
Examples
>>> from urduhack.preprocessing import replace_phone_numbers
>>> text = "یعنی لائن آف کنٹرول پر فائربندی کا معاہدہ 555-123-4567 میں ہوا تھا"
>>> replace_phone_numbers(text)
'یعنی لائن آف کنٹرول پر فائربندی کا معاہدہ میں ہوا تھا'
-
urduhack.preprocessing.replace_currency_symbols(text: str, replace_with=None)
Replace all currency symbols in text str with the string specified by replace_with str.
Parameters: text (str), replace_with (default None)
Returns: A str object containing normalized text.
Return type: str
Examples
>>> from urduhack.preprocessing import replace_currency_symbols
>>> text = "یعنی لائن آف کنٹرول پر فائربندی کا معاہدہ 2003 میں ہوا 33$ تھا۔"
>>> replace_currency_symbols(text)
'یعنی لائن آف کنٹرول پر فائربندی کا معاہدہ 2003 میں ہوا 33USD تھا۔'
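The example output above suggests that, when replace_with is None, each symbol is swapped for a currency code (here $ becomes USD). A minimal sketch of that behaviour follows; the helper name swap_currency_symbols and the four-symbol mapping are assumptions for illustration, not the library's actual table.

```python
import re

# Hypothetical helper for illustration only (not part of urduhack).
# Assumed, deliberately small symbol-to-code mapping.
_CURRENCY_CODES = {"$": "USD", "€": "EUR", "£": "GBP", "¥": "JPY"}
# Character class built from the mapping keys, e.g. [\$€£¥].
_CURRENCY_RE = re.compile("[" + re.escape("".join(_CURRENCY_CODES)) + "]")


def swap_currency_symbols(text: str, replace_with=None) -> str:
    """Replace symbols with their codes, or with replace_with if given."""
    if replace_with is None:
        # Look up the code for whichever symbol matched.
        return _CURRENCY_RE.sub(lambda m: _CURRENCY_CODES[m.group()], text)
    return _CURRENCY_RE.sub(replace_with, text)


print(swap_currency_symbols("33$"))
print(swap_currency_symbols("33$", replace_with=""))
```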