Text PreProcessing
Urdu text must be pre-processed before it is useful for machine learning tasks. This module provides the following functionality:
- Normalize whitespace
- Replace urls
- Replace emails
- Replace numbers
- Replace phone numbers
- Replace currency symbols
All the functions that come with the preprocessing module are listed in the preprocess reference.
urduhack.preprocessing.normalize_whitespace(text: str)
Given text str, replace one or more spacings with a single space, and one or more linebreaks with a single newline. Also strip leading/trailing whitespace.
Parameters: text (str) – Urdu text
Returns: A str object containing normalized text.
Return type: str
Examples
>>> from urduhack.preprocessing import normalize_whitespace
>>> text = "عراق اور شام اعلان کیا ہے دونوں جلد اپنے گے؟"
>>> normalized_text = normalize_whitespace(text)
>>> normalized_text
'عراق اور شام اعلان کیا ہے دونوں جلد اپنے گے؟'
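The behaviour described above can be approximated with the standard library alone. The following is a minimal sketch, not the library's implementation; the name normalize_whitespace_sketch and both regular expressions are assumptions made for illustration.

```python
import re

# One or more consecutive line breaks (Windows, old Mac, or Unix style).
_LINEBREAK_RE = re.compile(r"(?:\r\n|\r|\n)+")
# One or more whitespace characters that are not line breaks.
_SPACING_RE = re.compile(r"[^\S\r\n]+")


def normalize_whitespace_sketch(text: str) -> str:
    """Hypothetical stand-in that mirrors the documented behaviour."""
    text = _LINEBREAK_RE.sub("\n", text)   # collapse runs of line breaks
    text = _SPACING_RE.sub(" ", text)      # collapse runs of spaces and tabs
    return text.strip()                    # drop leading/trailing whitespace
```

Line breaks are collapsed before spacings so that the second pattern never has to reason about newlines.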
urduhack.preprocessing.remove_punctuation(text: str, marks=None) → str
Remove punctuation from text by removing all instances of marks.
Parameters:
- text (str) – Urdu text
- marks – punctuation marks to remove; defaults to None
Returns: A str object containing normalized text.
Return type: str
Note
When marks=None, Python's built-in str.translate() is used to remove punctuation; otherwise, a regular expression is used instead. The former is about 5-10x faster.
Examples
>>> from urduhack.preprocessing import remove_punctuation
>>> output = remove_punctuation("کر ؟ سکتی ہے۔")
>>> output
'کر سکتی ہے'
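The two code paths mentioned in the note can be sketched as follows. This is an illustration under stated assumptions, not the library's source: the function name, the translation table, and the choice to delete marks outright (rather than replace them with a space) are all hypothetical.

```python
import re
import sys
import unicodedata

# Translation table that maps every Unicode punctuation code point
# (general category starting with "P") to None, i.e. deletes it.
# Built once at import time; this covers Urdu marks such as U+061F and U+06D4.
_PUNCT_TABLE = {
    cp: None
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)).startswith("P")
}


def remove_punctuation_sketch(text: str, marks=None) -> str:
    """Hypothetical stand-in showing the translate vs. regex split."""
    if marks is None:
        # Fast path: a single pass over the string with a precomputed table.
        return text.translate(_PUNCT_TABLE)
    # Restricted path: remove only the characters listed in `marks`.
    pattern = "[{}]+".format(re.escape(marks))
    return re.sub(pattern, "", text)
```

Because the sketch deletes marks without touching the surrounding spaces, a mark that stood between two spaces leaves a double space behind, which a whitespace normalization step can collapse.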
urduhack.preprocessing.remove_accents(text: str) → str
Remove accents from any accented Unicode characters in text str, either by transforming them into ASCII equivalents or removing them entirely.
Parameters: text (str) – Urdu text
Returns: str
Examples
>>> from urduhack.preprocessing import remove_accents
>>> text = "دالتِ عظمیٰ درخواست"
>>> remove_accents(text)
'دالت عظمی درخواست'
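For Urdu text, removing accents mostly means dropping diacritics. One way to approximate this with the standard library is to filter out nonspacing combining marks. The sketch below is an assumption about a reasonable approach, not the library's implementation, and remove_accents_sketch is a hypothetical name.

```python
import unicodedata


def remove_accents_sketch(text: str) -> str:
    """Drop nonspacing combining marks (Unicode category "Mn").

    This covers Urdu diacritics such as zer (U+0650) and the superscript
    alef (U+0670). The text is deliberately not decomposed first, so
    precomposed letters such as U+0622 are left untouched.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
```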
urduhack.preprocessing.replace_urls(text: str, replace_with='')
Replace all URLs in text str with replace_with str.
Parameters:
- text (str) – Urdu text
- replace_with (str) – replacement string; defaults to ''
Returns: A str object with URLs replaced by replace_with.
Return type: str
Examples
>>> from urduhack.preprocessing import replace_urls
>>> text = "20 www.gmail.com فیصد"
>>> replace_urls(text)
'20 فیصد'
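To make the replace_with parameter concrete, here is a minimal standard-library sketch of URL replacement. The pattern is deliberately simple and is an assumption for illustration; the library's own pattern is likely more thorough, and replace_urls_sketch is a hypothetical name.

```python
import re

# Simplified pattern: an http(s) scheme or a leading "www." followed by
# any run of non-whitespace characters.
_URL_RE = re.compile(r"(?:https?://|www\.)\S+", flags=re.IGNORECASE)


def replace_urls_sketch(text: str, replace_with: str = "") -> str:
    """Hypothetical stand-in: swap each URL for `replace_with`."""
    # A function replacement keeps backslashes in `replace_with` literal.
    return _URL_RE.sub(lambda _match: replace_with, text)
```

Passing a placeholder token such as "<URL>" instead of the empty string keeps the information that a URL was present, which can be useful as a feature for downstream models.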
urduhack.preprocessing.replace_emails(text: str, replace_with='')
Replace all emails in text str with replace_with str.
Parameters:
- text (str) – Urdu text
- replace_with (str) – replacement string; defaults to ''
Returns: A str object with emails replaced by replace_with.
Return type: str
Examples
>>> from urduhack.preprocessing import replace_emails
>>> text = "20 gunner@gmail.com فیصد"
>>> replace_emails(text)
urduhack.preprocessing.replace_numbers(text: str, replace_with='')
Replace all numbers in text str with replace_with str.
Parameters:
- text (str) – Urdu text
- replace_with (str) – replacement string; defaults to ''
Returns: A str object with numbers replaced by replace_with.
Return type: str
Examples
>>> from urduhack.preprocessing import replace_numbers
>>> text = "20 فیصد"
>>> replace_numbers(text)
' فیصد'
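urduhack.preprocessing.replace_phone_numbers(text: str, replace_with='')
Replace all phone numbers in text str with replace_with str.
Parameters:
- text (str) – Urdu text
- replace_with (str) – replacement string; defaults to ''
Returns: A str object with phone numbers replaced by replace_with.
Return type: str
Examples
>>> from urduhack.preprocessing import replace_phone_numbers
>>> text = "یعنی لائن آف کنٹرول پر فائربندی کا معاہدہ 555-123-4567 میں ہوا تھا"
>>> replace_phone_numbers(text)
'یعنی لائن آف کنٹرول پر فائربندی کا معاہدہ میں ہوا تھا'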
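When both phone numbers and plain numbers are replaced in the same pipeline, the order of the two steps can matter. The sketch below illustrates this with two simplified, hypothetical patterns from the standard library; it does not reproduce the library's patterns, and the function name and placeholder tokens are assumptions.

```python
import re

# Simplified patterns for illustration only.
_PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")   # e.g. 555-123-4567
_NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")                  # e.g. 20 or 1,000.5


def mask_numbers_sketch(text: str, phones_first: bool = True) -> str:
    """Replace phone numbers and plain numbers with placeholder tokens."""
    steps = [(_PHONE_RE, "<PHONE>"), (_NUMBER_RE, "<NUM>")]
    if not phones_first:
        steps.reverse()
    for pattern, token in steps:
        # Bind `token` per iteration so each step inserts its own placeholder.
        text = pattern.sub(lambda _match, t=token: t, text)
    return text
```

With phones_first=False the generic number pattern consumes the digit groups first, so the phone pattern no longer finds anything to match. In Python, \d on str patterns also matches Urdu digits such as ۲ and ۰, so the same number pattern applies to them.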
urduhack.preprocessing.replace_currency_symbols(text: str, replace_with=None)
Replace all currency symbols in text str with the string specified by replace_with str.
Parameters:
- text (str) – Urdu text
- replace_with (str) – replacement string; defaults to None
Returns: A str object containing normalized text.
Return type: str
Examples
>>> from urduhack.preprocessing import replace_currency_symbols
>>> text = "یعنی لائن آف کنٹرول پر فائربندی کا معاہدہ 2003 میں ہوا 33$ تھا۔"
>>> replace_currency_symbols(text)
'یعنی لائن آف کنٹرول پر فائربندی کا معاہدہ 2003 میں ہوا 33USD تھا۔'
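The example above shows $ becoming USD when replace_with is left at None. A minimal sketch of that kind of symbol-to-code mapping follows. Only the $ to USD pair comes from the example; the other entries, the function name, and the overall structure are assumptions for illustration.

```python
import re

# Hypothetical symbol table; only "$" -> "USD" is taken from the example above.
_CURRENCY_CODES = {"$": "USD", "€": "EUR", "£": "GBP"}
_CURRENCY_RE = re.compile("[" + re.escape("".join(_CURRENCY_CODES)) + "]")


def replace_currency_symbols_sketch(text: str, replace_with=None) -> str:
    """Hypothetical stand-in for symbol replacement."""
    if replace_with is None:
        # Default: map each symbol to its three-letter code.
        return _CURRENCY_RE.sub(lambda m: _CURRENCY_CODES[m.group(0)], text)
    # Otherwise: use the same replacement string for every symbol.
    return _CURRENCY_RE.sub(lambda _m: replace_with, text)
```

Keeping the mapping in a dictionary makes it straightforward to extend the sketch with further symbols.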