Urdu has witnessed the development of significant applications such as email spam detection, genre identification, product review analysis, news categorization, fake news detection, text classification, and many more. However, Urdu is a linguistically rich and morphologically complex language, so state-of-the-art natural language processing toolkits such as Gensim, spaCy, NLTK, and CoreNLP cannot process Urdu text at all.

SETMOKE API is a language processing toolkit based on machine learning and deep learning that provides a variety of modules to manage and process Urdu text. It may help researchers and practitioners develop smart applications.

Contribution of Frameworks

SETMOKE API Provides the Following Modules

1- Pre-processing
2- Urdu Text Classification
3- Urdu Text Summarization
4- Urdu Stemmer
5- Urdu Question Classification

Preprocessing enables better extraction of non-trivial knowledge from unstructured text. A preprocessing pipeline comprising tokenization, stemming, POS tagging, and named entity recognition makes it possible to extract significant information from unstructured text. The SETMOKE API preprocessing module provides robust algorithms for these four tasks, which are considered indispensable for sentiment analysis and recommendation systems.
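As a rough illustration of the first step of such a pipeline, the sketch below splits Urdu text into sentences and tokens using regular expressions. It is a minimal example and not SETMOKE's tokenizer: the names `split_sentences` and `tokenize` are hypothetical, and it assumes that words are separated by whitespace, which real Urdu text does not always guarantee.

```python
import re

# Sentence terminators: Urdu full stop (U+06D4), Arabic question mark (U+061F), "!".
SENTENCE = re.compile(r"[^\u06D4\u061F!]+[\u06D4\u061F!]?")
# A token is a run of letters/digits (plus Arabic diacritics U+064B-U+0652),
# or a single punctuation character.
TOKEN = re.compile(r"[\w\u064B-\u0652]+|[^\w\s]")


def split_sentences(text):
    """Split text after each sentence terminator and drop empty pieces."""
    return [s.strip() for s in SENTENCE.findall(text) if s.strip()]


def tokenize(sentence):
    """Split a sentence into word tokens and punctuation tokens."""
    return TOKEN.findall(sentence)


text = "یہ کتاب ہے\u06D4 وہ قلم ہے\u06D4"
sentences = split_sentences(text)
tokens = tokenize(sentences[0])
print(sentences)
print(tokens)
```

Whitespace-based splitting is only a starting point; a production tokenizer for Urdu would also have to handle missing or extra spaces between words.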

Pre-Processing Interface


Urdu Text Classification

Text classification plays an important role in the development of diverse applications such as email spam detection, gender identification, product review analysis, news categorization, and fake news detection. The SETMOKE API text classification module employs ten filter-based feature selection methods, two feature representation approaches, and two machine learning classifiers to classify Urdu text into one of the predefined categories.
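To make the idea of filter-based feature selection followed by classification concrete, here is a small self-contained sketch. The text does not name the ten selection methods or the two classifiers, so the choices below (chi-square as the filter, raw term counts as the representation, multinomial Naive Bayes as the classifier) are assumptions made for illustration only, and the toy documents are invented.

```python
import math
from collections import Counter, defaultdict


def chi_square_scores(docs, labels):
    """Score each term by its highest chi-square statistic over all classes."""
    n = len(docs)
    class_sizes = Counter(labels)
    df = Counter()                      # documents containing each term
    df_by_class = defaultdict(Counter)  # same count, split by class
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            df_by_class[label][term] += 1
    scores = {}
    for term, n_term in df.items():
        best = 0.0
        for label, n_class in class_sizes.items():
            a = df_by_class[label][term]  # in class, contains term
            b = n_term - a                # other classes, contains term
            c = n_class - a               # in class, lacks term
            d = n - n_term - c            # other classes, lacks term
            denom = (a + b) * (c + d) * (a + c) * (b + d)
            if denom:
                best = max(best, n * (a * d - b * c) ** 2 / denom)
        scores[term] = best
    return scores


def train_nb(docs, labels, vocab):
    """Collect class priors and per-class term counts over the selected vocab."""
    priors = Counter(labels)
    counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        counts[label].update(t for t in tokens if t in vocab)
    return priors, counts, vocab


def predict_nb(model, tokens):
    """Return the class with the highest Laplace-smoothed log posterior."""
    priors, counts, vocab = model
    total = sum(priors.values())
    best_label, best_lp = None, -math.inf
    for label, n_class in priors.items():
        denom = sum(counts[label].values()) + len(vocab)
        lp = math.log(n_class / total)
        for t in tokens:
            if t in vocab:
                lp += math.log((counts[label][t] + 1) / denom)
        if lp > best_lp:
            best_label, best_lp = label, lp
    return best_label


# Invented toy corpus: two sports and two business documents.
docs = [
    ["کرکٹ", "میچ", "ٹیم"],
    ["ٹیم", "میچ", "جیت"],
    ["بازار", "قیمت", "تجارت"],
    ["قیمت", "بازار", "اضافہ"],
]
labels = ["sports", "sports", "business", "business"]
scores = chi_square_scores(docs, labels)
vocab = set(sorted(scores, key=scores.get, reverse=True)[:4])  # keep top 4 terms
model = train_nb(docs, labels, vocab)
print(predict_nb(model, docs[0][1:]))
```

The filter step runs before training and is independent of the classifier, which is what distinguishes filter methods from wrapper or embedded selection.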

Urdu Text Classification Methodology

Urdu Text Classification Interface

Urdu Text Summarization

Automatic text summarization is used extensively for widely studied languages (English, Chinese) to generate precise and fluent summaries. The SETMOKE API's text summarization module exploits five state-of-the-art extractive summarization methods to generate an effective summary of a single document.
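Extractive summarization selects existing sentences rather than generating new ones. The five methods used by SETMOKE are not named in the text, so the sketch below shows only one simple, well-known approach (scoring sentences by average word frequency) as an assumption-laden illustration. The function name `summarize` and its parameters are hypothetical.

```python
import re
from collections import Counter

# Sentence terminators: Urdu full stop (U+06D4), Arabic question mark (U+061F), "!".
SENTENCE = re.compile(r"[^\u06D4\u061F!]+[\u06D4\u061F!]?")


def summarize(text, n_sentences=1, stopwords=frozenset()):
    """Return the n highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in SENTENCE.findall(text) if s.strip()]
    words = [[w for w in re.findall(r"\w+", s) if w not in stopwords]
             for s in sentences]
    freq = Counter(w for ws in words for w in ws)

    def score(ws):
        # Average document-level frequency of the sentence's content words.
        return sum(freq[w] for w in ws) / len(ws) if ws else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(words[i]), reverse=True)
    chosen = sorted(ranked[:n_sentences])  # restore document order
    return [sentences[i] for i in chosen]


demo = "a b a\u06D4 c d\u06D4 a a b\u06D4"
print(summarize(demo, n_sentences=2))
```

Passing an Urdu stopword set through the hypothetical `stopwords` parameter keeps very frequent function words from dominating the scores.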

Urdu Text Summarization Methodology

Urdu Text Summarization Interface

Urdu Stemmer

Stemming helps alleviate data sparsity by converting inflected word forms to their base forms, thereby greatly reducing the dimensionality of the data. The SETMOKE Urdu stemmer works as follows:
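Independently of SETMOKE's own procedure, which is not described in the text, the general idea of stemming can be illustrated with a minimal sketch that combines a lookup table for irregular forms with suffix stripping. The table, the suffix list, and the minimum stem length below are all illustrative assumptions, not SETMOKE data.

```python
# Hypothetical exception table: inflected form -> base form.
LOOKUP = {"لڑکیاں": "لڑکی"}
# Two common Urdu plural endings, used here purely for illustration.
SUFFIXES = ["یں", "وں"]
MIN_STEM_LEN = 2  # never strip a word down to fewer characters than this


def stem(word, lookup=LOOKUP, suffixes=SUFFIXES):
    """Return the base form from the lookup table, else strip a known suffix."""
    if word in lookup:
        return lookup[word]
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)]
    return word


base = "کتاب"
forms = [base, base + SUFFIXES[0], base + SUFFIXES[1]]
# Three surface forms collapse to a single vocabulary entry.
print({stem(w) for w in forms})
```

Collapsing several surface forms into one entry is what reduces the vocabulary size and therefore the dimensionality of the feature space.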

Urdu Question Classification

Question classification refers to the process of classifying questions into predefined categories, and it plays an important role in the performance of information retrieval. It can greatly reduce the lookup span of a search engine, because the engine has to search for the answer to a query only within a certain domain and context. The SETMOKE question classification module classifies questions into three, six, and seven classes based on difficulty, general properties, and subjectivity, respectively.

Methodology of Urdu Question Classification

Urdu Question Classification Interface

Named Entity Recognition

Named entity recognition plays a vital role in the development of numerous applications based on speech recognition, information retrieval, and machine translation. The SETMOKE API's named entity recognition module can detect person name, organization, location, date, time, number, and designation entities in Urdu text.

Methodology of NER

NER Interface

Information Retrieval

Information retrieval (IR) refers to the process of finding and acquiring particular data or documents from large collections in response to a user query. IR has revolutionized search engines by providing robust methodologies for extracting the most relevant documents or information from unstructured text. The SETMOKE API IR module can index, store, and query documents based on relevance. It implements several similarity measures, such as BM25, TF-IDF, Frequency, Dfree, and PL2.
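Of the measures listed, BM25 is the most widely documented, so a compact sketch of it is given here. This is the common Okapi BM25 formulation with a non-negative IDF variant, not necessarily the exact variant SETMOKE implements. The class name `BM25Index`, the parameter defaults `k1=1.5` and `b=0.75`, and the toy documents are assumptions, and the index expects a non-empty list of pre-tokenized documents.

```python
import math
from collections import Counter


class BM25Index:
    """In-memory index over tokenized documents, ranked with Okapi BM25."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.tf = [Counter(d) for d in docs]          # term counts per document
        self.lengths = [len(d) for d in docs]
        self.avgdl = sum(self.lengths) / len(docs)    # average document length
        self.df = Counter(t for tf in self.tf for t in tf)  # document frequency

    def idf(self, term):
        n = self.df[term]
        return math.log((len(self.tf) - n + 0.5) / (n + 0.5) + 1)

    def score(self, query, i):
        tf = self.tf[i]
        norm = 1 - self.b + self.b * self.lengths[i] / self.avgdl
        return sum(
            self.idf(t) * tf[t] * (self.k1 + 1) / (tf[t] + self.k1 * norm)
            for t in query if tf[t]
        )

    def search(self, query, top_k=3):
        """Return document indices ordered from most to least relevant."""
        ranked = sorted(range(len(self.tf)),
                        key=lambda i: self.score(query, i), reverse=True)
        return ranked[:top_k]


# Invented toy collection of tokenized documents.
docs = [["کرکٹ", "میچ"], ["گندم", "فصل", "فصل"], ["بازار", "قیمت"]]
index = BM25Index(docs)
query = docs[1][1:2]  # a single query term that occurs only in document 1
print(index.search(query))
```

The `b` parameter controls how strongly long documents are penalized, and `k1` controls how quickly repeated occurrences of a term stop adding to the score.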

Methodology of IR

Information Retrieval Interface

Urdu Sentiment Analysis

Sentiment analysis, or opinion mining, is about identifying people's perceptions of a particular organization, person, place, product, or service. Customers' perceptions and feedback are usually acquired through focus groups, surveys, observation, and other fairly labor-intensive methods. SETMOKE sentiment analysis employs deep learning models and an existing Urdu sentiment dictionary comprising 4000 positive and 2000 negative expressions to classify sentiments correctly. We also extend the Urdu sentiment dictionary with 4000 neutral expressions in order to better classify sentiments expressed in Nastaleeq Urdu.
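The dictionary side of this approach can be illustrated with a very small lexicon-based scorer. It is only a sketch of how a sentiment dictionary can be applied, and it leaves out the deep learning models entirely. The two tiny word sets below are invented stand-ins for the real dictionary, and the function name `classify` is hypothetical.

```python
# Invented miniature lexicons standing in for the real sentiment dictionary.
POSITIVE = {"اچھا", "خوبصورت"}
NEGATIVE = {"برا", "خراب"}


def classify(tokens, positive=POSITIVE, negative=NEGATIVE):
    """Label a token list by comparing positive and negative lexicon hits."""
    score = sum(t in positive for t in tokens) - sum(t in negative for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"  # no hits, or an equal number of each


print(classify(sorted(POSITIVE) + ["ہے"]))
print(classify(sorted(NEGATIVE)))
print(classify([]))
```

A pure lexicon count ignores negation and context, which is one reason to combine a dictionary with learned models.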

Methodology of Urdu Sentiment Analysis

Urdu Sentiment Analysis Interface

Datasets and Results (As per Need)

Contribution of Datasets

Urdu Stemmer

This dataset has 4162 base words and 9743 words with possible variations of the base words. Accuracy is 97%.

Urdu Text Classification: Statistics of DSL and CLE datasets

Results of Urdu Text Classification

1- Results of Text Classification: DSL Dataset Results and CLE Dataset Results

2- Urdu Text Summarization Dataset and Statistics

3- Results of Urdu Text Summarization

4- Urdu Question Classification

Subjectivity Based E-Learning Dataset
Classes and Number of Questions

Biology: 274
Chemistry: 52
Computer: 166
Education: 124
Environment: 13
Pakistan Studies: 39
Physics: 137

Algorithm: CNN-based Model
Accuracy: N/A

Difficulty Based E-Learning Dataset
Classes and Number of Questions

Hard: 274
Medium: 52
Easy: 166

General Urdu Question Classification Dataset

Description: 801
Entity: 1004
Abbreviation: 35
Number: 408
Location: 193
Other: 76

Urdu Named Entity Recognition (Available Tags)


Algorithm : Bidirectional-LSTM

Accuracy : 93.35%

Dataset Statistics

Total Documents: 633
Total Sentences: 3232
Total Words: 109816
Total unique words : 12527
Words with no Tags: 88.75%
Location tags: 1.92%
Person tags: 3.44%
Time tags: 0.36%
Organization tags: 1.48%
Number tags: 2.09%
Designation tags: 0.66%
Date tags: 1.3%

Nastaliq Urdu Sentiment Classes and Training Words

Positive: 2633 Words
Negative: 4754 Words
Neutral: 2000 Words

Total 109 sentences for testing
Positive: 51 Sentences
Negative: 48 Sentences
Neutral: 10 Sentences

Accuracy: 97%

Roman Urdu Sentiment Analysis

Total Documents: 12099
Training: 8711
Validation : 2178
Testing : 1210
Accuracy : 71.4%
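The three counts above add up to the 12099 documents, and they are consistent with holding out 10% of the data for testing and then 20% of the remainder for validation. That reading is an inference from the numbers and is not stated by the source. The short check below reproduces the arithmetic; the function name `split_sizes` and the two fractions are assumptions.

```python
def split_sizes(total, test_frac=0.10, val_frac=0.20):
    """Hold out a test share first, then a validation share of what remains."""
    test = round(total * test_frac)
    remaining = total - test
    val = round(remaining * val_frac)
    train = remaining - val
    return train, val, test


train, val, test = split_sizes(12099)
print(train, val, test)
```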

Urdu Information Retrieval

We used 500 documents from different classes such as agriculture, business, entertainment, news, and sports. We used 30 queries relevant to the document classes to build a gold-standard dataset.

Applications of SETMOKE API:

1- SETMOKE Spam Complaints Filtering System

2- Criminal Case Log Aggregation & Summarization

3- Ontology based Urdu Car Advertisement Search Engine

Posted on: May 12, 2020