In this blog post, we will talk about Natural Language Processing, its applications, and the building-block concepts behind it.
What is Natural Language Processing (NLP)?
Natural Language Processing is a field of artificial intelligence in which human language is processed and made understandable to computers. For computers to work with human language, the text must first be represented numerically. Many of the applications we use every day are built with Natural Language Processing.
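To make the idea of a numerical representation concrete, here is a minimal sketch of one simple scheme, a bag-of-words count vector. The function names build_vocabulary and vectorize are made up for this illustration, and splitting on whitespace is a simplifying assumption; real NLP libraries use more careful tokenization and richer representations.

```python
from collections import Counter


def build_vocabulary(texts):
    """Map every distinct lowercase word to a column index (alphabetical order)."""
    words = sorted({word for text in texts for word in text.lower().split()})
    return {word: index for index, word in enumerate(words)}


def vectorize(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    # Dicts keep insertion order, so the columns follow the vocabulary indices.
    return [counts[word] for word in vocabulary]


texts = ["the cat sat", "the dog sat down"]
vocabulary = build_vocabulary(texts)
vectors = [vectorize(text, vocabulary) for text in texts]

print(vocabulary)  # {'cat': 0, 'dog': 1, 'down': 2, 'sat': 3, 'the': 4}
print(vectors)     # [[1, 0, 0, 1, 1], [0, 1, 1, 1, 1]]
```

Each sentence becomes a list of numbers of the same length, which is the kind of input a model can work with.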
Examples of NLP Applications
- Text Classification and Categorization
- Summarization
- Named Entity Recognition (NER)
- Part-of-Speech Tagging
- Semantic Parsing and Question Answering
- Paraphrase Detection
- Language Generation
- Machine Translation
- Speech Recognition
- Character Recognition
- Spell Checking
Objects in NLP
When building a general-purpose library for NLP applications, one should think about which concepts of the domain can be encapsulated as objects, or building blocks. These objects are simple building blocks that combine into larger ones, forming a hierarchy that is easier to work with. For our application domain, we can come up with four objects arranged in a hierarchy. These are:
- Corpus
- Document
- Sentence
- Token
Let’s examine these objects through an example.
- Corpus: A website with travel guide blog posts.
- Document: A blog post on the travel guide website.
- Sentence: A sentence in a blog post on the travel guide website.
- Token: A word, punctuation mark, or emoji in a sentence.
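The hierarchy above can be sketched with plain Python dataclasses. This is only an illustration of the nesting, not SadedeGel's implementation: the class layout, the make_document helper, and the naive regular expressions for splitting sentences and tokens are all assumptions made for this example.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Token:
    text: str


@dataclass
class Sentence:
    text: str
    tokens: List[Token] = field(default_factory=list)


@dataclass
class Document:
    text: str
    sentences: List[Sentence] = field(default_factory=list)


@dataclass
class Corpus:
    documents: List[Document] = field(default_factory=list)


def make_document(text: str) -> Document:
    """Build a Document with naive rules (hypothetical helper)."""
    # Split after '.', '!' or '?' when followed by whitespace.
    sentence_texts = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    sentences = [
        # A token is a run of word characters or a single non-space symbol.
        Sentence(s, [Token(t) for t in re.findall(r"\w+|[^\w\s]", s)])
        for s in sentence_texts
    ]
    return Document(text, sentences)


corpus = Corpus([
    make_document("Visit Izmir in spring. The food is great!"),
    make_document("Pack light."),
])

first_sentence = corpus.documents[0].sentences[0]
print([token.text for token in first_sentence.tokens])
# ['Visit', 'Izmir', 'in', 'spring', '.']
```

Walking from corpus to documents to sentences to tokens mirrors the travel guide example: the website, a blog post, a sentence in that post, and the words and punctuation in that sentence.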
Objects in SadedeGel
SadedeGel uses this structure to represent a corpus and its elements. Corpora are provided as built-in datasets of the library. Some of them are publicly available, including TS Corpus, which contains more than 300K news documents in raw and tokenized format along with their category classes.
Doc
To trigger the SadedeGel NLP pipeline, initialize a Doc instance with a document string. In the example below, we loaded one of the datasets in the SadedeGel library with load_raw_corpus and turned a document from it into a Doc object. Then we viewed the text.
Example:
Output:
Sentence
A Sentence object represents a sentence and holds a list of the Tokens in the sentence. In the example below, we loaded one of the datasets in the SadedeGel library with load_raw_corpus and turned a document from it into a Doc object. Then we accessed the Sentence objects with the built-in list function.
Example:
Output:
Token
A Token represents a word, punctuation mark, emoji, or space. In the SadedeGel library, the Tokens of a sentence can be obtained as follows.
Example:
Output:
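Independently of SadedeGel's own tokenizer, the four kinds of token mentioned above can be illustrated with a small standalone sketch. The tokenize function and its category labels are hypothetical, and the rules are deliberately simple: emojis are not detected specifically, they simply fall into a generic "symbol" category.

```python
import re
import unicodedata


def tokenize(text):
    """Split text into (token, kind) pairs; kind is word, space, punctuation, or symbol."""
    tokens = []
    # Runs of word characters, runs of whitespace, or any other single character.
    for piece in re.findall(r"\w+|\s+|[^\w\s]", text):
        if piece.isspace():
            kind = "space"
        elif re.fullmatch(r"\w+", piece):
            kind = "word"
        elif unicodedata.category(piece).startswith("P"):
            kind = "punctuation"
        else:
            kind = "symbol"  # emojis and other non-punctuation symbols land here
        tokens.append((piece, kind))
    return tokens


sample = "Hi, there \U0001F60A"  # ends with a smiling-face emoji
for token, kind in tokenize(sample):
    print(repr(token), kind)
```

Keeping whitespace as its own token means the original string can be rebuilt by joining the token texts, which is convenient when tokens need to be mapped back to the raw document.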
Thanks for reading this blog post. If you want to learn more about SadedeGel, you can follow our other blog posts and visit the SadedeGel GitHub page.