The Natural Language Toolkit (NLTK) was originally created in 2001 as part of a computational linguistics course in the Department of Computer and Information Science at the University of Pennsylvania.

Language processing tasks and corresponding NLTK modules with examples of functionality

Features of NLTK
-Sentence and word tokenization
-Part-of-speech tagging
-Chunking and named entity recognition
-Text classification

NLTK was designed with four primary goals:
Simplicity - To provide an intuitive framework along with substantial building blocks.

Consistency - To provide a uniform framework with consistent interfaces and data structures, and easily guessable method names.

Extensibility - To easily accommodate new software, including alternative implementations and competing approaches to the same task.

Modularity - To provide components that can be used independently, so we don't have to understand the whole toolkit in order to use a part of it.

-Jupyter Notebook via Anaconda Navigator
-Installing the NLTK module using the pip command (pip install nltk)
You can watch a tutorial for installation purposes.

Let’s get started
After installing NLTK, start up the Python interpreter.

import nltk
nltk.download()  # opens the NLTK downloader, so you don't face missing-resource errors later

A downloader window will appear on your screen.
Select book from the Identifier column and click on the Download button.
As soon as you run the following code, it loads the text of several books:

from nltk.book import *

To find out about these texts, we just have to enter their names.
E.g. try entering text1

Searching Text
A concordance view shows us every occurrence of a given word, together with the surrounding context.

text1.concordance("any_word_which_comes_to_your_mind")

Similar Text
We can find words used in similar contexts by appending the term similar to the name of the text in question, then inserting the relevant word in parentheses.

common_contexts allows us to examine just the contexts that are shared by two or more words.
We have to enclose these words in square brackets as well as parentheses, and separate them with a comma.


concordance(token) provides you with the context in which a token is used. similar(token) provides you with other words that appear in similar contexts.

To illustrate, here is a more general description that approximates their functionality.

To generate some random text, we can use the generate() method.
Although the text is random, it reuses common words and phrases from the source text and gives us a sense of its style and content.

Counting Vocabulary-
Find the length of a text from start to finish, in terms of the words and punctuation symbols that appear.

A token is the technical name for a sequence of characters that we want to treat as a group.

We obtain a sorted list of vocabulary items, beginning with various punctuation symbols and continuing with words starting with A.
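The counting above can be sketched with plain Python on a toy token list (the list below is a made-up stand-in for an NLTK text object such as text3):

```python
# Toy token list standing in for an NLTK text (made-up data, not the real text3)
tokens = ["In", "the", "beginning", "God", "created", "the",
          "heaven", "and", "the", "earth", "."]

print(len(tokens))          # total length: 11 tokens, punctuation included
print(sorted(set(tokens)))  # vocabulary: punctuation first, then capitalized words
print(len(set(tokens)))     # number of distinct word types: 9
```

The sorted order comes from Python's default string ordering, where punctuation and uppercase letters sort before lowercase letters.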

from __future__ import division  # only needed on Python 2; Python 3 divides this way by default
len(text3)/len(set(text3))  # measure of the lexical richness of the text
A word type is the form or spelling of the word independently of its specific occurrences in a text - that is, the word considered as a unique item of vocabulary.
Although text3 contains tens of thousands of tokens, it has only 2,789 distinct word types.

We can also get how many times a particular word occurs in a text using count("your_word").

Let's define a function for lexical diversity, and one for percentages:

def lexical_diversity(text):
    return len(text) / len(set(text))

def percentage(count, total):
    return 100 * count / total

Now to use the above functions.
They work the same way as functions in C or any other programming language.

percentage(text4.count("hello"), len(text4))
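With a toy token list standing in for a real NLTK text, the two helpers behave like this (the list is invented for illustration):

```python
def lexical_diversity(text):
    # how many tokens there are per distinct word type
    return len(text) / len(set(text))

def percentage(count, total):
    return 100 * count / total

tokens = ["a", "rose", "is", "a", "rose", "is", "a", "rose"]
print(lexical_diversity(tokens))  # 8 tokens / 3 types
print(percentage(tokens.count("rose"), len(tokens)))  # 3 of 8 tokens -> 37.5
```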

Text Slicing and Indexing
text4[any_number]  # returns the word present at that index
text4.index("any_word")  # returns the index of the first occurrence of the word
text5[124:1244]  # slicing works the same way here too
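Since an NLTK text behaves like a Python list of tokens, indexing and slicing can be tried out on a plain list (toy data, not the real text4):

```python
tokens = ["Fellow", "citizens", "of", "the", "Senate", "and", "House"]

print(tokens[4])           # word at index 4 -> 'Senate'
print(tokens.index("of"))  # index of the first occurrence -> 2
print(tokens[2:5])         # slice -> ['of', 'the', 'Senate']
```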

Understanding the set function in Python
A set is created by placing all the elements inside curly braces {}.
The elements can be of different data types, for example integer, float, tuple, and string.
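A quick sketch of a set literal with mixed (hashable) element types:

```python
# curly braces with mixed, hashable element types
mixed_set = {1, 2.5, "hello", (3, 4)}
print(type(mixed_set))  # <class 'set'>
print(len(mixed_set))   # 4
```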


# creating an empty set
empty_set = {}  # careful: this actually creates an empty dictionary
print(type(empty_set))  # <class 'dict'>
empty_set = set()  # this creates an empty set
print(type(empty_set))  # <class 'set'>
A set cannot contain lists, because lists are not hashable.
set_with_list = {[2, 3, 4]}  # TypeError: unhashable type: 'list'

Duplicates are avoided in all cases.
We can add single elements using the add() method and multiple elements using the update() method.
The update() method can take tuples, lists, strings, or other sets as its argument.
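A small sketch of add() versus update():

```python
s = {1, 2}
s.add(3)                     # add() inserts a single element
s.update([3, 4], (5,), {6})  # update() accepts one or more iterables; duplicates are dropped
print(sorted(s))             # [1, 2, 3, 4, 5, 6]
```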

Discard function
A particular item can be removed from a set using the discard() or remove() method.

But it is better to use discard(), because remove() throws an error when you try to remove an element that is not present in the set.

But let us say you still insist upon using the remove() method.

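A minimal sketch of the difference between discard() and remove():

```python
s = {1, 2, 3}

s.discard(10)     # element absent: no error, the set is left unchanged
print(sorted(s))  # [1, 2, 3]

try:
    s.remove(10)  # element absent: remove() raises KeyError
except KeyError:
    print("remove(10) raised KeyError")

s.remove(3)       # element present: removed normally
print(sorted(s))  # [1, 2]
```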

We can use the following operations to mutate a set in place:

update (|=) - Update the set by adding elements from an iterable/another set.
intersection_update (&=) - Update the set by keeping only the elements found in both it and an iterable/another set.
difference_update (-=) - Update the set by removing elements found in an iterable/another set.
symmetric_difference_update (^=) - Update the set by keeping only the elements found in either set, but not in both.
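The four in-place operators can be traced on a small example:

```python
s = {1, 2, 3}
s |= {3, 4}        # update: add elements              -> {1, 2, 3, 4}
s &= {2, 3, 4, 5}  # intersection_update: keep shared  -> {2, 3, 4}
s -= {4}           # difference_update: remove         -> {2, 3}
s ^= {3, 5}        # symmetric_difference_update       -> {2, 5}
print(sorted(s))   # [2, 5]
```

Note that the augmented operators require a set on the right-hand side, while the corresponding methods accept any iterable.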

So how to tokenize a sentence?

import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')  # the sentence tokenizer models, needed once
sent_tokenize("Hello world. This is my first program. We will learn how to use nltk.")

# Make sure there is a space after each full stop.

Output: ['Hello world.', 'This is my first program.', 'We will learn how to use nltk.']

Word Tokenization

from nltk.tokenize import word_tokenize
word_tokenize("this world is beautiful .")

Output: ['this', 'world', 'is', 'beautiful', '.']
Note: It treats the full stop as a word. The same holds for any other symbol.

from nltk.tokenize import word_tokenize
word_tokenize("What's up?")

Output: ['What', "'s", 'up', '?']
Note that it produced "'s", but we want the punctuation to be separate and treated as a word.
So we will use wordpunct_tokenize:

from nltk.tokenize import wordpunct_tokenize
wordpunct_tokenize("What's up?")

Output: ['What', "'", 's', 'up', '?']
Here the apostrophe is treated as a separate token, as we wanted.