Named Entity Recognition (NER)

on Apple Inc.'s Wikipedia Article

Wikipedia stands as a collaborative online encyclopedia, offering an extensive repository of knowledge across a myriad of topics. It encompasses a wealth of information, making it an ideal resource for exploration and analysis in Natural Language Processing (NLP).

Named Entity Recognition (NER) is a pivotal task in NLP that involves identifying and categorizing entities like names of people, organizations, locations, etc., within text data. Analyzing a Wikipedia article through NER not only showcases how NLP techniques can unveil valuable insights from extensive text data but also provides a practical understanding of how NER can be applied to real-world textual content, aiding in information extraction and analysis.

In this tutorial, we'll leverage Python's NLTK library and the Wikipedia API to explore NER in action on any Wikipedia article. I will also provide exercises for you to practice.

If you haven't already, use the following code snippet to install the Wikipedia API and NLTK.

pip install wikipedia-api nltk

Methodology and Workflow

The figure below displays the design of a straightforward information extraction system. First, the raw text undergoes sentence segmentation via a sentence tokenizer, followed by word segmentation using a word tokenizer for each sentence. Subsequently, part-of-speech tagging is applied to every sentence, which plays a crucial role in the subsequent step—named entity detection. This stage involves identifying possible mentions of significant entities within each sentence. Finally, the system employs relation detection to uncover probable relationships between various entities in the text.
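The snippet below is a minimal sketch of this pipeline on a single hand-written sentence, assuming NLTK and its pretrained models (punkt, averaged_perceptron_tagger, maxent_ne_chunker, and words) are already installed; the rest of the tutorial builds these same steps out on a full article.

import nltk

# A minimal end-to-end sketch of the pipeline on one raw sentence
raw_text = "Apple Inc. is headquartered in Cupertino, California."
sentences = nltk.sent_tokenize(raw_text)             # 1. sentence segmentation
tokens = [nltk.word_tokenize(s) for s in sentences]  # 2. word segmentation
tagged = [nltk.pos_tag(t) for t in tokens]           # 3. part-of-speech tagging
chunked = nltk.ne_chunk(tagged[0])                   # 4. named entity detection
print(chunked)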

Step 1: Retrieving and Preprocessing (Tokenizing) Data

First, we utilize the Wikipedia API to retrieve an article about a specific topic (in any language supported by Wikipedia), in this case, "Apple Inc.". We tokenize the text into sentences and then further tokenize each sentence into words.

# Import libraries
import nltk
from wikipediaapi import Wikipedia
from nltk.tokenize import word_tokenize, sent_tokenize

# Download the sentence/word tokenizer models (only needed once)
nltk.download('punkt')

# Wikipedia API to retrieve an article about "Apple Inc."
# The Wikipedia API requires user agent info; you can use a dummy user agent:
user_agent = 'MyProjectName (merlin@example.com)'
wiki_wiki = Wikipedia(user_agent=user_agent, language='en')
article = wiki_wiki.page('Apple Inc.')

# Tokenization
# Tokenize the text into sentences
sentences = sent_tokenize(article.text)  # use the 'text' attribute to access the content
# Display the first 3 sentences
print(sentences[:3])
# Further tokenize each sentence into words
token_sentences = [word_tokenize(sent) for sent in sentences]
# Display the tokens of the 1st sentence
print(token_sentences[0])

Exercise 1:

  • Choose a different topic of interest and retrieve its Wikipedia article.

  • Modify the code to print the first five sentences instead of the initial three.

Solution 1
user_agent = 'MyProjectName (merlin@example.com)'
wiki_wiki = Wikipedia(user_agent=user_agent, language='en')
article = wiki_wiki.page('Tesla Inc.')

# Tokenization
# Tokenize the text into sentences
sentences = sent_tokenize(article.text) # use 'text' attribute to access content
# display first 5 sentences
print(sentences[:5]) 

Step 2: Part-of-Speech Tagging

Next, we perform Part-of-Speech (POS) tagging on the tokenized sentences to determine the grammatical structure and identify parts of speech for each word.

# Perform Part-of-Speech (POS) tagging
import nltk
from nltk.tag import pos_tag

# Download the POS tagger model (only needed once)
nltk.download('averaged_perceptron_tagger')

# Tag each tokenized sentence into parts of speech
pos_sentences = [pos_tag(sent) for sent in token_sentences]

# Print the POS tags for the first sentence
print(pos_sentences[0])

Exercise 2: Print the POS tags for the first ten sentences

Solution 2
# Print the POS tags for the first ten sentences
for sent in pos_sentences[:10]:
    print(sent)

Step 3: Named Entity Recognition

The primary method we will employ for named entity recognition is chunking. This process segments and labels multi-token sequences, as demonstrated in the figure below. The smaller boxes represent word-level tokenization and part-of-speech tagging, while the larger boxes depict higher-level chunking. Each of these larger boxes is called a chunk. Similar to tokenization, the segments generated by a chunker do not overlap within the source text.
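To make the chunk structure concrete, here is a small example; the exact labels depend on NLTK's pretrained chunker, so treat the output shown in the comment as illustrative:

from nltk import pos_tag, word_tokenize, ne_chunk

# (word, tag) tuples form the lower level; Tree chunks such as
# (PERSON ...) or (GPE ...) form the higher-level, non-overlapping segments
tree = ne_chunk(pos_tag(word_tokenize("Tim Cook leads Apple in California.")))
tree.pprint()
# Possible output:
# (S (PERSON Tim/NNP Cook/NNP) leads/VBZ (GPE Apple/NNP) in/IN (GPE California/NNP) ./.)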

We apply NLTK's ne_chunk_sents function to extract named entities. These will then be categorized and counted using a defaultdict. Be aware that you may also need to download NLTK's maxent_ne_chunker and words packages to be able to run the ne_chunk_sents function. You can download them as follows:

import nltk
nltk.download('maxent_ne_chunker')
nltk.download('words')

We can now extract named entities:

# Named Entity Recognition (NER)
from nltk.chunk import ne_chunk_sents
from collections import defaultdict

# Create the named entity chunks
# ne_chunk_sents returns a generator, so wrap it in list() to be able
# to iterate over the chunked sentences more than once
chunked_sentences = list(ne_chunk_sents(pos_sentences))

# Create a defaultdict for NER categories
ner_categories = defaultdict(int)

# Count and categorize named entities
for sent in chunked_sentences:
    for chunk in sent:
        if hasattr(chunk, 'label'):
            ner_categories[chunk.label()] += 1

# Display the category counts
print(ner_categories)

When calling ne_chunk_sents(pos_sentences, binary=False), a few things are happening:

  1. pos_sentences: This is the list of POS-tagged sentences, where each sentence is a list of (word, tag) pairs such as [('Apple', 'NNP'), ('Inc.', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('company', 'NN')]. For instance, words might be labeled as nouns (NN), verbs (VB), adjectives (JJ), etc.

  2. binary(optional parameter): When binary is set to True, the function performs binary named entity chunking, where it labels named entities simply as "NE" without specifying the entity type (e.g., 'PERSON', 'ORGANIZATION', etc.). If binary is set to False (default), it includes the entity type information (e.g., 'GPE' for geopolitical entity, 'PERSON', 'ORGANIZATION', etc.).

In essence, the ne_chunk_sents() function applies a parsing algorithm to identify sequences of words that represent named entities, classifies these sequences into predefined categories, and marks them with specific labels according to their entity types (if not using binary chunking).

For instance, given the sentence "Apple Inc. is a technology company," the ne_chunk_sents() function might identify "Apple Inc." as a single named entity. With binary set to True, it would simply be labeled NE; with binary set to False (the default), it would further be specified as an ORGANIZATION entity.
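You can see the difference directly by chunking that same sentence with both settings (a sketch, assuming the chunker models above have been downloaded):

from nltk import pos_tag, word_tokenize, ne_chunk

tagged = pos_tag(word_tokenize("Apple Inc. is a technology company."))

# Binary chunking: entities are labeled generically as NE
print(ne_chunk(tagged, binary=True))
# Typed chunking (the default): entities get types such as ORGANIZATION or GPE
print(ne_chunk(tagged, binary=False))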

The hasattr() function checks whether an object has a particular attribute or not. It takes two arguments: the object and the attribute name as a string. Therefore, hasattr(chunk, 'label') is used to check if the current element being processed within the NER results has a 'label' attribute. If it does, this indicates that the element represents a named entity, thereby enabling further processing or counting of the named entities.

The defaultdict feature enables us to set up a dictionary that automatically assigns a default value to keys that don't exist. By utilizing the 'int' argument, we ensure that any absent keys receive a default value of 0. This functionality proves especially useful for storing entity counts in this particular exercise.
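Here is a tiny self-contained illustration of how hasattr() and defaultdict(int) work together; the ORGANIZATION chunk below is hand-built to mimic what the chunker produces:

from collections import defaultdict
from nltk.tree import Tree

# A hand-built chunked sentence: one entity chunk plus plain (word, tag) tuples
sentence = [Tree('ORGANIZATION', [('Apple', 'NNP'), ('Inc.', 'NNP')]),
            ('is', 'VBZ'), ('a', 'DT'), ('company', 'NN')]

counts = defaultdict(int)          # missing keys start at 0 automatically
for chunk in sentence:
    if hasattr(chunk, 'label'):    # only Tree chunks have a label() method
        counts[chunk.label()] += 1

print(dict(counts))  # {'ORGANIZATION': 1}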

Exercise 3:

  • Identify and print the named entities (NEs) detected in the article.

  • Count and display the number of named entities for different categories like persons, organizations, and locations.

Solution 3
# Identify and print the named entities detected
for sent in chunked_sentences:
    for chunk in sent:
        if hasattr(chunk, 'label'):
            print(chunk)

# Count and display the number of named entities for different categories
print(ner_categories)

Step 4: Visualizing NER Categories

To visualize the distribution of detected NER categories, we create a pie chart using Matplotlib.

# Visualize the NER category distribution with a pie chart
import matplotlib.pyplot as plt

# Create labels and values for the pie chart
labels = list(ner_categories.keys())
values = list(ner_categories.values())

# Create the pie chart
plt.pie(values, labels=labels, autopct='%1.1f%%', startangle=140)

# Display the chart
plt.show()

Exercise 4:

  • Customize the pie chart to display percentages with no decimal points.

  • Enhance the chart by providing different colors for each NER category.

Solution 4
# Customize the pie chart
# Display percentages with no decimal points and give each category a distinct color
colors = plt.cm.tab10.colors[:len(labels)]
plt.pie(values, labels=labels, colors=colors, autopct='%1.0f%%', startangle=140)
plt.show()

Conclusion

Understanding NER is pivotal in extracting meaningful insights from text. This tutorial showcased how to retrieve text data, tokenize it, perform POS tagging, extract named entities, and visualize their distribution. These steps lay a strong foundation for diving deeper into NLP tasks.

By following these steps and exercises, you can gain a better understanding of NER and its application in NLP.

Feel free to modify and expand on these exercises to deepen your understanding and explore more features of the NLTK library.

Code Repository: https://github.com/sedarsahin/NLP/tree/master/NameEntityRecognition
