Named Entity Recognition (NER) on Apple Inc.'s Wikipedia Article
Wikipedia stands as a collaborative online encyclopedia, offering an extensive repository of knowledge across a myriad of topics. It encompasses a wealth of information, making it an ideal resource for exploration and analysis in Natural Language Processing (NLP).
Named Entity Recognition (NER) is a pivotal task in NLP that involves identifying and categorizing entities like names of people, organizations, locations, etc., within text data. Analyzing a Wikipedia article through NER not only showcases how NLP techniques can unveil valuable insights from extensive text data but also provides a practical understanding of how NER can be applied to real-world textual content, aiding in information extraction and analysis.
In this tutorial, we'll leverage Python's NLTK library and the Wikipedia API to explore NER in action on any Wikipedia article. I will also provide exercises for you to practice.
If you haven't already, use the following code snippet to install the Wikipedia API package and NLTK.
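A minimal install command, assuming the wikipedia package is the Wikipedia API client used here (the original setup may use a different client such as wikipedia-api):

```
pip install wikipedia nltk
```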
The figure below displays the design of a straightforward information extraction system. First, the raw text undergoes sentence segmentation via a sentence tokenizer, followed by word segmentation using a word tokenizer for each sentence. Subsequently, part-of-speech tagging is applied to every sentence, which plays a crucial role in the subsequent step—named entity detection. This stage involves identifying possible mentions of significant entities within each sentence. Finally, the system employs relation detection to uncover probable relationships between various entities in the text.
First, we utilize the Wikipedia API to retrieve an article about a specific topic (in any language supported by Wikipedia), in this case "Apple Inc.". We then tokenize the text into sentences and further tokenize each sentence into words.
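A sketch of this step, assuming the wikipedia package as the API client and NLTK's punkt tokenizer (variable names are illustrative):

```python
import wikipedia
import nltk

nltk.download("punkt")  # tokenizer models used by sent_tokenize / word_tokenize

# Retrieve the article text; the language can be switched, e.g. wikipedia.set_lang("fr")
wikipedia.set_lang("en")
article = wikipedia.page("Apple Inc.").content

# Sentence segmentation, then word segmentation for each sentence
sentences = nltk.sent_tokenize(article)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]

# Print the initial three sentences
for sentence in sentences[:3]:
    print(sentence)
```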
Exercise 1:
Choose a different topic of interest and retrieve its Wikipedia article.
Modify the code to print the first five sentences instead of the initial three.
Next, we perform Part-of-Speech (POS) tagging on the tokenized sentences to determine the grammatical structure and identify parts of speech for each word.
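Continuing from the tokenized sentences above, a minimal sketch (the tagger model is NLTK's standard perceptron tagger):

```python
nltk.download("averaged_perceptron_tagger")  # model used by nltk.pos_tag

# Tag every word in every sentence with its part of speech
pos_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]

# Inspect the first tagged sentence, e.g. [('Apple', 'NNP'), ('Inc.', 'NNP'), ...]
print(pos_sentences[0])
```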
Exercise 2: Print the POS tags for the first ten sentences
The primary method we will employ for named entity recognition is chunking. This process segments and labels multi-token sequences, as demonstrated in the figure below. The smaller boxes represent word-level tokenization and part-of-speech tagging, while the larger boxes depict higher-level chunking. Each of these larger boxes is called a chunk. Similar to tokenization, the segments generated by a chunker do not overlap within the source text.
We apply NLTK's ne_chunk_sents function to extract named entities. These will then be categorized and counted using a defaultdict. Be aware that you may also need to download NLTK's maxent_ne_chunker and words packages to be able to run the ne_chunk_sents function. You can download them as:
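For example, from a Python session:

```python
import nltk

nltk.download("maxent_ne_chunker")  # pretrained named entity chunker
nltk.download("words")              # word corpus the chunker relies on
```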
We can now extract named entities:
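A sketch of the extraction and counting step, using the pos_sentences built earlier (variable names such as ner_categories are illustrative):

```python
from collections import defaultdict

# Chunk each POS-tagged sentence; binary=False keeps the entity type labels
chunked_sentences = nltk.ne_chunk_sents(pos_sentences, binary=False)

# Count how many chunks fall into each category (PERSON, ORGANIZATION, GPE, ...)
ner_categories = defaultdict(int)
for tree in chunked_sentences:
    for chunk in tree:
        # Entity chunks are subtrees with a label; ordinary tokens are (word, tag) tuples
        if hasattr(chunk, "label"):
            ner_categories[chunk.label()] += 1

print(dict(ner_categories))
```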
When calling ne_chunk_sents(pos_sentences, binary=False), a few things are happening:

pos_sentences: This is the list of POS-tagged sentences; each sentence is a list of (word, tag) tuples such as [('Apple', 'NNP'), ('Inc.', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('company', 'NN')], with each word already tagged for its part of speech. For instance, words might be labeled as nouns (NN), verbs (VB), adjectives (JJ), etc.
binary (optional parameter): When binary is set to True, the function performs binary named entity chunking, where it labels named entities simply as "NE" without specifying the entity type (e.g., 'PERSON', 'ORGANIZATION', etc.). If binary is set to False (the default), it includes the entity type information (e.g., 'GPE' for geopolitical entity, 'PERSON', 'ORGANIZATION', etc.).
In essence, the ne_chunk_sents() function applies a parsing algorithm to identify sequences of words that represent named entities, classifies these sequences into predefined categories, and marks them with specific labels according to their entity types (if not using binary chunking).
For instance, given the sentence "Apple Inc. is a technology company," the ne_chunk_sents() function might identify "Apple Inc." as a single named entity. With binary set to True, it would simply be labeled as NE; with binary set to False, the chunker could further specify that "Apple Inc." is an ORGANIZATION entity.
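As a small illustration of the difference, here is a sketch using nltk.ne_chunk on a single tagged sentence (the exact chunks depend on the pretrained model):

```python
tagged = nltk.pos_tag(nltk.word_tokenize("Apple Inc. is a technology company."))

# binary=True: any detected entity is labeled simply as NE
print(nltk.ne_chunk(tagged, binary=True))

# binary=False (the default): detected entities carry a type label
# such as ORGANIZATION, PERSON, or GPE
print(nltk.ne_chunk(tagged, binary=False))
```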
The hasattr() function checks whether an object has a particular attribute or not. It takes two arguments: the object and the attribute name as a string. Therefore, hasattr(chunk, 'label') is used to check if the current element being processed within the NER results has a 'label' attribute. If it does, this indicates that the element represents a named entity, thereby enabling further processing or counting of the named entities.
The defaultdict feature enables us to set up a dictionary that automatically assigns a default value to keys that don't exist. By passing int as the argument, we ensure that any absent keys receive a default value of 0. This functionality proves especially useful for storing the entity counts in this particular exercise.
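A tiny, self-contained illustration of both pieces (the keys and values here are made up):

```python
from collections import defaultdict

import nltk

counts = defaultdict(int)       # missing keys automatically start at 0
counts["ORGANIZATION"] += 1     # no KeyError even though the key was absent
print(counts["PERSON"])         # prints 0, the default value

# hasattr() separates entity subtrees (which have a label) from plain (word, tag) tuples
print(hasattr(("Apple", "NNP"), "label"))               # False: an ordinary token
print(hasattr(nltk.Tree("ORGANIZATION", []), "label"))  # True: an entity chunk
```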
Exercise 3:
Identify and print the named entities (NEs) detected in the article.
Count and display the number of named entities for different categories like persons, organizations, and locations.
To visualize the distribution of detected NER categories, we create a pie chart using Matplotlib.
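A sketch of the chart, assuming the ner_categories counts built earlier (figure size, start angle, and title are illustrative choices):

```python
import matplotlib.pyplot as plt

labels = list(ner_categories.keys())
values = [ner_categories[label] for label in labels]

# Share of each named entity category found in the article
plt.figure(figsize=(8, 8))
plt.pie(values, labels=labels, autopct="%1.1f%%", startangle=140)
plt.title("Named Entity Categories in the Apple Inc. Article")
plt.axis("equal")  # keep the pie circular
plt.show()
```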
Exercise 4:
Customize the pie chart to display percentages with no decimal points.
Enhance the chart by providing different colors for each NER category.
Understanding NER is pivotal in extracting meaningful insights from text. This tutorial showcased how to retrieve text data, tokenize, perform POS tagging, extract named entities, and visualize their distribution. These steps lay a strong foundation for diving deeper into NLP tasks.
By following these steps and exercises, you can gain a better understanding of NER and its application in NLP.
Feel free to modify and expand on these exercises to deepen your understanding and explore more features of the NLTK library.
Code Repository: https://github.com/sedarsahin/NLP/tree/master/NameEntityRecognition