Named Entity Recognition (NER)

on Apple Inc.'s Wikipedia Article


Wikipedia stands as a collaborative online encyclopedia, offering an extensive repository of knowledge across a myriad of topics. It encompasses a wealth of information, making it an ideal resource for exploration and analysis in Natural Language Processing (NLP).

Named Entity Recognition (NER) is a pivotal task in NLP that involves identifying and categorizing entities like names of people, organizations, locations, etc., within text data. Analyzing a Wikipedia article through NER not only showcases how NLP techniques can unveil valuable insights from extensive text data but also provides a practical understanding of how NER can be applied to real-world textual content, aiding in information extraction and analysis.

In this tutorial, we'll leverage Python's Wikipedia API and NLTK to explore NER in action for any Wikipedia article. I will also provide exercises for you to practice. If you haven't already, use the following code snippet to install the Wikipedia API and NLTK:

pip install wikipedia-api nltk
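
Depending on your environment, the tokenizers and tagger used below also rely on NLTK data packages that pip does not install. A minimal setup sketch, assuming the standard NLTK resource names:

import nltk
nltk.download('punkt')                       # sentence and word tokenizer models
nltk.download('averaged_perceptron_tagger')  # tagger used by pos_tag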

Methodology and Workflow

The figure below displays the design of a straightforward information extraction system. First, the raw text undergoes sentence segmentation via a sentence tokenizer, followed by word segmentation using a word tokenizer for each sentence. Part-of-speech tagging is then applied to every sentence, which plays a crucial role in the next step, named entity detection. This stage involves identifying possible mentions of significant entities within each sentence. Finally, the system employs relation detection to uncover probable relationships between the various entities in the text.

The primary method we will employ for named entity recognition is chunking. Chunking segments and labels multi-token sequences: word-level tokenization and part-of-speech tagging operate on individual tokens, while chunking groups related tokens into higher-level units, each of which is called a chunk. As with tokenization, the segments produced by a chunker do not overlap within the source text.
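
To make this workflow concrete before walking through it step by step, here is a minimal end-to-end sketch using NLTK's top-level helpers (the sample sentence is made up, and the NLTK data packages mentioned above are assumed to be downloaded):

# Minimal sketch: raw text -> sentences -> tokens -> POS tags -> NE chunks
from nltk import sent_tokenize, word_tokenize, pos_tag, ne_chunk

raw_text = "Apple Inc. was founded by Steve Jobs in Cupertino, California."
sentences = sent_tokenize(raw_text)                   # sentence segmentation
tokens = [word_tokenize(sent) for sent in sentences]  # word segmentation
tagged = [pos_tag(sent) for sent in tokens]           # part-of-speech tagging
chunks = [ne_chunk(sent) for sent in tagged]          # named entity detection
print(chunks[0])  # a Tree whose subtrees mark PERSON, ORGANIZATION, GPE, etc.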

Step 1: Retrieving and Preprocessing (Tokenizing) Data

First, we use the Wikipedia API to retrieve an article about a specific topic (in any language supported by Wikipedia), in this case "Apple Inc.". We then tokenize the text into sentences and further tokenize each sentence into words.

# Import Libraries
from wikipediaapi import Wikipedia
from nltk.tokenize import word_tokenize, sent_tokenize

# Wikipedia API to retrieve an article about "Apple Inc."
# wikipedia api requires user_agent info, you can use a dummy user agent:
user_agent = 'MyProjectName (merlin@example.com)'
wiki_wiki = Wikipedia(user_agent=user_agent, language='en')
article = wiki_wiki.page('Apple Inc.')

# Tokenization
# Tokenize the text into sentences
sentences = sent_tokenize(article.text) # use 'text' attribute to access content
# display first 3 sentences
print(sentences[:3]) 
# Further tokenize each sentence into words
token_sentences = [word_tokenize(sent) for sent in sentences]
# display tokens of the 1st sentence
print(token_sentences[0]) 

Exercise 1:

  • Choose a different topic of interest and retrieve its Wikipedia article.

  • Modify the code to print the first five sentences instead of the initial three.

Solution 1
user_agent = 'MyProjectName (merlin@example.com)'
wiki_wiki = Wikipedia(user_agent=user_agent, language='en')
article = wiki_wiki.page('Tesla Inc.')

# Tokenization
# Tokenize the text into sentences
sentences = sent_tokenize(article.text) # use 'text' attribute to access content
# display first 5 sentences
print(sentences[:5]) 

Step 2: Part-of-Speech Tagging

Next, we perform Part-of-Speech (POS) tagging on the tokenized sentences to determine the grammatical structure and identify parts of speech for each word.

# Perform Part-of-Speech (POS) tagging
from nltk.tag import pos_tag

# Tag each tokenized sentence into parts of speech
pos_sentences = [pos_tag(sent) for sent in token_sentences]

# Print the POS tags for the first sentence
pos_sentences[0]

Exercise 2: Print the POS tags for the first ten sentences

Solution 2
# Print the POS tags for the first ten sentences
for sent in pos_sentences[:10]:
    print(sent)

Step 3: Named Entity Recognition

We apply NLTK's ne_chunk_sents function to extract named entities, which are then categorized and counted using a defaultdict. Be aware that you may also need to download NLTK's maxent_ne_chunker and words packages to run the ne_chunk_sents function. You can download them as follows:

import nltk
nltk.download('maxent_ne_chunker')
nltk.download('words')

We can now extract named entities:

# Named Entity Recognition (NER)
from nltk.chunk import ne_chunk_sents
from collections import defaultdict

# Create the named entity chunks
chunked_sentences = ne_chunk_sents(pos_sentences)

# Create a defaultdict for NER categories
ner_categories = defaultdict(int)

# Count and categorize named entities
for sent in chunked_sentences:
    for chunk in sent:
        if hasattr(chunk, 'label'):
            ner_categories[chunk.label()] += 1

ner_categories

When calling ne_chunk_sents(pos_sentences, binary=False), a few things are happening:

  1. pos_sentences: This is the list of POS-tagged sentences, in which each word is already tagged with its part of speech, e.g. [('Apple', 'NNP'), ('Inc.', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('company', 'NN')]. Words are labeled as nouns (NN), verbs (VB), adjectives (JJ), and so on.

  2. binary (optional parameter): When binary is set to True, the function performs binary named entity chunking, where it labels named entities simply as "NE" without specifying the entity type (e.g., 'PERSON', 'ORGANIZATION', etc.). If binary is set to False (default), it includes the entity type information (e.g., 'GPE' for geopolitical entity, 'PERSON', 'ORGANIZATION', etc.).

In essence, the ne_chunk_sents() function applies a parsing algorithm to identify sequences of words that represent named entities, classifies these sequences into predefined categories, and marks them with specific labels according to their entity types (if not using binary chunking).

For instance, given the sentence "Apple Inc. is a technology company," the ne_chunk_sents() function identifies "Apple Inc." as a single named entity. With binary set to True it would be labeled simply as NE; with binary set to False (the default), it would further be specified as an ORGANIZATION entity.
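
To see the effect of the binary flag on a single sentence, here is a small illustrative sketch using nltk.ne_chunk, the per-sentence counterpart of ne_chunk_sents (the sample sentence is made up):

from nltk import word_tokenize, pos_tag, ne_chunk

tagged = pos_tag(word_tokenize("Apple Inc. is based in Cupertino."))

# binary=True: entities are marked simply as NE
print(ne_chunk(tagged, binary=True))
# binary=False (the default): entities carry a type such as ORGANIZATION or GPE
print(ne_chunk(tagged))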

The hasattr() function checks whether an object has a particular attribute or not. It takes two arguments: the object and the attribute name as a string. Therefore, hasattr(chunk, 'label') is used to check if the current element being processed within the NER results has a 'label' attribute. If it does, this indicates that the element represents a named entity, thereby enabling further processing or counting of the named entities.

The defaultdict feature enables us to set up a dictionary that automatically assigns a default value to keys that don't exist. By passing int as the argument, we ensure that any absent key receives a default value of 0. This functionality proves especially useful for counting the named entity categories in this exercise.
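
As a small, self-contained illustration of these two building blocks (the keys and values here are toy examples, not counts from the article):

from collections import defaultdict

counts = defaultdict(int)    # absent keys default to 0
counts['ORGANIZATION'] += 1  # no KeyError even though the key is new
counts['GPE'] += 1
print(counts['PERSON'])      # 0, the default for a key that was never added

# hasattr() distinguishes chunk Trees (which have a label() method) from
# plain (word, tag) tuples, which do not
print(hasattr(('Apple', 'NNP'), 'label'))  # False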

Exercise 3:

  • Identify and print the named entities (NEs) detected in the article.

  • Count and display the number of named entities for different categories like persons, organizations, and locations.

Solution 3
# Identify and print the named entities detected
# Note: ne_chunk_sents returns a generator, which was consumed by the counting
# loop above, so re-create the chunks before iterating again
chunked_sentences = ne_chunk_sents(pos_sentences)
for sent in chunked_sentences:
    for chunk in sent:
        if hasattr(chunk, 'label'):
            print(chunk)

# Count and display the number of named entities for different categories
print(ner_categories)

Step 4: Visualizing NER Categories

To visualize the distribution of detected NER categories, we create a pie chart using Matplotlib.

# Import Matplotlib for plotting (not imported earlier in this tutorial)
import matplotlib.pyplot as plt

# Create labels and values for the pie chart
labels = list(ner_categories.keys())
values = list(ner_categories.values())

# Create the pie chart
plt.pie(values, labels=labels, autopct='%1.1f%%', startangle=140)

# Display the chart
plt.show()

Exercise 4:

  • Customize the pie chart to display percentages with no decimal points.

  • Enhance the chart by providing different colors for each NER category.

Solution 4
# Customize the pie chart
# Display percentages with no decimal points and give each NER category a
# distinct color (any Matplotlib colormap works; tab10 is used here)
colors = plt.cm.tab10.colors[:len(labels)]
plt.pie(values, labels=labels, colors=colors, autopct='%1.0f%%', startangle=140)
plt.show()

Conclusion

Understanding NER is pivotal in extracting meaningful insights from text. This tutorial showcased how to retrieve text data, tokenize, perform POS tagging, extract named entities, and visualize their distribution. These steps lay a strong foundation for diving deeper into NLP tasks.

By following these steps and exercises, you can gain a better understanding of NER and its application in NLP.

Feel free to modify and expand on these exercises to deepen your understanding and explore more features of the NLTK library.

Code Repository: https://github.com/sedarsahin/NLP/tree/master/NameEntityRecognition