Search Engines in Traditional and Semantic Web

Chapter 2: Understanding How Search Works

Eng. Dr. Tiroshan Madushanka

Chapter 2 Overview 📚

Comparing traditional and semantic search

What We'll Learn

Before diving into the technical nuts and bolts of the Semantic Web, we'll understand how search engines work in both environments.

Goal: By comparing traditional and semantic search engines, we'll see the clear benefits and identify vital components of the Semantic Web.

Today's Journey

  • How traditional search engines build index tables
  • How crawlers work (depth-first vs breadth-first)
  • The role of metadata in search
  • Building a hypothetical Semantic Web search engine
  • Real-world example: Searching for digital SLR cameras
  • Identifying key components needed for the Semantic Web

🤔 Opening Question (5 min)

Think about your last frustrating search experience:

What made it frustrating? Too many results? Wrong results? Having to try multiple search terms?

The Central Question

Remember searching for "SOAP" in Chapter 1? We got dish detergent, soap operas, and everything except the W3C protocol we wanted.

Today we'll understand WHY this happens and HOW the Semantic Web can fix it.

2.1 Traditional Search Engines 🔍

Understanding the basics

The Current Reality

The Web has billions of pages, and search engines (Google, Yahoo, etc.) are how we find information.

Key Fact: Traditional search engines are essentially keyword matchers.

How It Works: The Indexation Process

  1. Before the search engine launches: It prepares a huge index table
     This is done by a special program called a crawler or spider

  2. The crawler starts with seed URLs: Given by the main control component
     Example: Start at www.example.com

  3. Build index for every word: Create entries mapping words to URLs
     Also track how many times each word appears (for ranking)

  4. Follow links and repeat: Move to the next page and continue
     This continues until resources are exhausted
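
The four steps above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the "web" is a hypothetical in-memory dictionary (TOY_WEB), the page texts and links are invented, and the max_pages budget stands in for "until resources are exhausted". A real crawler would fetch pages over HTTP and parse HTML.

```python
from collections import Counter, defaultdict, deque

# Hypothetical toy web: URL -> (page text, outgoing links). Invented for illustration.
TOY_WEB = {
    "www.example.com": ("digital camera camera reviews",
                        ["www.example.com/a", "www.example.com/b"]),
    "www.example.com/a": ("slr camera slr lenses", ["www.example.com/b"]),
    "www.example.com/b": ("film photography", []),
}


def crawl_and_index(seed_urls, max_pages=100):
    """Visit pages starting from the seeds and build word -> {url: count}."""
    index = defaultdict(dict)
    frontier = deque(seed_urls)      # step 2: start with the seed URLs
    visited = set()
    while frontier and len(visited) < max_pages:   # stop when the budget runs out
        url = frontier.popleft()
        if url in visited or url not in TOY_WEB:
            continue
        visited.add(url)
        text, links = TOY_WEB[url]
        # step 3: one index entry per word, with its number of appearances
        for word, count in Counter(text.lower().split()).items():
            index[word][url] = count
        frontier.extend(links)       # step 4: follow links and repeat
    return index


index = crawl_and_index(["www.example.com"])
print(index["camera"])
```

Here the page budget is the only "resource" being modeled; time, storage, and bandwidth limits would play the same role in a real system.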

Staggering Scale

Google can index about 4-5 billion web pages.

That's only 1% of the entire World Wide Web!

How a Crawler Works: Interactive Demo 🕷️

Visualizing the crawling process

[Interactive demo: a crawler 🕷️ starts at the Seed Page and visits Page 1 through Page 6]

What You're Seeing: Depth-First Search

The crawler starts at the seed page, processes it, then follows the FIRST link as deep as it can go before coming back to the remaining links.

Alternative: Breadth-first search would visit all pages at one level before going deeper.
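
To make the difference concrete, here is a small sketch that crawls the same toy link graph in both orders. The link layout (LINKS) is an assumption made up for this example, since the demo's actual link structure is not given. The only difference between the two strategies is which end of the frontier the next page is taken from.

```python
from collections import deque

# Hypothetical link graph: page -> links in the order they appear on the page.
LINKS = {
    "Seed": ["Page 1", "Page 2"],
    "Page 1": ["Page 3", "Page 4"],
    "Page 2": ["Page 5", "Page 6"],
    "Page 3": [], "Page 4": [], "Page 5": [], "Page 6": [],
}


def crawl_order(seed, strategy):
    """Return the order in which pages are visited."""
    frontier = deque([seed])
    visited = []
    while frontier:
        # Depth-first takes the newest page (stack); breadth-first the oldest (queue).
        page = frontier.pop() if strategy == "depth-first" else frontier.popleft()
        if page in visited:
            continue
        visited.append(page)
        links = LINKS.get(page, [])
        if strategy == "depth-first":
            # Push in reverse so the FIRST link on the page is followed first.
            links = reversed(links)
        frontier.extend(links)
    return visited


print(crawl_order("Seed", "depth-first"))
print(crawl_order("Seed", "breadth-first"))
```

With this layout, depth-first finishes everything under Page 1 before touching Page 2, while breadth-first visits Page 1 and Page 2 before any of Pages 3 to 6.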

💭 Think About It (3 min)

Which search strategy do you think is better: depth-first or breadth-first? Why?

Consider: freshness of content, memory usage, likelihood of finding popular pages

The Index Table: The Heart of Search 📊

Understanding data storage

What Gets Stored

For each word found, the crawler stores:

  • Word/Term: The keyword itself
  • Document Location: URL where it appears
  • Number of Appearances: How many times (for ranking)

Evolution of the Index Table

Initial Table (First Page):

Word         Document Location       Appearances
camera       www.example.com/page1   5
digital      www.example.com/page1   3
photography  www.example.com/page1   2

After Multiple Pages:

Word     Document Locations
camera   page1 (5 times) → page2 (3 times) → page5 (7 times)
digital  page1 (3 times) → page3 (4 times)
SLR      page2 (8 times) → page4 (2 times) → page6 (5 times)
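
One possible way to hold such a table in memory is a dictionary that maps each word to a growing list of (document, appearances) records. The sketch below is an assumption about representation, not how any particular search engine stores its index; the helper name add_record is made up for this example, and the data is the "After Multiple Pages" table from the slide.

```python
from collections import defaultdict

# word -> list of (document location, number of appearances)
index_table = defaultdict(list)


def add_record(table, word, location, appearances):
    """Append one document record to the chain for this word."""
    table[word].append((location, appearances))


# Rows taken from the "After Multiple Pages" table.
for word, location, appearances in [
    ("camera", "page1", 5), ("camera", "page2", 3), ("camera", "page5", 7),
    ("digital", "page1", 3), ("digital", "page3", 4),
    ("SLR", "page2", 8), ("SLR", "page4", 2), ("SLR", "page6", 5),
]:
    add_record(index_table, word, location, appearances)

print(index_table["SLR"])
```

Each arrow (→) in the table corresponds to one more record appended to the word's list as the crawler visits new pages.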

How Search Works

  1. User enters keyword "SLR"
  2. Engine looks up "SLR" in index table
  3. Retrieves all document records
  4. Sorts by number of appearances (ranking)
  5. Returns URLs in ranked order

Problem: If "SLR" isn't in the index, you get "no results found" - even if relevant pages exist!
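
The lookup steps, and the failure case, can be sketched as follows. The table contents come from the slide; the function name search and the choice to return an empty list for "no results found" are assumptions for this illustration.

```python
# Index table from the slide: word -> list of (document, appearances).
index_table = {
    "camera": [("page1", 5), ("page2", 3), ("page5", 7)],
    "digital": [("page1", 3), ("page3", 4)],
    "SLR": [("page2", 8), ("page4", 2), ("page6", 5)],
}


def search(table, keyword):
    """Look up the keyword and return documents ranked by appearances."""
    records = table.get(keyword, [])          # steps 2-3: look up and retrieve
    ranked = sorted(records, key=lambda r: r[1], reverse=True)  # step 4: rank
    return [location for location, _ in ranked]                 # step 5: return


print(search(index_table, "SLR"))                 # ranked by keyword count
print(search(index_table, "single lens reflex"))  # not in the index: no results
```

The second query returns nothing even though pages about single lens reflex cameras are clearly relevant to "SLR": the engine only compares strings.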

The Problem: A Real Example 🎯

Traditional search limitations exposed

The Photography Scenario

You're an amateur photographer interested in buying a digital SLR camera. You want to learn about their performance characteristics.

Your search term: "SLR"

❌ The Core Problem

The search engine can ONLY match keywords. It cannot understand MEANING.

  • Vendor sites rank high because they use "SLR" many times
  • Informational sites rank low (fewer keyword occurrences)
  • Excellent sites are COMPLETELY MISSED (different terminology)

2.2 Semantic Web Search: Step 1 📚

Build a Common Vocabulary

The Key Insight

Semantics means MEANING. To add meaning to the Web, we need a way to EXPRESS meaning.

Solution: Build a vocabulary with meaning/knowledge coded into its terms.

Photography Domain Vocabulary

Class hierarchy (top level to bottom level):

  Camera
    Digital | Film
      SLR | Point-And-Shoot
Equivalence Relations:

  SingleLensReflex ≡ SLR ≡ SLR-Camera

SLR has properties:

  • ShutterSpeed
  • Aperture

Benefits of This Vocabulary

  • Standard way to express meaning/knowledge in the photography domain
  • Common understanding shared by different parties on the Web
  • Machine-readable (structured data)
  • Captures relationships and equivalences

💡 Key Insight (3 min)

Now the computer "knows" that:

  • SLR = SingleLensReflex = SLR-Camera (same concept)
  • ShutterSpeed and Aperture are PROPERTIES of SLR
  • If a page discusses ShutterSpeed, it's likely about SLR cameras

This is structured knowledge that machines can process!
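
As a rough illustration of what "machine-processable" means here, the equivalences and properties above can be written down as plain Python data. This is a stand-in for a real ontology file, not the OWL format itself; the names EQUIVALENT, PROPERTY_OF, same_concept and concept_described_by are invented for this sketch.

```python
# Groups of terms that name the same concept.
EQUIVALENT = [{"SLR", "SingleLensReflex", "SLR-Camera"}]

# Property -> the concept it belongs to.
PROPERTY_OF = {"ShutterSpeed": "SLR", "Aperture": "SLR"}


def same_concept(a, b):
    """True if both terms are identical or listed as equivalent."""
    return a == b or any(a in group and b in group for group in EQUIVALENT)


def concept_described_by(term):
    """Return the concept a property belongs to, or None if it is not a known property."""
    return PROPERTY_OF.get(term)


print(same_concept("SLR", "SingleLensReflex"))
print(concept_described_by("ShutterSpeed"))
```

A program can now answer questions such as "are these two terms the same thing?" without any string matching between the terms themselves.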

Step 2: Mark Up the Pages 🏷️

Connecting pages to vocabulary

Once we have a common vocabulary (let's call it mySimpleCamera.owl), page owners need to explicitly indicate that words on their pages have the same semantics as concepts in the vocabulary.

The Markup Process

  1. Create a special description file
     This file says: "The word SLR on my pages means the same as SLR in mySimpleCamera.owl"

  2. Link pages to the description file
     Add a link in the page's metadata section

<HTML>
  <HEAD>
    <TITLE>The Performance of a Digital Camera</TITLE>
    <!-- This link connects the page to semantic markup -->
    <LINK rel="help" href="http://example.com/slr-markup.rdf">
    ...other stuff
  </HEAD>
  <BODY>
    ...content about SLR cameras...
  </BODY>
</HTML>
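
A crawler has to discover this link before it can use the markup. Below is a minimal sketch using Python's standard html.parser module. Treating any LINK whose href ends in ".rdf" as the markup link is an assumption made for this example; the slides do not specify how the crawler recognizes it. The class name MarkupLinkFinder is also invented.

```python
from html.parser import HTMLParser

# The example page from the slide, with the elided parts left out.
PAGE = """
<HTML> <HEAD> <TITLE>The Performance of a Digital Camera</TITLE>
<!-- This link connects the page to semantic markup -->
<LINK rel="help" href="http://example.com/slr-markup.rdf">
</HEAD> <BODY> content about SLR cameras </BODY> </HTML>
"""


class MarkupLinkFinder(HTMLParser):
    """Collect href values of LINK tags that point at an .rdf file."""

    def __init__(self):
        super().__init__()
        self.markup_urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lowercase.
        if tag != "link":
            return
        href = dict(attrs).get("href") or ""
        if href.endswith(".rdf"):     # assumed convention for this sketch
            self.markup_urls.append(href)


finder = MarkupLinkFinder()
finder.feed(PAGE)
print(finder.markup_urls)
```

A page without such a link simply yields an empty list, which is how a crawler could tell that the page carries no semantic markup.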

Real-World Examples

www.goodPhoto.com markup says:

"The word 'SLR' on my pages = SLR concept in mySimpleCamera.owl"

www.digcamhelp.com markup says:

"'shutter speed' on my pages = ShutterSpeed property in mySimpleCamera.owl"
"'aperture' on my pages = Aperture property in mySimpleCamera.owl"

www.ehow.com markup says:

"'single lens reflex' on my pages = SingleLensReflex in mySimpleCamera.owl"

⚠️ Important Note

Vendor sites (cheapCameras.com, buyItHere.com) do NOT mark up their pages because their semantics are about SELLING, not the technical concepts in the photography vocabulary.

Step 3: The Smart Crawler 🧠

Enhanced crawling with reasoning

Enhanced Crawling Process

Now the crawler can understand and reason about page content!

How the Smart Crawler Works

❌ Traditional Crawler

At www.cheapCameras.com:

  1. Sees word "SLR"
  2. Adds to index table
  3. No metadata? No problem!
  4. Just keyword matching

✓ Smart Crawler

At www.cheapCameras.com:

  1. Sees word "SLR"
  2. Checks: Any markup link? NO
  3. Adds to index: markupURL=none
  4. Flags as non-semantic

Smart Crawler Intelligence

At www.goodPhoto.com:

  1. Sees markup link → Downloads markup file
  2. Learns: "SLR" = SLR in mySimpleCamera.owl
  3. Downloads mySimpleCamera.owl for reference
  4. When it finds "SLR", adds it to the index with markupURL=mySimpleCamera.owl

At www.digcamhelp.com (DOESN'T contain "SLR"):

  1. Sees markup link → Downloads markup file
  2. Learns: "shutter speed" = ShutterSpeed in mySimpleCamera.owl
  3. Recalls from vocabulary: ShutterSpeed is a property of SLR
  4. Infers: This page is about SLR!
  5. Adds entry under "SLR" in index table even though the word doesn't appear!

At www.ehow.com (uses "single lens reflex"):

  1. Sees markup link → Downloads markup file
  2. Learns: "single lens reflex" = SingleLensReflex in mySimpleCamera.owl
  3. Recalls from vocabulary: SingleLensReflex ≡ SLR
  4. Infers: This page is ALSO about SLR!
  5. Adds entry under "SLR" in index table
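
The four walkthroughs above can be condensed into one sketch. Everything here is a simplification under stated assumptions: page texts are invented, each site's markup file is reduced to a phrase-to-term dictionary, the vocabulary is reduced to two sets, and the function name build_slr_entries is hypothetical. It is meant to show the shape of the reasoning, not a real crawler.

```python
VOCAB_URL = "mySimpleCamera.owl"
EQUIVALENT_TO_SLR = {"SLR", "SingleLensReflex", "SLR-Camera"}   # from the vocabulary
PROPERTIES_OF_SLR = {"ShutterSpeed", "Aperture"}                # from the vocabulary

# Hypothetical crawl input: site -> (page text, markup mapping or None).
PAGES = {
    "www.cheapCameras.com": ("buy SLR cheap SLR deals", None),
    "www.goodPhoto.com": ("SLR performance guide", {"SLR": "SLR"}),
    "www.digcamhelp.com": ("shutter speed and aperture explained",
                           {"shutter speed": "ShutterSpeed", "aperture": "Aperture"}),
    "www.ehow.com": ("how a single lens reflex works",
                     {"single lens reflex": "SingleLensReflex"}),
}


def build_slr_entries(pages):
    """Return the index entries filed under "SLR", with the reason for each."""
    entries = []
    for url, (text, markup) in pages.items():
        if markup is None:
            # No markup link: fall back to plain keyword matching.
            if "SLR" in text.split():
                entries.append({"url": url, "markupURL": None, "reason": "keyword"})
            continue
        for phrase, term in markup.items():
            if phrase.lower() not in text.lower():
                continue
            if term in EQUIVALENT_TO_SLR:
                reason = "same or equivalent concept"
            elif term in PROPERTIES_OF_SLR:
                reason = "property of SLR"
            else:
                continue
            entries.append({"url": url, "markupURL": VOCAB_URL, "reason": reason})
            break   # one entry per page is enough
    return entries


for entry in build_slr_entries(PAGES):
    print(entry)
```

Note that www.digcamhelp.com and www.ehow.com end up under "SLR" although neither page contains that word; the entry is justified by the vocabulary, not by the text.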

🎯 The Magic

  • Crawler can REASON using the vocabulary
  • Understands equivalences (SLR = SingleLensReflex)
  • Understands relationships (ShutterSpeed is property of SLR)
  • Makes intelligent inferences
  • Finds relevant pages even without exact keywords!

Using the Semantic Search Engine 🎉

The power of semantic understanding

🎊 The Results

  • All vendor sites are GONE!
  • Only informational pages that match YOUR semantic intent
  • Pages found even with different terminology
  • Pages found based on related properties
  • Much smaller, much more relevant result set
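
One way to picture how the vendor sites disappear: at query time the engine keeps only the index entries whose markupURL points at the vocabulary the user cares about. The sketch below assumes an index shaped like the one the smart crawler built; the entry layout and the function name semantic_search are assumptions for this example.

```python
# Hypothetical entries filed under "SLR" by the smart crawler.
semantic_index = {
    "SLR": [
        {"url": "www.cheapCameras.com", "markupURL": None},
        {"url": "www.buyItHere.com", "markupURL": None},
        {"url": "www.goodPhoto.com", "markupURL": "mySimpleCamera.owl"},
        {"url": "www.digcamhelp.com", "markupURL": "mySimpleCamera.owl"},
        {"url": "www.ehow.com", "markupURL": "mySimpleCamera.owl"},
    ]
}


def semantic_search(index, keyword, vocabulary_url):
    """Return only pages whose entry is tied to the given vocabulary."""
    return [
        entry["url"]
        for entry in index.get(keyword, [])
        if entry["markupURL"] == vocabulary_url
    ]


print(semantic_search(semantic_index, "SLR", "mySimpleCamera.owl"))
```

The vendor pages are still in the index (markupURL=none), so a plain keyword search could still return them; they are filtered out only when the user asks for the photography meaning of "SLR".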

💬 Discussion (10 min)

Questions to consider:

  1. What are the advantages and disadvantages of this approach?
  2. What challenges do you see in getting page owners to add markup?
  3. How could this help in YOUR field or industry?

Summary & Key Takeaways 🎓

Consolidating our understanding

Traditional Web Search

  • ❌ Keyword matching only
  • ❌ No understanding of meaning
  • ❌ Irrelevant results
  • ❌ Missed relevant pages
  • ❌ Manual filtering required

Semantic Web Search

  • ✓ Understands meaning
  • ✓ Uses structured vocabulary
  • ✓ Makes intelligent inferences
  • ✓ Finds relevant marked-up pages, even with different terminology
  • ✓ Filters automatically

🎯 Essential Components of Semantic Web

  • Structured Vocabulary: Domain-specific, machine-readable (e.g., mySimpleCamera.owl)
  • Markup/Metadata: Links pages to vocabulary concepts
  • Smart Agents/Crawlers: Can understand and reason about content
  • Inference Engines: Make logical deductions from structured data

📘 Improved Definition of Semantic Web

The Semantic Web is an extension of the current Web. It is constructed by linking current Web pages to a structured data set that indicates the semantics of the linked page. A smart agent, which is able to understand this structured data set, will then be able to conduct intelligent actions and make educated decisions on a global scale.

🔑 The Key to Implementation

The structured data set (common vocabulary) is called an ONTOLOGY.

Languages to build ontologies: RDF, RDFS, OWL

(We'll learn about these in Part 2!)

📝 Quick Knowledge Check

Q1: What is the main limitation of traditional search engines?

A) They are too slow
B) They can only do keyword matching and don't understand meaning
C) They don't index enough pages

Q2: In the Semantic Web, what is the structured vocabulary file commonly called?

A) A metadata schema
B) An ontology
C) A thesaurus

Q3: How does the smart crawler find pages about SLR cameras that use "single lens reflex"?

A) It guesses based on context
B) It uses markup metadata and vocabulary equivalences to infer the relationship
C) It searches for all possible synonyms

🚀 Looking Ahead: Part 2

In the next chapters, we'll learn the technical details:

  • RDF: Resource Description Framework
  • RDFS: RDF Schema
  • OWL: Web Ontology Language
  • Tools: How to build your own ontologies