Search Engines in Traditional and Semantic Web

Chapter 2: Understanding How Search Works

Eng. Dr. Tiroshan Madushanka

Chapter 2 Overview 📚

Comparing traditional and semantic search

What We'll Learn

Before diving into the technical nuts and bolts of the Semantic Web, we'll understand how search engines work in both environments.

Goal: By comparing traditional and semantic search engines, we'll see the clear benefits and identify vital components of the Semantic Web.

Today's Journey

How traditional search engines build index tables
How crawlers work (depth-first vs breadth-first)
The role of metadata in search
Building a hypothetical Semantic Web search engine
Real-world example: Searching for digital SLR cameras
Identifying key components needed for the Semantic Web

🤔 Opening Question (5 min)

Think about your last frustrating search experience:

What made it frustrating? Too many results? Wrong results? Having to try multiple search terms?

The Central Question

Remember searching for "SOAP" in Chapter 1? We got dish detergent, soap operas, and everything except the W3C protocol we wanted.

Today we'll understand WHY this happens and HOW the Semantic Web can fix it.

2.1 Traditional Search Engines 🔍

Understanding the basics

The Current Reality

The Web has billions of pages, and search engines (Google, Yahoo, etc.) are how we find information.

Key Fact: Traditional search engines are essentially keyword matchers.

How It Works: The Indexation Process

1

Before the search engine launches: It prepares a huge index table

This is done by special software called a crawler or spider

2

The crawler starts with seed URLs: Given by the main control component

Example: Start at www.example.com

3

Build index for every word: Create entries mapping words to URLs

Also track how many times each word appears (for ranking)

4

Follow links and repeat: Move to the next page and continue

This continues until resources are exhausted
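The four steps above can be sketched in a few lines. This is only a toy illustration, not how a production crawler is built: the TOY_WEB dictionary, the page texts, and the max_pages budget are all made up for the example, and pages are read from memory instead of being downloaded.

```python
import re
from collections import Counter, defaultdict, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "www.example.com": ("digital camera reviews camera", ["www.example.com/page2"]),
    "www.example.com/page2": ("SLR camera guide SLR",
                              ["www.example.com", "www.example.com/page3"]),
    "www.example.com/page3": ("film photography", []),
}

def crawl_and_index(seed_urls, web, max_pages=100):
    """Visit pages starting from the seeds and map each word to {url: count}."""
    index = defaultdict(dict)      # word -> {url: number of appearances}
    frontier = deque(seed_urls)    # Step 2: start from the seed URLs
    visited = set()
    while frontier and len(visited) < max_pages:   # stop when the budget runs out
        url = frontier.popleft()
        if url in visited or url not in web:
            continue
        visited.add(url)
        text, links = web[url]
        # Step 3: one index entry per word, with its appearance count.
        for word, count in Counter(re.findall(r"\w+", text.lower())).items():
            index[word][url] = count
        # Step 4: follow the links and repeat.
        frontier.extend(links)
    return index

index = crawl_and_index(["www.example.com"], TOY_WEB)
print(index["camera"])  # {'www.example.com': 2, 'www.example.com/page2': 1}
```

Note that words are lowercased before indexing, so "SLR" is stored under "slr" in this sketch.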

Staggering Scale

By an often-quoted (and now dated) estimate, Google could index about 4-5 billion web pages.

That was estimated to be only about 1% of the entire World Wide Web!

How a Crawler Works: Interactive Demo 🕷️

Visualizing the crawling process

[Interactive demo: a crawler (🕷️) starts at the Seed page and visits Pages 1 to 6]

What You're Seeing: Depth-First Search

The crawler starts at the seed, processes the entire first page, then follows the FIRST link on it as deep as it goes before coming back to the other links.

Alternative: Breadth-first search would visit all pages at one level before going deeper.
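The difference between the two strategies is easiest to see in the visiting order. The sketch below assumes a hypothetical link structure (the seed links to Pages 1 and 2, which link to Pages 3-4 and 5-6); the slides do not specify which page links to which.

```python
from collections import deque

# Hypothetical link graph; the actual links in the demo are not given.
LINKS = {
    "Seed": ["Page 1", "Page 2"],
    "Page 1": ["Page 3", "Page 4"],
    "Page 2": ["Page 5", "Page 6"],
    "Page 3": [], "Page 4": [], "Page 5": [], "Page 6": [],
}

def crawl_order(seed, links, strategy):
    """Return the order in which pages are visited."""
    frontier = deque([seed])
    visited = []
    while frontier:
        # Depth-first takes the newest URL (stack); breadth-first the oldest (queue).
        page = frontier.pop() if strategy == "depth" else frontier.popleft()
        if page in visited:
            continue
        visited.append(page)
        out = links[page]
        # Reverse for depth-first so the FIRST link on the page is followed first.
        frontier.extend(reversed(out) if strategy == "depth" else out)
    return visited

print(crawl_order("Seed", LINKS, "depth"))
# ['Seed', 'Page 1', 'Page 3', 'Page 4', 'Page 2', 'Page 5', 'Page 6']
print(crawl_order("Seed", LINKS, "breadth"))
# ['Seed', 'Page 1', 'Page 2', 'Page 3', 'Page 4', 'Page 5', 'Page 6']
```

The only change between the two strategies is which end of the frontier the next URL is taken from.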

💭 Think About It (3 min)

Which search strategy do you think is better: depth-first or breadth-first? Why?

Consider: freshness of content, memory usage, likelihood of finding popular pages

The Index Table: The Heart of Search 📊

Understanding data storage

What Gets Stored

For each word found, the crawler stores:

Word/Term: The keyword itself
Document Location: URL where it appears
Number of Appearances: How many times (for ranking)

Evolution of the Index Table

Initial Table (First Page):

Word	Document Location	Appearances
camera	www.example.com/page1	5
digital	www.example.com/page1	3
photography	www.example.com/page1	2

After Multiple Pages:

Word	Document Locations
camera	page1 (5 times) → page2 (3 times) → page5 (7 times)
digital	page1 (3 times) → page3 (4 times)
SLR	page2 (8 times) → page4 (2 times) → page6 (5 times)

How Search Works

User enters keyword "SLR"
Engine looks up "SLR" in index table
Retrieves all document records
Sorts by number of appearances (ranking)
Returns URLs in ranked order

Problem: If "SLR" isn't in the index, you get "no results found" - even if relevant pages exist!
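The lookup steps can be sketched as follows. The index contents are copied from the "After Multiple Pages" table; the dictionary layout and the search function name are assumptions made for illustration.

```python
# Index table from the example: word -> {page: number of appearances}.
INDEX = {
    "camera": {"page1": 5, "page2": 3, "page5": 7},
    "digital": {"page1": 3, "page3": 4},
    "SLR": {"page2": 8, "page4": 2, "page6": 5},
}

def search(keyword, index):
    """Look up one keyword and return its pages ranked by appearance count."""
    postings = index.get(keyword)
    if postings is None:
        return []   # keyword not in the index: "no results found"
    return sorted(postings, key=postings.get, reverse=True)

print(search("SLR", INDEX))                 # ['page2', 'page6', 'page4']
print(search("single lens reflex", INDEX))  # []
```

The second call shows the problem: a page that says "single lens reflex" can never be returned for "SLR", because the lookup is an exact match on the keyword.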

The Problem: A Real Example 🎯

Traditional search limitations exposed

The Photography Scenario

You're an amateur photographer interested in buying a digital SLR camera. You want to learn about their performance characteristics.

Your search term: "SLR"

Traditional Search Engine Demo

Search Results for "SLR" - 128,000,000 results

1. www.cheapCameras.com - Buy SLR Cameras Today!

Best deals on SLR cameras. Shop now for lowest prices on digital SLR...

❌ Vendor site - not what you want

2. www.buyItHere.com - SLR Camera Sale

Huge SLR camera sale! Canon, Nikon, Sony SLR cameras in stock...

❌ Another vendor site

3. www.photoStore.com - Professional SLR Equipment

Professional SLR cameras and lenses. Free shipping on SLR orders...

❌ More selling, not information

47. www.goodPhoto.com - Understanding SLR Performance

Learn about SLR camera shutter speed, aperture, and performance metrics...

✅ This is what you want! (But you have to scroll through 46 vendor sites first)

❓ www.digcamhelp.com - Camera Shutter Speed and Aperture

Detailed guide to understanding shutter speed and aperture in digital cameras...

⚠️ NEVER FOUND - never uses the word "SLR"; it only discusses shutter speed and aperture

❓ www.ehow.com - How Single Lens Reflex Cameras Work

Complete guide to single lens reflex camera technology and features...

⚠️ NEVER FOUND - uses "single lens reflex" instead of "SLR"

❌ The Core Problem

The search engine can ONLY match keywords. It cannot understand MEANING.

Vendor sites rank high because they use "SLR" many times
Informational sites rank low (fewer keyword occurrences)
Excellent sites are COMPLETELY MISSED (different terminology)

2.2 Semantic Web Search: Step 1 📚

Build a Common Vocabulary

The Key Insight

Semantics means MEANING. To add meaning to the Web, we need a way to EXPRESS meaning.

Solution: Build a vocabulary with meaning/knowledge coded into its terms.

Photography Domain Vocabulary

Camera

Digital | Film

SLR | Point-And-Shoot

Equivalence Relations: SingleLensReflex ≡ SLR ≡ SLR-Camera

SLR has properties: ShutterSpeed, Aperture

Benefits of This Vocabulary

Standard way to express meaning/knowledge in the photography domain
Common understanding shared by different parties on the Web
Machine-readable (structured data)
Captures relationships and equivalences

💡 Key Insight (3 min)

Now the computer "knows" that:

SLR = SingleLensReflex = SLR-Camera (same concept)
ShutterSpeed and Aperture are PROPERTIES of SLR
If a page discusses ShutterSpeed, it's likely about SLR cameras

This is structured knowledge that machines can process!
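To make "structured knowledge" concrete, here is a minimal sketch of the vocabulary as plain Python data. It is a stand-in for illustration only (a real vocabulary would be written in an ontology language). It assumes that SLR and Point-And-Shoot sit under Digital, which is how the diagram appears to be laid out. The function names are hypothetical.

```python
# Hypothetical in-memory stand-in for the photography vocabulary.
# Assumption: SLR and Point-And-Shoot are kinds of Digital camera.
SUBCLASS_OF = {
    "Digital": "Camera",
    "Film": "Camera",
    "SLR": "Digital",
    "Point-And-Shoot": "Digital",
}
EQUIVALENT = [{"SingleLensReflex", "SLR", "SLR-Camera"}]
PROPERTIES = {"SLR": {"ShutterSpeed", "Aperture"}}

def same_concept(a, b):
    """True if a and b are the same term or are declared equivalent."""
    return a == b or any(a in group and b in group for group in EQUIVALENT)

def owners_of(prop):
    """Return the concepts that have the given property."""
    return {concept for concept, props in PROPERTIES.items() if prop in props}

def ancestors(term):
    """Walk up the hierarchy from a term to the root."""
    chain = []
    while term in SUBCLASS_OF:
        term = SUBCLASS_OF[term]
        chain.append(term)
    return chain

print(same_concept("SingleLensReflex", "SLR"))  # True
print(owners_of("ShutterSpeed"))                # {'SLR'}
print(ancestors("SLR"))                         # ['Digital', 'Camera']
```

Each of the three facts on this slide becomes a simple lookup that a program can run.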

Step 2: Mark Up the Pages 🏷️

Connecting pages to vocabulary

Once we have a common vocabulary (let's call it mySimpleCamera.owl), page owners need to explicitly indicate that words on their pages have the same semantics as concepts in the vocabulary.

The Markup Process

1

Create a special description file

This file says: "The word SLR on my pages means the same as SLR in mySimpleCamera.owl"

2

Link pages to the description file

Add a link in the page's metadata section

<HTML>
<HEAD>
  <TITLE>The Performance of a Digital Camera</TITLE>
  
  <!-- This link connects the page to semantic markup -->
  <LINK rel="help" href="http://example.com/slr-markup.rdf">
  
  <!-- ...other stuff -->
</HEAD>
<BODY>
  ...content about SLR cameras...
</BODY>
</HTML>
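A crawler has to find this link before it can use the markup. The sketch below shows one possible way to do that with Python's standard html.parser module. The MarkupLinkFinder class name is hypothetical, and matching on rel="help" simply follows the example page above.

```python
from html.parser import HTMLParser

class MarkupLinkFinder(HTMLParser):
    """Collect the href of every <link> tag with a given rel value."""

    def __init__(self, rel="help"):
        super().__init__()
        self.rel = rel
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower()
        if tag == "link" and rel == self.rel and attr.get("href"):
            self.hrefs.append(attr["href"])

PAGE = """<HTML><HEAD>
<TITLE>The Performance of a Digital Camera</TITLE>
<LINK rel="help" href="http://example.com/slr-markup.rdf">
</HEAD><BODY>content about SLR cameras</BODY></HTML>"""

finder = MarkupLinkFinder()
finder.feed(PAGE)
print(finder.hrefs)  # ['http://example.com/slr-markup.rdf']
```

A page without such a link leaves finder.hrefs empty, which is how a crawler could tell that no markup is available.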

Real-World Examples

www.goodPhoto.com markup says:

"The word 'SLR' on my pages = SLR concept in mySimpleCamera.owl"

www.digcamhelp.com markup says:

"'shutter speed' on my pages = ShutterSpeed property in mySimpleCamera.owl;
'aperture' on my pages = Aperture property in mySimpleCamera.owl"

www.ehow.com markup says:

"'single lens reflex' on my pages = SingleLensReflex in mySimpleCamera.owl"

⚠️ Important Note

Vendor sites (cheapCameras.com, buyItHere.com) do NOT mark up their pages because their semantics are about SELLING, not the technical concepts in the photography vocabulary.

Step 3: The Smart Crawler 🧠

Enhanced crawling with reasoning

Enhanced Crawling Process

Now the crawler can understand and reason about page content!

How the Smart Crawler Works

❌ Traditional Crawler

At www.cheapCameras.com:

Sees word "SLR"
Adds to index table
No metadata? No problem!
Just keyword matching

✓ Smart Crawler

At www.cheapCameras.com:

Sees word "SLR"
Checks: Any markup link? NO
Adds to index: markupURL=none
Flags as non-semantic

Smart Crawler Intelligence

At www.goodPhoto.com:

Sees markup link → Downloads markup file
Learns: "SLR" = SLR in mySimpleCamera.owl
Downloads mySimpleCamera.owl for reference
When it finds "SLR", adds an index entry with markupURL=mySimpleCamera.owl

At www.digcamhelp.com (DOESN'T contain "SLR"):

Sees markup link → Downloads markup file
Learns: "shutter speed" = ShutterSpeed in mySimpleCamera.owl
Recalls from vocabulary: ShutterSpeed is a property of SLR
Infers: This page is about SLR!
Adds an entry under "SLR" in the index table even though the word doesn't appear!

At www.ehow.com (uses "single lens reflex"):

Sees markup link → Downloads markup file
Learns: "single lens reflex" = SingleLensReflex in mySimpleCamera.owl
Recalls from vocabulary: SingleLensReflex ≡ SLR
Infers: This page is ALSO about SLR!
Adds entry under "SLR" in index table
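The reasoning in these walkthroughs can be sketched as below. Everything here is a simplified assumption: markup files are modelled as small dictionaries instead of downloaded RDF, the page texts are short stand-ins based on the result snippets, and only the "SLR" index entry is built.

```python
from collections import defaultdict

VOCAB_URL = "mySimpleCamera.owl"
# Hypothetical stand-in for what the crawler learns from the vocabulary.
EQUIVALENT_TO_SLR = {"SLR", "SingleLensReflex", "SLR-Camera"}
SLR_PROPERTIES = {"ShutterSpeed", "Aperture"}

# URL -> (page text, markup mapping of page phrase -> vocabulary term).
# A markup of None means the page has no markup link.
PAGES = {
    "www.cheapCameras.com": ("Buy SLR cameras today", None),
    "www.goodPhoto.com": ("Understanding SLR performance", {"SLR": "SLR"}),
    "www.digcamhelp.com": ("Camera shutter speed and aperture",
                           {"shutter speed": "ShutterSpeed", "aperture": "Aperture"}),
    "www.ehow.com": ("How single lens reflex cameras work",
                     {"single lens reflex": "SingleLensReflex"}),
}

def smart_index(pages):
    """Build the "SLR" index entries as (url, markupURL) pairs."""
    index = defaultdict(list)
    for url, (text, markup) in pages.items():
        lowered = text.lower()
        if markup is None:
            # No markup link: plain keyword entry, flagged as non-semantic.
            if "slr" in lowered.split():
                index["SLR"].append((url, None))
            continue
        # Vocabulary terms whose marked-up phrase really occurs on the page.
        terms = {term for phrase, term in markup.items() if phrase.lower() in lowered}
        # Infer "about SLR" from an equivalent term or from an SLR property.
        if terms & EQUIVALENT_TO_SLR or terms & SLR_PROPERTIES:
            index["SLR"].append((url, VOCAB_URL))
    return index

for url, markup_url in smart_index(PAGES)["SLR"]:
    print(url, markup_url)
```

www.digcamhelp.com and www.ehow.com both end up under "SLR" even though neither page text contains that word.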

🎯 The Magic

Crawler can REASON using the vocabulary
Understands equivalences (SLR = SingleLensReflex)
Understands relationships (ShutterSpeed is a property of SLR)
Makes intelligent inferences
Finds relevant pages even without exact keywords!

Using the Semantic Search Engine 🎉

The power of semantic understanding

Semantic Search Engine Demo

Select Vocabulary (Semantics): mySimpleCamera.owl (Photography)

This tells the engine which MEANING of "SLR" you want

Semantic Search Results for "SLR" (Photography context)

✅ Only returning pages with markup pointing to mySimpleCamera.owl

1. www.goodPhoto.com - Understanding SLR Performance

Learn about SLR camera shutter speed, aperture, and performance metrics...

✅ Contains "SLR" + Marked up with mySimpleCamera.owl

2. www.digcamhelp.com - Camera Shutter Speed and Aperture

Detailed guide to understanding shutter speed and aperture in digital cameras...

✅ Doesn't contain "SLR" but marked up as discussing SLR properties!

3. www.ehow.com - How Single Lens Reflex Cameras Work

Complete guide to single lens reflex camera technology and features...

✅ Uses "single lens reflex" but marked up as equivalent to SLR!

❌ Excluded from results:

www.cheapCameras.com - No markup (markupURL = none)
www.buyItHere.com - No markup (markupURL = none)
All other vendor sites - No semantic markup

🎊 The Results

All vendor sites are GONE!
Only informational pages that match YOUR semantic intent
Pages found even with different terminology
Pages found based on related properties
Much smaller, much more relevant result set
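Once every index entry carries a markupURL, the filtering step is small. The sketch below assumes the "SLR" entries are stored as (url, markupURL) pairs; that layout and the function name are illustrative choices, not part of the slides.

```python
# Hypothetical "SLR" index entries: (url, markupURL); None means no markup.
SLR_ENTRIES = [
    ("www.cheapCameras.com", None),
    ("www.buyItHere.com", None),
    ("www.goodPhoto.com", "mySimpleCamera.owl"),
    ("www.digcamhelp.com", "mySimpleCamera.owl"),
    ("www.ehow.com", "mySimpleCamera.owl"),
]

def semantic_search(entries, vocabulary):
    """Split entries into pages marked up with the chosen vocabulary and the rest."""
    included = [url for url, markup_url in entries if markup_url == vocabulary]
    excluded = [url for url, markup_url in entries if markup_url != vocabulary]
    return included, excluded

included, excluded = semantic_search(SLR_ENTRIES, "mySimpleCamera.owl")
print(included)  # ['www.goodPhoto.com', 'www.digcamhelp.com', 'www.ehow.com']
print(excluded)  # ['www.cheapCameras.com', 'www.buyItHere.com']
```

The results come back in index order here; how a semantic engine should rank them is a separate question that this sketch does not address.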

💬 Discussion (10 min)

Questions to consider:

What are the advantages and disadvantages of this approach?
What challenges do you see in getting page owners to add markup?
How could this help in YOUR field or industry?

Summary & Key Takeaways 🎓

Consolidating our understanding

Traditional Web Search

❌ Keyword matching only
❌ No understanding of meaning
❌ Irrelevant results
❌ Missed relevant pages
❌ Manual filtering required

Semantic Web Search

✓ Understands meaning
✓ Uses structured vocabulary
✓ Makes intelligent inferences
✓ Finds relevant marked-up pages even when their wording differs
✓ Filters automatically

🎯 Essential Components of Semantic Web

Structured Vocabulary: Domain-specific, machine-readable (e.g., mySimpleCamera.owl)
Markup/Metadata: Links pages to vocabulary concepts
Smart Agents/Crawlers: Can understand and reason about content
Inference Engines: Make logical deductions from structured data

📘 Improved Definition of Semantic Web

The Semantic Web is an extension of the current Web. It is constructed by linking current Web pages to a structured data set that indicates the semantics of the linked page. A smart agent, which is able to understand this structured data set, will then be able to conduct intelligent actions and make educated decisions on a global scale.

🔑 The Key to Implementation

The structured data set (common vocabulary) is called an ONTOLOGY.

Languages to build ontologies: RDF, RDFS, OWL

(We'll learn about these in Part 2!)

📝 Quick Knowledge Check

Q1: What is the main limitation of traditional search engines?

A) They are too slow

B) They can only do keyword matching and don't understand meaning

C) They don't index enough pages

Q2: In the Semantic Web, what is the structured vocabulary file commonly called?

A) A metadata schema

B) An ontology

C) A thesaurus

Q3: How does the smart crawler find pages about SLR cameras that use "single lens reflex"?

A) It guesses based on context

B) It uses markup metadata and vocabulary equivalences to infer the relationship

C) It searches for all possible synonyms

🚀 Looking Ahead: Part 2

In the next chapters, we'll learn the technical details:

RDF: Resource Description Framework
RDFS: RDF Schema
OWL: Web Ontology Language
Tools: How to build your own ontologies