Semantic Web Search Engine Revisited

Chapter 10: A Prototype System Design

Eng. Dr. Tiroshan Madushanka

Learning Objectives

What you will learn in this chapter

By the end of this lecture, you will be able to:

  • Explain why search engines are central to the Semantic Web vision
  • Identify the fundamental reasons why traditional search engines fail
  • Design a Semantic Web search engine prototype
  • Understand vertical and horizontal indexation strategies
  • Explain how semantic markup improves search performance
  • Implement a basic Semantic Web search system

💭 Opening Thought

We've learned about semantic markup, but how do we actually USE that markup to build better search engines? This chapter brings everything together!

Why Search Engines Again?

Section 10.1: Returning to the Root Challenge

The Connection Problem

Semantic markup creates a link between two worlds: the current Web of human-readable pages and the Semantic Web of machine-readable documents. But existing applications like Swoogle and FOAF don't fully exploit this link; each focuses on one end or the other.

What About Swoogle?

Swoogle searches for Semantic Web documents (ontologies, RDF files). But it completely ignores the current Web: the billions of HTML pages that humans actually read.

What About FOAF?

FOAF connects personal Web pages to social semantics, but the link doesn't play a vital role. The reasoning power comes from aggregating markup files, not from the Web pages themselves.

The Original Motivation

The need for moving from the traditional Web to the Semantic Web arose mainly from the inefficiency of keyword-based search engines. Let's return to this root challenge!

Why Traditional Search Engines Fail

Section 10.2: The Two Fundamental Problems

The Obvious Answer

Web pages are created for human eyes. The machine-understandable information is just enough to display the page, not to understand it.

🔀

Problem 1: Polysemy

The exact same term can have different meanings

"Java" → Programming language? Coffee? Island?

🔄

Problem 2: Synonymy

Different terms can mean exactly the same thing

"D70" = "D-70" = "Nikon D70" = "Nikon-D70"

🎯 Design Goal

Any Semantic Web search engine design MUST address both of these difficulties explicitly. Without solving both, we won't see significant improvement!
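
To make the two problems concrete, here is a minimal sketch of exact keyword matching over three invented pages. The class name, page names, and page texts are assumptions made up for illustration; they are not part of the prototype.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: page names and texts are invented for this sketch.
class KeywordMatchDemo {

    // Returns the names of all pages whose text contains the query as a whole
    // word (case-insensitive), roughly what a plain keyword index does.
    static List<String> search(Map<String, String> pages, String query) {
        List<String> hits = new ArrayList<>();
        String wanted = query.toLowerCase();
        for (Map.Entry<String, String> page : pages.entrySet()) {
            List<String> words = Arrays.asList(page.getValue().toLowerCase().split("\\s+"));
            if (words.contains(wanted)) {
                hits.add(page.getKey());
            }
        }
        return hits;
    }

    static Map<String, String> samplePages() {
        Map<String, String> pages = new LinkedHashMap<>();
        pages.put("coffee.html", "Java coffee beans from Indonesia");
        pages.put("tutorial.html", "Java programming tutorial for beginners");
        pages.put("review.html", "Review of the Nikon D-70 camera");
        return pages;
    }

    public static void main(String[] args) {
        Map<String, String> pages = samplePages();
        // Polysemy: one term matches two unrelated pages.
        System.out.println(search(pages, "Java"));
        // Synonymy: "D70" misses the page that spells the model "D-70".
        System.out.println(search(pages, "D70"));
    }
}
```

The "Java" query returns both the coffee page and the programming page, while the "D70" query returns nothing, even though a relevant review exists.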

Designing the Prototype

Section 10.3: Four Key Questions to Answer

Four Design Questions

To design any search engine, we must answer these fundamental questions:

  • 1. Discovery Strategy: How are documents discovered?

    Traditional crawlers start with seed URLs and follow links. But finding semantically enhanced pages (SEPs) this way is like searching for needles in a haystack. We need a more focused approach.

  • 2. Indexation Strategy: How is the indexation done?

    We'll use two types of indexation: Vertical indexation (traversing the ontology hierarchy) and Horizontal indexation (scanning the markup document properties).

  • 3. Query Processing Strategy: What does the UI look like?

    Users should just type a few words, with no Semantic Web knowledge required! The interface should feel exactly like traditional search engines.

  • 4. Result Screening Strategy: How are results sorted?

    Results should be HTML pages that humans can read, not RDF documents or ontology files. This is what users expect and need.

Query Processing: The User Interface

Section 10.3.1: Keep it Simple!

Two Golden Rules

Empirical research on search behavior points to two rules: most users type only 2-3 words, and they rarely use advanced search options.

❌ Bad: Complex Query

  • Requires Semantic Web knowledge
  • Scares users away
  • Example: [foaf:knows].[foaf:name]~"Liyang Yu"

✓ Good: Simple Query

  • Just type keywords
  • Feels familiar
  • Example: digital SLR camera review

Design Requirements

  • Users must be able to simply type a few words to begin a search
  • Search results should be links to HTML pages, not RDF/OWL documents
  • Users should NOT need to know there's a semantic layer underneath

The One Exception

Users DO need to specify the domain/ontology, typically via a dropdown menu. This defines the semantic context for the search.

Discovery: More Focused Crawling

Section 10.3.2: Finding Semantically Enhanced Pages

The Haystack Problem

Google had indexed roughly 4-5 billion pages by 2005, yet that was estimated to be only ~1% of the Web. Finding the comparatively few semantically enhanced pages (SEPs) through normal crawling is simply not practical.

The Solution: Reverse Discovery

Instead of crawling the Web to find markup, use Swoogle's APIs to find all RDF documents using a specific ontology, then trace back to their Web pages!

1. Query Swoogle

Take the domain ontology and use Swoogle's Web services to find all RDF instance documents created using this ontology.

2. Identify Markup Documents

For each RDF document, check whether it is a markup document for some Web page: look for an owl:Ontology element whose labels include a URL pointing back to the original page.

3. Index the Web Page

Download the Web page and index it using the semantic information from the markup document. Repeat until all RDF documents are processed.
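
The three steps can be sketched as a small loop. This is only an outline under stated assumptions: the two lookup functions are placeholders supplied by the caller, and nothing here models Swoogle's real Web service interface, which these slides do not show.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Sketch of the reverse-discovery loop. Both lookup functions are hypothetical
// stand-ins; the real Swoogle API is not modeled here.
class ReverseDiscovery {

    // findInstanceDocs: ontology URI -> URLs of RDF documents that use it (step 1).
    // findMarkedUpPage: RDF document URL -> the Web page it marks up, if any (step 2).
    // Returns the Web pages that should be downloaded and indexed (step 3).
    static List<String> pagesToIndex(String ontologyUri,
                                     Function<String, List<String>> findInstanceDocs,
                                     Function<String, Optional<String>> findMarkedUpPage) {
        List<String> pages = new ArrayList<>();
        for (String rdfDoc : findInstanceDocs.apply(ontologyUri)) {
            // Documents without a back-pointer are not markup documents; skip them.
            findMarkedUpPage.apply(rdfDoc).ifPresent(pages::add);
        }
        return pages;
    }
}
```

Passing the lookups in as functions keeps the loop testable with in-memory stubs before any network code exists.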

Markup Document Requirements

How the crawler finds the original Web page

The Key Addition

Markup documents must include an owl:Ontology element (note the capital O; XML names are case-sensitive) that points back to the Web page being marked up.

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         ... other namespaces ... >
  <owl:Ontology>
    <rdfs:label>markup</rdfs:label>
    <rdfs:label>http://www.example.com/NikonD70Review.html</rdfs:label>
  </owl:Ontology>
  ... rest of the markup document ...
</rdf:RDF>

How the Crawler Uses This

  • Load the RDF document into memory and parse it
  • Look for the "markup" label in owl:Ontology
  • Extract the URL label pointing to the original Web page
  • If no such labels exist, discard the document: it's not a markup document
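
The check above can be sketched with the JDK's built-in DOM parser. This is a dependency-free illustration, not the prototype's code (the exercise later in this deck uses Jena instead). The class and method names are assumptions, the ontology element is matched case-insensitively, and a label counts as the URL if it starts with http:// or https://, which is also an assumption.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: decides whether an RDF/XML document is a markup document
// and, if so, returns the URL of the Web page it describes.
class MarkupDetector {
    static final String OWL = "http://www.w3.org/2002/07/owl#";
    static final String RDFS = "http://www.w3.org/2000/01/rdf-schema#";

    static Optional<String> markedUpPage(String rdfXml) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed to match owl: and rdfs: by namespace
            doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(rdfXml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Not parseable as XML", e);
        }
        NodeList owlElements = doc.getElementsByTagNameNS(OWL, "*");
        for (int i = 0; i < owlElements.getLength(); i++) {
            Element header = (Element) owlElements.item(i);
            if (!"ontology".equalsIgnoreCase(header.getLocalName())) {
                continue;
            }
            boolean flagged = false;
            String pageUrl = null;
            NodeList labels = header.getElementsByTagNameNS(RDFS, "label");
            for (int j = 0; j < labels.getLength(); j++) {
                String text = labels.item(j).getTextContent().trim();
                if (text.equals("markup")) {
                    flagged = true;   // the "markup" flag label
                } else if (text.startsWith("http://") || text.startsWith("https://")) {
                    pageUrl = text;   // the back-pointer to the original page
                }
            }
            if (flagged && pageUrl != null) {
                return Optional.of(pageUrl);
            }
        }
        return Optional.empty(); // no such labels: not a markup document
    }
}
```

An empty result corresponds to the "discard" case in the last bullet.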

Vertical Indexation

Section 10.3.3.1: Traversing the Ontology Hierarchy

Purpose

Vertical indexation addresses the synonymy problem (different terms meaning the same thing) by collecting all related class names from the ontology hierarchy.

1. Parse & Collect Instances

Parse the RDF markup document to create a collection of all instances described in the document.

2. For Each Instance, Traverse Upward

Find the class type → Find all superclasses → Find all equivalent classes → Find equivalent classes of superclasses → Index all these class names!

Example: Nikon-D70 Instance

Type: camera:SLR
Superclasses: camera:Digital, camera:Camera
Equivalent: DigitalCamera, SingleLensReflex
All become index keywords!
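
The upward traversal for one instance can be sketched as a graph walk. To stay self-contained, the ontology is represented here as two plain maps rather than loaded through an RDF library; the class name, the map representation, and which equivalent class belongs to which class are all assumptions for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of vertical indexation over an in-memory class hierarchy.
class VerticalIndexer {

    // Collects the instance's class plus every class reachable through
    // superclass and equivalent-class links. All of them become index keywords.
    static Set<String> classKeywords(String instanceType,
                                     Map<String, List<String>> superClasses,
                                     Map<String, List<String>> equivalentClasses) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.add(instanceType);
        while (!pending.isEmpty()) {
            String cls = pending.remove();
            if (!seen.add(cls)) {
                continue; // already visited; also guards against cycles
            }
            pending.addAll(superClasses.getOrDefault(cls, List.of()));
            pending.addAll(equivalentClasses.getOrDefault(cls, List.of()));
        }
        return seen;
    }

    public static void main(String[] args) {
        // Assumed hierarchy: SLR is a kind of Digital, Digital is a kind of Camera.
        Map<String, List<String>> sup = Map.of(
                "SLR", List.of("Digital"),
                "Digital", List.of("Camera"));
        // Assumed pairing of the equivalent classes listed on the slide.
        Map<String, List<String>> eq = Map.of(
                "SLR", List.of("SingleLensReflex"),
                "Digital", List.of("DigitalCamera"));
        System.out.println(classKeywords("SLR", sup, eq));
    }
}
```

Because the walk follows equivalent classes of superclasses as well, a page about an SLR also ends up indexed under DigitalCamera.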

Vertical Indexation: Visual Example

Building the index from ontology traversal

Index Table After Vertical Indexation

SLR → D70 Review Page, Canon 300D Page, ...
Camera → D70 Review Page, Point & Shoot Guide, ...
Digital → D70 Review Page, Digital vs Film, ...
Photographer → D70 Review Page, Portrait Tips, ...
Person → D70 Review Page, Author Bio, ...

🔑 Key Insight

The D70 Review page is now indexed under multiple conceptually related terms, even though the original page may not contain all these words!
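
An index table like the one above is an inverted index: each keyword maps to the set of pages filed under it. A minimal sketch follows; the class name is hypothetical, and treating keywords case-insensitively is an assumption the slides do not state.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Sketch of an index table: keyword -> pages indexed under that keyword.
class InvertedIndex {
    // Case-insensitive keys are an assumption made for this sketch.
    private final Map<String, Set<String>> postings =
            new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

    void add(String keyword, String pageUrl) {
        postings.computeIfAbsent(keyword, k -> new LinkedHashSet<>()).add(pageUrl);
    }

    Set<String> lookup(String keyword) {
        return postings.getOrDefault(keyword, Collections.emptySet());
    }
}
```

Adding every class name collected by vertical indexation under the same page is what lets a query for "Camera" reach a page that only ever says "SLR".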

Horizontal Indexation

Section 10.3.3.2: Scanning the Markup Document

Purpose

Horizontal indexation extracts keywords from the actual content of the markup document: instance names, labels, property names, and values.

Step    Action                           Example Keywords
2.1     Index instance local name        Nikon-D70, LiyangYu
2.2     Index rdfs:label values          D-70, Liyang Yu
2.3     Index property local names       pixel, has_spec, model
2.3.1   Index string property values     D70 Review, Liyang Yu
2.3.2   Add nested instances to queue    genid:ARP4420 (Specifications)

Note: Numerical Values Are Ignored

The value "6" for camera:pixel is not indexed because it's non-string (numerical). This prevents meaningless matches.
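
The scanning steps can be sketched over a small in-memory model of an instance. The `Instance` class below is a hypothetical stand-in for what an RDF library would return, and the sample property layout used in the usage note is assumed. Anonymous nested instances (such as the genid node) are modeled with a null local name and contribute no name keyword, which is also an assumption.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of horizontal indexation over a hypothetical in-memory instance model.
class HorizontalIndexer {

    static final class Instance {
        final String localName;      // null for anonymous (blank-node) instances
        final List<String> labels;   // rdfs:label values
        // Property local name -> value: a String, a Number, or a nested Instance.
        final Map<String, Object> properties = new LinkedHashMap<>();

        Instance(String localName, List<String> labels) {
            this.localName = localName;
            this.labels = labels;
        }
    }

    static Set<String> keywords(Instance root) {
        Set<String> out = new LinkedHashSet<>();
        Set<Instance> visited = new HashSet<>();
        Deque<Instance> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            Instance inst = queue.remove();
            if (!visited.add(inst)) {
                continue; // avoid looping on cyclic references
            }
            if (inst.localName != null) {
                out.add(inst.localName);                 // step 2.1
            }
            out.addAll(inst.labels);                     // step 2.2
            for (Map.Entry<String, Object> p : inst.properties.entrySet()) {
                out.add(p.getKey());                     // step 2.3
                Object value = p.getValue();
                if (value instanceof String) {
                    out.add((String) value);             // step 2.3.1
                } else if (value instanceof Instance) {
                    queue.add((Instance) value);         // step 2.3.2
                }
                // Numeric values are deliberately skipped.
            }
        }
        return out;
    }
}
```

For a Nikon-D70 instance whose has_spec property points to an anonymous node with pixel = 6, the sketch indexes "pixel" but never "6".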

The Complete Index

Combining Vertical + Horizontal Indexation

Final Index Table for D70 Review Page

SLR
Camera
Digital
Photographer
Person
Nikon-D70
D70
D-70
Review
Liyang
Yu
Pixel
Specification
Model
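
Note that multi-word values such as "D70 Review" and "Liyang Yu" appear in the final table as separate words, while hyphenated names such as Nikon-D70 and D-70 stay intact. One way to get that behavior, sketched here as an assumption about how the table was produced, is to split on whitespace only:

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Sketch: turn raw keywords into single-word index terms.
class KeywordTokenizer {

    // Splits on whitespace only, so hyphenated names survive as one term.
    static Set<String> tokens(List<String> rawKeywords) {
        Set<String> out = new LinkedHashSet<>();
        for (String raw : rawKeywords) {
            for (String word : raw.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    out.add(word);
                }
            }
        }
        return out;
    }
}
```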

Why NOT Index the Original HTML Page?

  • The semantics are captured in the markup document
  • Unmarked-up info is assumed unrelated to the domain semantics
  • This keeps the index focused and relevant

⚠️ Critical Implication

This makes the markup document crucial. You must capture ALL important information carefully. If your ontology can't express vital info, choose a different ontology or create your own!

Using the Prototype System

Section 10.4: Domain Selection and Search

The One User Requirement

Users must specify the domain/ontology before searching. This defines the semantic context and determines which index table to use.

Multi-Domain Search Engine Architecture

Photography (Camera ontology) → Index₀ (Photo pages)
Medicine (Medical ontology) → Index₁ (Medical pages)
Finance (Finance ontology) → Index₂ (Finance pages)
Education (Edu ontology) → Index₃ (Edu pages)
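
The architecture above amounts to one index table per domain, with the user's domain choice selecting the table. A minimal sketch, with hypothetical class and method names:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Sketch: one keyword index per domain; the selected domain picks the index.
class DomainSearch {
    private final Map<String, Map<String, Set<String>>> indexByDomain = new HashMap<>();

    void add(String domain, String keyword, String pageUrl) {
        indexByDomain.computeIfAbsent(domain, d -> new HashMap<>())
                     .computeIfAbsent(keyword, k -> new LinkedHashSet<>())
                     .add(pageUrl);
    }

    // Only the chosen domain's table is consulted, so the same keyword can
    // live in several domains without the results mixing.
    Set<String> search(String domain, String keyword) {
        return indexByDomain.getOrDefault(domain, Collections.emptyMap())
                            .getOrDefault(keyword, Collections.emptySet());
    }
}
```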

Same Word, Different Contexts

"SLR" appears in both Photography and Finance indices, but with completely different meanings (single-lens reflex camera vs. a finance term such as statutory liquidity ratio). Domain selection ensures the right context!

Why Better Performance?

Section 10.5: Addressing the Two Problems

✓ Solving Problem 1: Polysemy

  • Ontology defines semantics unambiguously
  • Keywords derived from ontology classes/properties
  • User selects domain → context is fixed
  • Same word can exist in multiple indices with different meanings

✓ Solving Problem 2: Synonymy

  • Vertical indexation collects superclasses
  • rdfs:label captures synonyms
  • Equivalent classes are indexed
  • Search can find pages without exact keyword match!

The Power of rdfs:label

Use rdfs:label liberally to add synonyms! There's no limit to how many labels you can use for a single instance or class.

<rdf:Description rdf:about="...#someMap">
  <rdf:type rdf:resource="...#Map"/>
  <rdfs:label>map</rdfs:label>
  <rdfs:label>chart</rdfs:label>
  <rdfs:label>atlas</rdfs:label>
  <rdfs:label>drawing</rdfs:label>
  <rdfs:label>diagram</rdfs:label>
</rdf:Description>
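
To see how such labels become synonyms in the index, here is a sketch that reads every rdfs:label from an RDF/XML fragment using the JDK's DOM parser and maps it to the resource that carries it. The class name is hypothetical, and lower-casing the labels and keeping only one resource per label are simplifications.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Sketch: every rdfs:label becomes a keyword pointing at the labeled resource.
class LabelSynonyms {
    static final String RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String RDFS = "http://www.w3.org/2000/01/rdf-schema#";

    // Returns lower-cased label -> rdf:about URI. If two resources share a
    // label, the later one wins (a simplification for this sketch).
    static Map<String, String> labelToResource(String rdfXml) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(rdfXml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Not parseable as XML", e);
        }
        Map<String, String> out = new LinkedHashMap<>();
        NodeList descriptions = doc.getElementsByTagNameNS(RDF, "Description");
        for (int i = 0; i < descriptions.getLength(); i++) {
            Element description = (Element) descriptions.item(i);
            String about = description.getAttributeNS(RDF, "about");
            NodeList labels = description.getElementsByTagNameNS(RDFS, "label");
            for (int j = 0; j < labels.getLength(); j++) {
                out.put(labels.item(j).getTextContent().trim().toLowerCase(), about);
            }
        }
        return out;
    }
}
```

With the five labels above, a search for "chart" or "atlas" resolves to the same resource as a search for "map".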

Interactive: Search Comparison

Traditional vs. Semantic Search

🔍 Search for "digital camera"

Domain: Photography (Camera Ontology)

Traditional Search

Canon Printer Digital Interface

Contains "digital" and "Canon" (camera brand)...

Digital Marketing Camera Ready

Contains "digital" and "camera"...

Nikon D70 Review

Contains "digital" in text...

Semantic Search

Nikon D70 Review ✓

Type: SLR → Digital → Camera (ontology match)

Canon 300D Guide ✓

Type: DigitalCamera (equivalent class match)

Sony Alpha Tutorial ✓

Type: Camera (superclass match)

💭 Notice the Difference

Traditional search returns irrelevant pages containing keywords. Semantic search returns pages whose MEANING matches, even without exact keyword matches!
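
On the query side, the typed words only need to be looked up in the selected domain's index. The slides do not say how results for several words are combined, so the sketch below makes an assumption: a page matches if it is indexed under at least one query word, and pages matching more words rank first. Index keys are assumed to be lower-case.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of query processing against one domain's index (keyword -> pages).
class QueryProcessor {

    static List<String> search(Map<String, Set<String>> index, String query) {
        // Split the query into distinct lower-case words.
        Set<String> words = new LinkedHashSet<>();
        for (String word : query.trim().toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        // Count how many query words each page is indexed under.
        Map<String, Integer> matchCount = new LinkedHashMap<>();
        for (String word : words) {
            for (String page : index.getOrDefault(word, Collections.emptySet())) {
                matchCount.merge(page, 1, Integer::sum);
            }
        }
        // Most matched words first; the sort is stable, so ties keep discovery order.
        List<String> ranked = new ArrayList<>(matchCount.keySet());
        ranked.sort((a, b) -> matchCount.get(b) - matchCount.get(a));
        return ranked;
    }
}
```

The result is a plain list of page URLs, which matches the requirement that users get HTML links back rather than RDF documents.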

Implementation Exercise

Section 10.6: Build Your Own!

Hands-On Learning

Build both a traditional and semantic search engine on your local machine to truly understand the difference!

1. Preparation

Create HTML documents about digital cameras on your local drive. Include both sales pages (prices, vendors) and review pages (performance, specs).

2. Build Traditional Search Engine

Write a Java crawler to traverse directories, find HTML files, and create a full-text index. Search for "SLR": every page containing the term is returned, sales and review pages alike.

3. Build Semantic Search Engine

Mark up the review pages using the camera ontology. Rewrite your crawler to use Jena APIs for vertical and horizontal indexation. Search for "SLR" again: only the relevant review pages are returned!
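
As a possible starting point for the traditional engine in step 2, the sketch below walks a directory tree, reads every .html file, and builds a full-text index. It uses only the JDK; the class name is hypothetical, the tag stripping is deliberately crude, and a real crawler would want a proper HTML parser and persistent storage.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Sketch of a local full-text crawler: word -> names of HTML files containing it.
class LocalFullTextCrawler {

    static Map<String, Set<String>> index(Path root) {
        Map<String, Set<String>> index = new TreeMap<>();
        try (Stream<Path> paths = Files.walk(root)) {
            List<Path> htmlFiles = paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().toLowerCase().endsWith(".html"))
                    .collect(Collectors.toList());
            for (Path file : htmlFiles) {
                String html = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
                // Crude tag removal, good enough for hand-written test pages.
                String text = html.replaceAll("<[^>]*>", " ").toLowerCase();
                for (String word : text.split("[^a-z0-9]+")) {
                    if (!word.isEmpty()) {
                        index.computeIfAbsent(word, w -> new TreeSet<>())
                             .add(file.getFileName().toString());
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return index;
    }
}
```

Looking up "slr" in the returned map lists every page that mentions the term, which is exactly the behavior the semantic version in step 3 is meant to improve on.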

Tools You'll Need

  • Java: for the crawler and search interface
  • Jena APIs: for parsing RDF and reasoning
  • Database: for storing the index tables
  • Camera Ontology: your domain model

Knowledge Check

Test your understanding

Quiz: Semantic Search Engines

Which indexation type addresses the synonymy problem (different terms, same meaning)?

Why must users specify a domain before searching?

Chapter Summary

Key takeaways from Lecture 10

What We Learned

  • Traditional search fails due to polysemy (same term, different meanings) and synonymy (different terms, same meaning)
  • Focused crawling uses Swoogle to find markup documents, then traces back to Web pages
  • Vertical indexation traverses the ontology hierarchy to collect synonymous class names
  • Horizontal indexation extracts keywords from markup content (instances, labels, properties)
  • Users specify domain/ontology to establish semantic context
  • The rdfs:label property is powerful for capturing synonyms

The Big Picture

This prototype shows how semantic markup can dramatically improve search by making meaning, not just keywords, the basis for matching queries to pages.

📚 What's Next?

Part 4 begins: Semantic Web Services! We'll apply these same principles to automatically discover, invoke, and compose Web services.
