Comparing traditional and semantic search
Before diving into the technical nuts and bolts of the Semantic Web, let's look at how search engines work in both environments.
Goal: By comparing traditional and semantic search engines, we'll see the clear benefits and identify vital components of the Semantic Web.
Think about your last frustrating search experience:
What made it frustrating? Too many results? Wrong results? Having to try multiple search terms?
Remember searching for "SOAP" in Chapter 1? We got dish detergent, soap operas, and everything except the W3C protocol we wanted.
Today we'll understand WHY this happens and HOW the Semantic Web can fix it.
Understanding the basics
The Web has billions of pages, and search engines (Google, Yahoo, etc.) are how we find information.
Key Fact: All search engines are essentially keyword matchers.
The keyword index is built by special software called a crawler or spider.
Example: Start at www.example.com
Also track how many times each word appears (for ranking)
This continues until resources are exhausted
Google can index about 4-5 billion web pages.
That's only 1% of the entire World Wide Web!
Visualizing the crawling process
This is a depth-first search: the crawler starts at the seed, processes the entire first page, then follows the FIRST link all the way down before coming back to the other links.
Alternative: Breadth-first search would visit all pages at one level before going deeper.
Which search strategy do you think is better: depth-first or breadth-first? Why?
Consider: freshness of content, memory usage, likelihood of finding popular pages
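To make the two visiting orders concrete, here is a minimal sketch that crawls a tiny in-memory link graph instead of the real Web. The graph `LINKS` and its page names are made up for illustration; a real crawler would fetch pages over HTTP and stop when its resources run out.

```python
from collections import deque

# Hypothetical link graph: each page maps to the links found on it, in page order.
LINKS = {
    "seed": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1"],
    "a1": [], "a2": [], "b1": [],
}

def crawl_depth_first(seed, links):
    """Follow the first link all the way down before returning to the others."""
    visited, order = set(), []
    stack = [seed]
    while stack:
        page = stack.pop()
        if page in visited:
            continue
        visited.add(page)
        order.append(page)
        # Push links in reverse so the FIRST link on the page is popped first.
        stack.extend(reversed(links.get(page, [])))
    return order

def crawl_breadth_first(seed, links):
    """Visit every page at one level before going one level deeper."""
    visited, order = {seed}, []
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        order.append(page)
        for link in links.get(page, []):
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return order

print(crawl_depth_first("seed", LINKS))    # ['seed', 'a', 'a1', 'a2', 'b', 'b1']
print(crawl_breadth_first("seed", LINKS))  # ['seed', 'a', 'b', 'a1', 'a2', 'b1']
```

Notice that the only real difference is the container: a stack gives depth-first order, a queue gives breadth-first order.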
Understanding data storage
For each word found, the crawler stores:
| Word | Document Location | Appearances |
|---|---|---|
| camera | www.example.com/page1 | 5 |
| digital | www.example.com/page1 | 3 |
| photography | www.example.com/page1 | 2 |
Combined across all crawled pages, these entries form the index, with one row per word:
| Word | Document Locations |
|---|---|
| camera | page1 (5 times) → page2 (3 times) → page5 (7 times) |
| digital | page1 (3 times) → page3 (4 times) |
| SLR | page2 (8 times) → page4 (2 times) → page6 (5 times) |
Problem: If "SLR" isn't in the index, you get "no results found" - even if relevant pages exist!
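The index tables above can be sketched in a few lines. This is only an illustration under simple assumptions: the two page texts in `PAGES` are invented, words are split on anything that is not a letter or digit, and a real engine would store far more than a count.

```python
import re
from collections import Counter, defaultdict

# Hypothetical page contents, keyed by a short page name.
PAGES = {
    "page1": "Digital camera review: this camera suits photography beginners",
    "page2": "SLR camera guide: why an SLR beats a compact",
}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Map each word to a list of (page, number of appearances)."""
    index = defaultdict(list)
    for page, text in pages.items():
        for word, count in Counter(tokenize(text)).items():
            index[word].append((page, count))
    return dict(index)

def lookup(index, word):
    """Pure keyword match: a word that is not in the index yields nothing."""
    return index.get(word.lower(), [])

index = build_index(PAGES)
print(lookup(index, "camera"))  # [('page1', 2), ('page2', 1)]
print(lookup(index, "SLR"))     # [('page2', 2)]
print(lookup(index, "reflex"))  # []
```

The last lookup is the problem in miniature: the engine has no idea that "reflex" and "SLR" might be related, so it simply finds no match.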
Traditional search limitations exposed
You're an amateur photographer interested in buying a digital SLR camera. You want to learn about their performance characteristics.
Your search term: "SLR"
Best deals on SLR cameras. Shop now for lowest prices on digital SLR...
❌ Vendor site - not what you want
Huge SLR camera sale! Canon, Nikon, Sony SLR cameras in stock...
❌ Another vendor site
Professional SLR cameras and lenses. Free shipping on SLR orders...
❌ More selling, not information
Learn about SLR camera shutter speed, aperture, and performance metrics...
✅ This is what you want! (But you have to scroll through 46 vendor sites first)
Detailed guide to understanding shutter speed and aperture in digital cameras...
⚠️ NEVER FOUND - doesn't use "SLR", says "single lens reflex"
Complete guide to single lens reflex camera technology and features...
⚠️ NEVER FOUND - uses "single lens reflex" instead of "SLR"
The search engine can ONLY match keywords. It cannot understand MEANING.
Building a common vocabulary
Semantics means MEANING. To add meaning to the Web, we need a way to EXPRESS meaning.
Solution: Build a vocabulary with meaning/knowledge coded into its terms.
Equivalence Relations:
SLR has properties:
Now the computer "knows" that:
This is structured knowledge that machines can process!
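As a rough picture of what "knowledge coded into terms" could look like to a program, the sketch below keeps a tiny vocabulary in two plain dictionaries. This is not OWL, just a stand-in: the term names (SLR, SingleLensReflex, ShutterSpeed, Aperture) are the ones used in this lesson, while the dictionary layout and function names are assumptions made for the example.

```python
# Equivalence relations: alternative term -> the term it is equivalent to.
EQUIVALENT = {"SingleLensReflex": "SLR"}

# Properties: concept -> the set of properties that describe it.
PROPERTIES = {"SLR": {"ShutterSpeed", "Aperture"}}

def canonical(term):
    """Resolve a term to the single name we use for its concept."""
    return EQUIVALENT.get(term, term)

def same_concept(term_a, term_b):
    """True when two terms name the same concept in the vocabulary."""
    return canonical(term_a) == canonical(term_b)

def describes(property_name, concept):
    """True when the property is listed for the concept (or an equivalent term)."""
    return property_name in PROPERTIES.get(canonical(concept), set())

print(same_concept("SLR", "SingleLensReflex"))    # True
print(describes("Aperture", "SingleLensReflex"))  # True
print(describes("Price", "SLR"))                  # False
```

The point is that none of these answers come from matching characters in a page. They come from relationships that were written down once, in the vocabulary.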
Connecting pages to vocabulary
Once we have a common vocabulary (let's call it mySimpleCamera.owl), page owners need to explicitly indicate that words on their pages have the same semantics as concepts in the vocabulary.
This file says: "The word SLR on my pages means the same as SLR in mySimpleCamera.owl"
Add a link in the page's metadata section
www.goodPhoto.com markup says:
"The word 'SLR' on my pages = SLR concept in mySimpleCamera.owl"
www.digcamhelp.com markup says:
"'shutter speed' on my pages = ShutterSpeed property in mySimpleCamera.owl"
"'aperture' on my pages = Aperture property in mySimpleCamera.owl"
www.ehow.com markup says:
"'single lens reflex' on my pages = SingleLensReflex in mySimpleCamera.owl"
Vendor sites (cheapCameras.com, buyItHere.com) do NOT mark up their pages because their semantics are about SELLING, not the technical concepts in the photography vocabulary.
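How would a crawler notice that a page has been marked up? One plausible approach is to look for a link element in the page's head section. The sketch below uses Python's standard `html.parser` module; the `rel="meta"` convention, the file name `markup.rdf`, and both sample pages are assumptions chosen for illustration, not a fixed standard.

```python
from html.parser import HTMLParser

class MarkupLinkFinder(HTMLParser):
    """Collects the href of every <link rel="meta"> found inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.markup_links = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "link" and self.in_head:
            attributes = dict(attrs)
            if attributes.get("rel") == "meta" and attributes.get("href"):
                self.markup_links.append(attributes["href"])

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False

def find_markup_links(html):
    """Return the markup files a page points to (empty list if none)."""
    finder = MarkupLinkFinder()
    finder.feed(html)
    return finder.markup_links

# Hypothetical pages: one marked up, one vendor page with no markup link.
MARKED_UP_PAGE = (
    '<html><head><title>Good Photo</title>'
    '<link rel="meta" href="markup.rdf"/></head>'
    '<body>All about SLR performance</body></html>'
)
VENDOR_PAGE = (
    '<html><head><title>Cheap Cameras</title></head>'
    '<body>SLR sale!</body></html>'
)

print(find_markup_links(MARKED_UP_PAGE))  # ['markup.rdf']
print(find_markup_links(VENDOR_PAGE))     # []
```

Both pages contain the word "SLR", but only the first one tells the crawler where to find a statement of what that word means.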
Enhanced crawling with reasoning
Now the crawler can understand and reason about page content!
At www.cheapCameras.com:
At www.goodPhoto.com:
At www.digcamhelp.com (DOESN'T contain "SLR"):
At www.ehow.com (uses "single lens reflex"):
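The reasoning the crawler does at each of these sites can be sketched as follows. Everything here is a simplified stand-in: the vocabulary is two small dictionaries, each page's markup is reduced to "which vocabulary it links to and which concepts it declares", and the function names are invented for the example.

```python
VOCABULARY = "mySimpleCamera.owl"

# Simplified vocabulary contents (assumed layout, terms from this lesson).
EQUIVALENT = {"SingleLensReflex": "SLR"}                    # term -> equivalent term
PROPERTY_OF = {"ShutterSpeed": "SLR", "Aperture": "SLR"}   # property -> concept

# What the crawler found at each site: None means the page has no markup.
MARKUP = {
    "www.cheapCameras.com": None,
    "www.buyItHere.com": None,
    "www.goodPhoto.com": (VOCABULARY, ["SLR"]),
    "www.digcamhelp.com": (VOCABULARY, ["ShutterSpeed", "Aperture"]),
    "www.ehow.com": (VOCABULARY, ["SingleLensReflex"]),
}

def topics(declared_concepts):
    """Expand declared concepts using equivalence and property-of relations."""
    result = set()
    for concept in declared_concepts:
        concept = EQUIVALENT.get(concept, concept)
        result.add(concept)
        if concept in PROPERTY_OF:
            # A page about a property of X is also a page about X.
            result.add(PROPERTY_OF[concept])
    return result

def semantic_search(concept, vocabulary, markup):
    """Return pages whose markup links to the vocabulary and covers the concept."""
    concept = EQUIVALENT.get(concept, concept)
    hits = []
    for page, declared in markup.items():
        if declared is None:
            continue  # no markup: the crawler cannot tell what the page means
        page_vocabulary, concepts = declared
        if page_vocabulary == vocabulary and concept in topics(concepts):
            hits.append(page)
    return hits

print(semantic_search("SLR", VOCABULARY, MARKUP))
# ['www.goodPhoto.com', 'www.digcamhelp.com', 'www.ehow.com']
```

The vendor sites drop out even though they mention "SLR" many times, and the two pages that never say "SLR" are found because the vocabulary connects their terms to it.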
The power of semantic understanding
This tells the engine which MEANING of "SLR" you want
✅ Only returning pages with markup pointing to mySimpleCamera.owl
Learn about SLR camera shutter speed, aperture, and performance metrics...
✅ Contains "SLR" + Marked up with mySimpleCamera.owl
Detailed guide to understanding shutter speed and aperture in digital cameras...
✅ Doesn't contain "SLR" but marked up as discussing SLR properties!
Complete guide to single lens reflex camera technology and features...
✅ Uses "single lens reflex" but marked up as equivalent to SLR!
Questions to consider:
Consolidating our understanding
The Semantic Web is an extension of the current Web. It is constructed by linking current Web pages to a structured data set that indicates the semantics of the linked page. A smart agent, which is able to understand this structured data set, will then be able to conduct intelligent actions and make educated decisions on a global scale.
The structured data set (common vocabulary) is called an ONTOLOGY.
Languages to build ontologies: RDF, RDFS, OWL
(We'll learn about these in Part 2!)
Q1: What is the main limitation of traditional search engines?
Q2: In the Semantic Web, what is the structured vocabulary file commonly called?
Q3: How does the smart crawler find pages about SLR cameras that use "single lens reflex"?
In the next chapters, we'll learn the technical details: