We've learned about semantic markup, but how do we actually USE that markup to build better search engines? This chapter brings everything together!
Semantic markup creates a link between two worlds. But existing applications like Swoogle and FOAF don't fully exploit this link: they focus on one end or the other.
Swoogle searches for Semantic Web documents (ontologies, RDF files). But it completely ignores the current Web, the billions of HTML pages that humans actually read.
FOAF connects personal Web pages to social semantics, but the link doesn't play a vital role. The reasoning power comes from aggregating markup files, not from the Web pages themselves.
The need for moving from the traditional Web to the Semantic Web arose mainly from the inefficiency of keyword-based search engines. Let's return to this root challenge!
Web pages are created for human eyes. The machine-understandable information is just enough to display the page, not to understand it.
The exact same term can have different meanings
"Java": programming language? Coffee? Island?
Different terms can mean exactly the same thing
"D70" = "D-70" = "Nikon D70" = "Nikon-D70"
Any Semantic Web search engine design MUST address both of these difficulties explicitly. Without solving both, we won't see significant improvement!
To design any search engine, we must answer these fundamental questions:
Traditional crawlers start with seed URLs and follow links. But finding semantically enhanced pages (SEPs) this way is like searching for needles in a haystack. We need a more focused approach.
We'll use two types of indexation: Vertical indexation (traversing the ontology hierarchy) and Horizontal indexation (scanning the markup document properties).
Users should just type a few words, with no Semantic Web knowledge required! The interface should feel exactly like traditional search engines.
Results should be HTML pages that humans can read, not RDF documents or ontology files. This is what users expect and need.
Based on empirical research: most users type only 2-3 words and rarely use advanced search options.
Users DO need to specify the domain/ontology, typically via a dropdown menu. This defines the semantic context for the search.
Google indexed 4-5 billion pages (2005), but that's only ~1% of the Web. Finding semantically enhanced pages (SEPs) through normal crawling is simply not practical.
Instead of crawling the Web to find markup, use Swoogle's APIs to find all RDF documents using a specific ontology, then trace back to their Web pages!
Take the domain ontology and use Swoogle's Web services to find all RDF instance documents created using this ontology.
For each RDF document, check whether it is a markup document for some Web page: look for the owl:Ontology tag with a URL label pointing back to the original page.
Download the Web page and index it using the semantic information from the markup document. Repeat until all RDF documents are processed.
Markup documents must include an owl:Ontology tag that points back to the Web page being marked up.
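The discovery steps above can be sketched in Python. This is a minimal sketch under stated assumptions, not the chapter's implementation: it assumes the back-pointer is stored as an rdfs:label child of owl:Ontology (one reading of "URL label"), and it takes already-downloaded RDF/XML text as input rather than calling Swoogle, whose API is not shown in this section. The names `marked_up_page`, `discover_pages`, `SAMPLE`, and the example.org URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

OWL = "http://www.w3.org/2002/07/owl#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"

# Hypothetical markup document; the rdfs:label encoding is an assumption.
SAMPLE = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Ontology rdf:about="">
    <rdfs:label>http://example.org/d70-review.html</rdfs:label>
  </owl:Ontology>
</rdf:RDF>"""


def marked_up_page(rdf_xml):
    """Return the Web page URL a markup document points back to, or None."""
    root = ET.fromstring(rdf_xml)
    for onto in root.iter(f"{{{OWL}}}Ontology"):
        label = onto.find(f"{{{RDFS}}}label")
        if label is not None and label.text:
            url = label.text.strip()
            if url.startswith("http"):
                return url
    return None


def discover_pages(rdf_documents):
    """Map each marked-up page URL to the RDF document that describes it.

    rdf_documents: dict of RDF document URL -> RDF/XML text, standing in
    for whatever the ontology-based lookup and download steps return.
    """
    pages = {}
    for doc_url, rdf_xml in rdf_documents.items():
        page_url = marked_up_page(rdf_xml)
        if page_url is not None:  # skip RDF files that are not page markup
            pages[page_url] = doc_url
    return pages
```

Each entry of the returned dictionary is a page to download and index with the semantic information from its markup document.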
Vertical indexation addresses the synonymy problem (different terms meaning the same thing) by collecting all related class names from the ontology hierarchy.
Parse the RDF markup document to create a collection of all instances described in the document.
Find the class type → Find all superclasses → Find all equivalent classes → Find equivalent classes of superclasses → Index all these class names!
Type: camera:SLR
Superclasses: camera:Digital, camera:Camera
Equivalent: DigitalCamera, SingleLensReflex
All become index keywords!
The D70 Review page is now indexed under multiple conceptually related terms, even though the original page may not contain all these words!
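The walk up the hierarchy can be sketched in Python as follows. This is an illustrative sketch, not the chapter's code: the ontology is hard-coded as two small dictionaries, each class is assumed to have at most one direct superclass, and the pairing of equivalents (SingleLensReflex with SLR, DigitalCamera with Digital) is an assumption about the example above. The names `SUPERCLASS`, `EQUIVALENT`, `vertical_keywords`, and `index_page` are hypothetical.

```python
from collections import defaultdict

# Hypothetical slice of the camera ontology (child -> direct superclass).
SUPERCLASS = {"SLR": "Digital", "Digital": "Camera"}
# Assumed equivalent-class declarations for the same slice.
EQUIVALENT = {"SLR": {"SingleLensReflex"}, "Digital": {"DigitalCamera"}}


def vertical_keywords(class_name, superclass=SUPERCLASS, equivalent=EQUIVALENT):
    """Collect a class, its superclasses, and the equivalents of each."""
    keywords = []
    seen = set()
    current = class_name
    while current is not None and current not in seen:
        seen.add(current)  # guards against cycles in a malformed hierarchy
        keywords.append(current)
        keywords.extend(sorted(equivalent.get(current, ())))
        current = superclass.get(current)  # None once the root is reached
    return keywords


def index_page(index, page_url, class_name):
    """Register the page under every keyword found by the vertical walk."""
    for keyword in vertical_keywords(class_name):
        index[keyword.lower()].add(page_url)


index = defaultdict(set)
index_page(index, "d70-review.html", "SLR")
print(vertical_keywords("SLR"))
# ['SLR', 'SingleLensReflex', 'Digital', 'DigitalCamera', 'Camera']
```

A query for "DigitalCamera" now finds the review page even if that word never appears in its HTML.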
Horizontal indexation extracts keywords from the actual content of the markup document: instance names, labels, property names, and values.
| Step | Action | Example Keywords |
|---|---|---|
| 2.1 | Index instance local name | Nikon-D70, LiyangYu |
| 2.2 | Index rdfs:label values | D-70, Liyang Yu |
| 2.3 | Index property local names | pixel, has_spec, model |
| 2.3.1 | Index string property values | D70 Review, Liyang Yu |
| 2.3.2 | Add nested instances to queue | genid:ARP4420 (Specifications) |
The value "6" for camera:pixel is not indexed because it's non-string (numerical). This prevents meaningless matches.
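The table's steps can be sketched in Python over an in-memory stand-in for the parsed markup document. Everything about the data layout is an assumption made for illustration: instances are plain dictionaries, a nested instance is referenced through a small `Ref` wrapper, and the sample property values (such as the string "D70" for `model`) are hypothetical. A real implementation would read these from an RDF parser.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """Marks a property value that points at another (nested) instance."""
    target: str


# Hypothetical parsed markup document: instance id -> description.
INSTANCES = {
    "Nikon-D70": {
        "local_name": "Nikon-D70",
        "labels": ["D-70"],
        "properties": [("has_spec", Ref("genid:ARP4420"))],
    },
    "genid:ARP4420": {
        "local_name": None,  # anonymous node: nothing to index for step 2.1
        "labels": [],
        "properties": [("model", "D70"), ("pixel", 6)],
    },
}


def horizontal_keywords(instances, start_id):
    """Collect index keywords by scanning instances breadth-first."""
    keywords = []
    queue = deque([start_id])
    visited = set()
    while queue:
        inst_id = queue.popleft()
        if inst_id in visited:
            continue
        visited.add(inst_id)
        inst = instances[inst_id]
        if inst.get("local_name"):
            keywords.append(inst["local_name"])      # step 2.1
        keywords.extend(inst.get("labels", []))      # step 2.2
        for prop, value in inst.get("properties", []):
            keywords.append(prop)                    # step 2.3
            if isinstance(value, Ref):
                queue.append(value.target)           # step 2.3.2
            elif isinstance(value, str):
                keywords.append(value)               # step 2.3.1
            # non-string values (e.g. the number 6) are not indexed
    return keywords


print(horizontal_keywords(INSTANCES, "Nikon-D70"))
# ['Nikon-D70', 'D-70', 'has_spec', 'model', 'D70', 'pixel']
```

The queue plays the role of "add nested instances to queue" in step 2.3.2, so the specification node is scanned after the instance that refers to it.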
This makes the markup document crucial. You must capture ALL important information carefully. If your ontology can't express vital info, choose a different ontology or create your own!
Users must specify the domain/ontology before searching. This defines the semantic context and determines which index table to use.
Camera ontology
Medical ontology
Finance ontology
Edu ontology
Photo pages
Medical pages
Finance pages
Edu pages
"SLR" appears in both Photography and Finance indices, but with completely different meanings (camera type vs. financial instrument). Domain selection ensures the right context!
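A minimal Python sketch of domain selection: one index table per ontology, with the chosen domain deciding which table a query is run against. The page paths, the `INDEXES` layout, and the choice to require every query word to match (AND semantics) are assumptions made for illustration.

```python
# Hypothetical per-domain index tables: keyword -> set of page paths.
INDEXES = {
    "Photography": {
        "slr": {"photo/d70-review.html"},
        "camera": {"photo/d70-review.html", "photo/s3-review.html"},
    },
    "Finance": {
        "slr": {"finance/slr-notes.html"},
    },
}


def search(domain, query, indexes=INDEXES):
    """Look up every query word in the selected domain's table only."""
    table = indexes[domain]
    terms = query.lower().split()
    if not terms:
        return []
    hits = set(table.get(terms[0], set()))
    for term in terms[1:]:
        hits &= table.get(term, set())  # assumed AND semantics
    return sorted(hits)


print(search("Photography", "SLR"))  # ['photo/d70-review.html']
print(search("Finance", "SLR"))      # ['finance/slr-notes.html']
```

The same keyword yields different pages depending on the dropdown choice, which is how the design keeps one term's unrelated meanings apart.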
rdfs:label captures synonyms
Use rdfs:label liberally to add synonyms! There's no limit to how many labels you can use for a single instance or class.
Domain: Photography (Camera Ontology)
Contains "digital" and "Canon" (camera brand)...
Contains "digital" and "camera"...
Contains "digital" in text...
Type: SLR → Digital → Camera (ontology match)
Type: DigitalCamera (equivalent class match)
Type: Camera (superclass match)
Traditional search returns irrelevant pages containing keywords. Semantic search returns pages whose MEANING matches, even without exact keyword matches!
Build both a traditional and semantic search engine on your local machine to truly understand the difference!
Create HTML documents about digital cameras on your local drive. Include both sales pages (prices, vendors) and review pages (performance, specs).
Write a Java crawler to traverse directories, find HTML files, and create a full-text index. Search for "SLR" and you'll see ALL camera pages returned.
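The text suggests Java for this step. To keep the idea compact, here is the same traversal sketched in Python using only the standard library. It is a rough illustration, not a full crawler: the function name `build_fulltext_index` is hypothetical, the tokenizer simply lowercases and splits on non-alphanumeric characters, and any text inside script or style elements is indexed along with the visible text.

```python
import os
import re
from collections import defaultdict
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the character data found between HTML tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def build_fulltext_index(root_dir):
    """Walk root_dir and map each word to the HTML files containing it."""
    index = defaultdict(set)
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            if not name.lower().endswith((".html", ".htm")):
                continue  # only HTML pages are indexed
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                parser = _TextExtractor()
                parser.feed(handle.read())
            text = " ".join(parser.chunks).lower()
            for word in re.findall(r"[a-z0-9]+", text):
                index[word].add(path)
    return index
```

Looking up `index["slr"]` returns every page that mentions the word, sales pages and review pages alike, which is the baseline behavior the semantic version is meant to improve on.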
Mark up the review pages using the camera ontology. Rewrite your crawler to use Jena APIs for vertical and horizontal indexation. Search for "SLR" again, and only the relevant review pages are returned!
Which indexation type addresses the synonymy problem (different terms, same meaning)?
Why must users specify a domain before searching?
The rdfs:label property is powerful for capturing synonyms
This prototype shows how semantic markup can dramatically improve search by making meaning, not just keywords, the basis for matching queries to pages.
Part 4 begins: Semantic Web Services! We'll apply these same principles to automatically discover, invoke, and compose Web services.