Yale Homework 2

Homework 2: Querying DAML data:

We were asked to work with "the collected results of homework assignment 1, augmented, if desired, by hypothetical DAML repositories..." The collected results tend to be heavy on ontology and low on content, so we have to imagine that the ontologies have been fleshed out with a more nearly complete set of data concernin organization charts, bibliographies, software libraries, schedules, etc. I'll make it clear as I go exactly which repositories I'm assuming, and where they will be located.

I'm also going to assume that a given fact is expressed using the same ontology everywhere. There are important issues raised by the fact that this assumption is not going to be true (and in fact these issues were the focus of our DAML proposal), but there are other matters to focus on first.

I'm going to address two major issues: What should the query language look like, and how can we answer queries efficiently?

I'm going to assume that the query language is essentially Prolog expressed in XML/RDF using DAML concepts. I start with the version of RDF I propose in my recent proposal. This version is somewhat more concise and allows me to ignore reification headaches. However, the basic ideas should work fine with the original RDF.

In Prolog, and other deductive query engines, a query is just a statement with logic variables sprinkled among its terms. (For that matter, SQL uses the same idea.) There may be efficiency concerns raised by using this approach; I'll address them below.

To XML-ify Prolog, we simply provide a variant of rdf:Description called rdf:Query. If I write

    <rdf:Query var="x">
        Description stuff
    </rdf:Query>

that's requesting that an entity "x" be found satisfying the description.

Some Queries

For example, to express, "What happened on November 16?" I would write:
```
   <rdf:Query var="v">
      <d:timestamp>2000-11-16</d:timestamp>
   </rdf:Query>
```
(For some reason ninety percent of the homework 1 db's seem to consist of statements about various gensyms occuring on November 16.) I use the prefix rdf: for symbols I expect to find everywhere in RDF, the prefix d: for things specific to DAML, and the prefix x: for things specific to some ontology. The exact boundaries between these are not important at this point.
It may seem overly clumsy to use XML to express queries. If so, we could always use a more concise "interface" language to communicate with knowledge engineers, and translate to XML only when transmitting queries across the internet. Of course, we could do that on the db/ontology side as well. (And we no doubt will very shortly.)

For a more substantial example, we could ask "Where can I download DAML editing tools?"

<rdf:Query var="site" type="d:website">
   <rdf:Assertion>
      <rdf:Exists var="e" type="d:software">
	 <d:downloadable ref="site"/>
	 <x:software_purpose>
	    <x:Editor>
	       <x:target_notation resource=""http://www.daml.org/daml/notation"/>
	    </x:Editor>
	 </x:software_purpose>
      </rdf:Exists>
   </rdf:Assertion>
</rdf:Query>

"How many people are involved in the DAML project for institution Y?"

Let's assume that there is a DAML Class x:Project defined in our hypothetical ontology. The query would be written:

<rdf:Query var="bag">
   <rdf:Query var="n">
      <rdf:Bag_of_all ref="bag">
	 <rdf:Query var="person" op="Exists">
	    <x:Project var="proj" op="Exists">
	       <x:project_site resource="Y"/>
	       <x:sub_project_of resource="http://www.daml.org/daml"/>
	       <rdf:Description about="person">
	          <x:participates_in ref="proj"/>
	       </rdf:Description>
	    </x:Project>
	 </rdf:Query>
      </rdf:Bag_of_all>
      <rdf:Description ref="bag">
	 <x:size ref="n"/>
      </rdf:Description>
   </rdf:Query>
</rdf:Query>

The construct {description} is satisfied by setting vv to a bag of all entities that satisfy the description. In this case that means all persons that participate in a subproject project of DAML located at institution Y.

"What is the maximum distance of any DAML researcher from Washington, D.C.?"

<rdf:Query var="bag">
   <rdf:Query var="d" type="Integer">
       <rdf:Bag_of_all ref="bag">
	  <rdf:Query var="person" op="Exists">
             <x:Project var="proj" op="Exists">
		<x:sub_project_of resource="http://www.daml.org/daml"/>
		<rdf:Description ref="person">
		   <x:participates_in ref="proj"/>
		</rdf:Description>
	     </x:Project>
	  </rdf:Query>
       </rdf:Bag_of_all>
       <rdf:Description about="bag">
	  <rdf:Max>
	      <rdf:Arg>
		 <rdf:Function>
		    <rdf:Term var="p" type="x:Person">
		       <x:dist>
			  <rdf:Arg ref="p"/>
			  <rdf:Arg resource="&foo;WashgintonDC"/>
		       </x:dist>
		    </rdf:Term>
		 </rdf:Function>
	      </rdf:Arg>
	  </rdf:Max>
       </rdf:Description>
   </rdf:Query>
</rdf:Query>

"What DAML researchers from different institutions have collaborated on a paper?"

<rdf:Query var="p1" type="x:Person">
   <rdf:Query var="p2" type="x:Person">   
       <rdf:Description ref="p1">
	    <x:Project var="proj">
	       <x:sub_project_of resource="http://www.daml.org/daml"/>
	       <x:participant ref="p1"/>
	    </x:Project>
       </rdf:Description>
       <rdf:Description ref="p2">
	    <x:Project var="proj">
	       <x:sub_project_of resource="http://www.daml.org/daml"/>
	       <x:participant ref="p2"/>
	    </x:Project>
       </rdf:Description>
       <rdf:Exists var="pap" type="x:Publication">
	  <x:author ref="p1"/>
	  <x:author ref="p2"/>
       </rdf:Exists>
       <rdf:Exists var="i1" type="x:Employer">
	 <rdf:Exists var="i2" type="x:Employer">
	    <rdf:Description ref="p1">
	       <x:employed_by ref="i1"/>
	    </rdf:Description>
	    <rdf:Description ref="p2">
	       <x:employed_by ref="i2"/>
	    </rdf:Description>
	    <rdf:Not>
	       <rdf:Description ref="i1">
		   <rdf:equal ref="i2"/>
	       </rdf:Description>
	    </rdf:Not>
	 </rdf:Exists>
       </rdf:Exists>
   </rdf:Query>
</rdf:Query>

"How many objects are mentioned in exactly one DAML statement in the database at institution Z?"
This is not the same sort of query as the others. Our query language won't work, because we're asking about the syntactic structure of the database itself, not about the world the database describes. (Hey, a job for reification!) Even so, this kind of query could be important, because we would expect a "rich" knowledge base to have a lower percentage of objects that are mentioned in just one place. I'll come back to how we handle these queries below.

Efficiency issues:

Answering queries like those above is basically a matter of finding all the objects that satisfy a part of the query, then filtering out those that don't satisfy the rest. The time required to answer the question can depend strongly on issues like which conjunct we work on first. Rather than talk about such "query optimization" in general, we have to focus on the specific issues raised by the fact that our "database" doesn't exist as an entity in one place, but is rather a distributed collection of statements found on the Web.

Strictly speaking this fact makes the query-handling problem unsolvable. For instance, the query about the researcher who is furthest from Washington can be answered only by finding a list of all the researchers and seeing which is the furthest. (Or possibly by finding an explicit statement that someone is the most remote researchers --- assuming you can trust that statement.)

I'm going to assume that this obstacle won't be unsuperable. We just have to tell the user whenever the query engine "proves" a universal statement by enumerating all the instances it knows about. The resulting nonmonotonicity will mean that the output will be useless for some purposes (such as enforcing a legal contract), but for some purposes having the query engine give its best guess will be much better than nothing.

The problem of answering queries in a web context is not that much different from the problem solved by Yahoo and Google every day, except that we have more precise definitions of what we're looking for. Web search engines operate by walking randomly through the web finding pages and indexing them. Then, when the user asks for some keywords, the indexes allow the URLs of those pages to be retrieved quickly. Instead of keywords, we will be indexing classes and predicates.

Given one of the complex queries above, we have a choice of what to search for first. Let's take the example of finding two researchers from different institutions that have collaborated on a paper. As usual, the query is essentially an existentially quantified conjunction. (Universal quantifiers in queries are more difficult to handle, but rarer.) If we break the conjunction down into its pieces, we encounter searches for several basic patterns:

Find all subprojects of DAML ()
Find all projects with people participating in them.
Find all publications and their authors.
Find all people and their employers.
Find all pairs of constants that are not equal.

Of these, only the first makes any sense as a starting point. One way to determine that would be to actually consult the indexes. The system probably won't even try to construct all the pairs of unequal constants, but it might try for a while to keep track of projects, people, and publications, but eventually realize that these sets are too big to keep track of explicitly. Or it might just use the heuristic that a query must have at least one constant (here, the URI for DAML) to be useful as a starting point.

Unfortunately, the query engine is not likely to keep track of all subprojects of DAML explicitly either. It might have a list of all immediate subprojects, but it will also have a rule:

    <rdf:Forall var="p1" type="x:Project">
       <rdf:Forall var="p2" type="x:Project">
	  <rdf:Forall var="p3" type="x:Project">
	     <rdf:If>
		<rdf:Antecedent>
		   <rdf:And>
		      <rdf:Description ref="p1">
			  <x:sub_project_of ref="p2"/>
		      </rdf:Description>
		      <rdf:Description ref="p2">
			 <x:sub_project_of ref="p3"/>
		      </rdf:Description>
		   </rdf:And>
		</rdf:Antecedent>
		<rdf:Consequent>
		   <rdf:Description ref="p1">
		      <x:sub_project_of ref="p3"/>
		   </rdf:Description>
		</rdf:Consequent>
	     </rdf:If>
	  </rdf:Forall>
       </rdf:Forall>
    </rdf:Forall>

So the system can't even realize how big the set of subprojects of DAML is unless it uses this rule.

I don't think the obstacles here are that daunting. The hardest part is to avoid starting an infinitely expensive search with little return. The query engine must be able to engage in a dialogue with the agent that initiated the query to find out how much he or she (or it) is willing to spend to get the answer.

The only thing difficult about handling the last query above, about how many assertions mention just one statement, is that it is at a syntactic level. Answering queries about the syntax of a collection of DAML statements themselves requires a set of predicates about the structures of such statements and collections. These predicates may be defined in terms of rules that ultimately examine the contents of individual assertions. If someone asks about the collection of all DAML expressions in the world, they must be willing to pay a lot for an unreliable answer. But if the query concerns a particular site or in some other well-defined collection, then enumerating all the expressions in the collection is no different from enumerating all the subprojects of DAML.

Security Issues

We were asked how the system would work if not everyone had access to all the sites involved. The question is puzzling, because it seems to have an obvious answer: It won't work. The only hope is something along the following lines:

Assume that the query engine has the ultimate security clearance, so that the indexes it constructs include access to everything. However, for every piece of data it points to, it records whether there are special security concerns for the data. When it gets a query that causes it to touch the sensitive data, it checks to see if the questioner's security clearance matches the level of security attached to the data. If not, it warns the questioner that the answer has possible incompletenesses that it can't tell the questioner any more about. (Perhaps even that is telling the questioner too much; I am not enough of an expert on security to know.)

This scheme would work in principle, but no one who cares about security will ever agree to it, not without considerably better guarantees about the security of the query engine than are now possible. It's possible that we could apply DAML technology to a special secure net within which you could see all the data normally accessible, plus a whole lot more. In that regime, we wouldn't even allow someone to ask a question (or even touch the network) unless they had a high security clearance).