Why eXist should be in every digital humanist's toolkit

Erik Simpson

    Chances are that if you’re in the digital humanities, you either use TEI or some other flavor of XML to store all of your data, or your project uses XML in some key areas. If you use XML, then eXist should be in your toolkit. Why? Well, as you already know, XML is a fantastic way to encode and annotate scholarly data and metadata, but without a database to store it, a web server to publish it, or a search engine to analyze it, your project may fall short of its potential. eXist does all of the above: It’s a fast web server, a powerful database, and a full-featured search engine. (To contrast it with other tools used in digital humanities work, eXist isn’t a content management system like Drupal or Omeka, or a digital object repository like Fedora; it’s more of a database and an application server that can be adapted to your project’s needs.) It’s free, built on open standards, and continually improved by the open source community. It runs on Macs, PCs, and Linux and is easy to install; you can install it anywhere from your netbook or laptop to a desktop computer or a dedicated server.

What does eXist really do with your XML? At its core is the following process: You give it your XML files, and eXist happily stores and indexes them; the files immediately become available for search and retrieval. Then you use “queries” to search within the documents, organize them into collections, and analyze, transform, and publish your data. You can limit eXist to being an XML storage facility that your existing web server draws content from, or you can store your entire web application in eXist (CSS, JavaScript, images, and all), and make eXist your project’s website.

While nothing this powerful could be trivial to learn and use, eXist is entirely feasible to dabble in (or even master) for someone with a humanities background. You or your colleagues will need to learn XQuery, a language designed expressly for the purpose of working with XML. But fear not: XQuery is a high-level language that abstracts most of the programming away, and lets you focus on extracting the information you need from your XML. (See below for how to try live examples.) There are excellent resources for learning eXist and XQuery, including a vibrant community of users, many of whom work on humanities applications. In fact, eXist is so flexible and well-suited to the work of the digital humanist that XQuery could be the first and last computer language you’ll ever need to learn. For all these reasons, digital humanists should see eXist as an absolutely essential tool.

One of the most direct ways to get a sense of what functionality and power eXist offers digital humanities projects is to visit eXist’s homepage and browse to eXist’s XQuery Sandbox. The Sandbox contains sample texts (Hamlet, Macbeth, and Romeo & Juliet) and canned queries that you can try, alter, and play with. Find the “Paste Example” drop-down menu, and select the first item: “Simple full text query on the Shakespeare plays.” You’ll see that the query window will populate with the following:

//SPEECH[ft:query(., 'love')]

This query instructs eXist to show all speeches (SPEECH elements) that contain the word “love”—but for now let’s set aside the semantics of the query, and get to the results. Click on the “Send” button. Watch the results of the query stream back to you in the bottom results window. Notice how the word “love” is highlighted in the results to help you see the matching text. (Here’s what the syntax means: //SPEECH asks for all speech elements, and the square-bracketed expression filters or restricts the results to just those that have a match in eXist’s fulltext index for the word “love”. It’s okay not to understand every query now; it’s time to play and experiment.)
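To get a feel for how a query’s result can be shaped, here is a minimal sketch that wraps the same search in a small XML report: it counts the matching speeches and returns only the first five. The summary element and its total attribute are invented for this illustration, and the query assumes the Sandbox’s Shakespeare data with its full-text index in place.

```xquery
(: Collect every SPEECH that matches "love" in the full-text index :)
let $hits := //SPEECH[ft:query(., 'love')]
return
    (: <summary> and @total are hypothetical names chosen for this sketch :)
    <summary total="{ count($hits) }">
        { subsequence($hits, 1, 5) }
    </summary>
```

Changing the last argument of subsequence() returns more or fewer speeches, while the total still reflects every match.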

Let’s experiment! Try changing the word from “love” to another word (say “cold”), and hit “Send” again. Change the word to bird*, and notice how the search now returns hits with “bird,” “birds,” and “bird’s”; the asterisk is a wildcard for the ft:query() function. Now try each of the next few options in the drop-down menu. By the time you see the fourth option, “Show the context of a match,” the real power of XQuery becomes evident: We’re still searching speeches, but now the results of your search show each speech’s scene, act, and play. This is possible because eXist understands the hierarchical structure of XML, and can use that structure to enhance your search results. You can try as many of the queries as you like. Don’t worry, you can’t do anything wrong here, and even if you did, the eXist homepage resets itself every several hours.
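A query that reports the context of each match might look roughly like the sketch below. This is not the Sandbox’s exact text: it assumes the sample plays nest SPEECH inside SCENE, ACT, and PLAY elements that each carry a TITLE child, and the hit, play, act, and scene wrapper elements are made up for the illustration.

```xquery
for $speech in //SPEECH[ft:query(., 'love')]
(: Walk up the hierarchy from each matching speech :)
let $scene := $speech/ancestor::SCENE
let $act := $speech/ancestor::ACT
let $play := $speech/ancestor::PLAY
return
    (: Hypothetical wrapper elements holding the titles found above :)
    <hit>
        <play>{ $play/TITLE/text() }</play>
        <act>{ $act/TITLE/text() }</act>
        <scene>{ $scene/TITLE/text() }</scene>
        { $speech }
    </hit>
```

The ancestor:: steps are what make use of the document hierarchy: each one looks upward from the matching speech to the enclosing element of that name.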

If this demonstration piques your interest and strikes you as having potential for your project, here are 5 steps you can follow to download and install eXist onto your own computer and get working with your own data.

    Download eXist: Go to the eXist homepage, click on the big “Download,” and look for the section entitled “Stable Release.” If you are running Windows, download the version ending in “.exe.” Otherwise, if you’re running Mac or Linux, download the version ending in “.jar.” The file that downloads is the eXist installer.  (Note to Windows or Linux users: Before you can install eXist, you need to download and install the Java JDK.)

    Install eXist: Once the file is downloaded, double-click on it to start the eXist installer. Follow the prompts to select an installation directory on your hard drive, and choose a password (or leave the password blank for now). The default choices that the installer provides you with are all acceptable. (Once you’ve finished installing eXist, if you navigate to the folder where you installed eXist, you’ll see about 50 files and folders. Keep them all for now, and you can mostly ignore them.)

    Start eXist: eXist is different from many applications on your computer, and starting eXist is your first indication of this. When you start eXist, you’ll notice that it’s actually more like a service that runs quietly in the background than an application with its own windows and graphical interface; in fact, you usually interact with eXist through other programs, like your web browser. So let’s get it started. Starting eXist on Windows is pretty straightforward: you’ll find an icon on your desktop called “Start eXist”; double-clicking on this icon will launch a command line window and display a cryptic log of eXist’s startup routine. Just keep this window open. On Linux or Mac, though, you’ll need to open a command line: On Mac, go to Applications > Utilities, and start Terminal. Then use the “cd” command to navigate into the folder where you installed eXist, and type “bin/startup.sh”. You’ll see the cryptic log of eXist’s startup routine, and again, just keep this window open. The contents of this log aren’t important for now, but you should see it advance pretty quickly, until it halts with a message like, “Server has started on ports 8080.” If you see that, you’re golden.

    Take eXist for a spin: Now that eXist is running, you can begin interacting with it through your web browser. Open your web browser to http://localhost:8080/exist/, and you’ll see a page very much like eXist’s homepage. (Note: This link only works when your eXist is running. The “localhost” bit means your own computer, and the 8080 bit is a “port” that eXist runs on by default; if this bothers you, don’t worry, since it’s not hard to change eXist’s configuration so you don’t need to type 8080. For now we’ll stick with 8080.) In fact, the page is identical to eXist’s homepage, since eXist’s homepage is run, naturally enough, on eXist. Now that eXist is running on your own computer, you don’t have to be on the internet to explore eXist. (You’ll never be bored on a train or plane again.) I’d suggest clicking around a bit to get acquainted with eXist: from the homepage, you’ll find a link to the “Main Documentation,” the “Feature Sheet,” and the all-important “Admin” page. The Admin page will ask you for your username (“admin”) and the password you chose during the installation process, and from here you can perform many useful tasks. For example, you can install the example Shakespeare files and the sample Sandbox by clicking on “Examples Setup” and then “Import Files.” If you want to search eXist’s documentation, you can install it by clicking on “Install Documentation” and then “Generate.” Once you’ve installed the examples and the documentation, it’s instructive to click on the “Browse Collections” panel to see the data you’ve just added to the database: the Shakespeare data is in the “shakespeare” collection, and the Sandbox example queries are in the “example.xml” file. The root collection is called “db,” so the full path to this file is “/db/example.xml.”

    Add your own data: eXist really starts to shine when you add your own data to the database and begin writing queries on your data. There are several ways to upload files to the database, but we’ll start with one simple way. From the Admin page (see step 4), click on “Browse Collections.” Let’s create a new collection for your data. In the “New collection:” field near the bottom of the page, enter “mydata”, and click “Create Collection.” Notice that the new “mydata” collection appears in the listing. Click on the “mydata” collection. It’s empty, so let’s add an XML file. Click on “Choose File,” browse to one of your XML files (if you need one, download more Shakespeare), and click on “Upload.” Notice that your file (say, “myfile.xml”) is now in the list of files. You can even upload non-XML files, and while they’re not searchable like XML, eXist happily stores them. Now that your data is in eXist, you can return to the Sandbox and begin querying it. It’s unlikely that your data matches the structure of the Shakespeare data, so you’ll need to experiment with your own queries. (Note that the ft:query() function in the first Sandbox queries above may not work on your data until you’ve added full text indexes to your data; instead, try contains(). To browse all of the functions built into eXist, look under Function Library on eXist’s homepage or on your local copy of eXist.) If you’re ready to turn your Sandbox query into a webpage with its own URL, save the text of your query to a file ending in “.xq” (e.g. “myquery.xq”) and upload it to your collection; then enter, for example, http://localhost:8080/exist/rest/db/mydata/myquery.xq in your browser. If you hit a roadblock, don’t despair. This is a good time to explore online resources for learning XQuery, like the XQuery Wikibook. Priscilla Walmsley’s XQuery (O’Reilly 2007) is a great reference book too. Remember too that you’ve got all of the eXist documentation in your browser, browsable and searchable.
Now is a good time to join the eXist-open mailing list (search or subscribe) for answers to your questions about eXist, and the XQuery-talk mailing list (search or subscribe) for answers to your questions about basic XQuery.
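As a starting point for step 5, here is a minimal sketch of what a first “myquery.xq” could contain. It uses contains() rather than ft:query(), so it should not depend on a full-text index. The search term and the results and hit elements are placeholders invented for this example, and “/db/mydata” is the collection created above. Note that contains() is a case-sensitive substring test, so “love” would also match “lovely” but not “Love”.

```xquery
(: Hypothetical contents of myquery.xq; adjust the term to suit your data :)
let $term := 'love'
(: Every element, at any depth, with a text node containing the term :)
let $hits := collection('/db/mydata')//*[text()[contains(., $term)]]
return
    <results term="{ $term }" count="{ count($hits) }">
    {
        for $hit in $hits
        return
            <hit element="{ name($hit) }">{ normalize-space($hit) }</hit>
    }
    </results>
```

Because the query does not name any particular element, it makes no assumptions about your markup; once you know which elements matter, replacing the * with an element name will narrow the results.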

I hope this helps give you a taste of what eXist could offer your digital humanities project, and whets your appetite for more. Questions? Comments?

(This post was inspired by coffee break and hallway conversations I had at the Chicago Colloquium on Digital Humanities and Computer Sciences 2010 meeting. See the tweets:  #dhcs10.)

Update: This post was migrated from my old posterous.com blog in Dec. 2014, thanks to the Wayback Machine’s copy.

Also, for posterity, I’ve adapted a few comments I was able to retrieve:

    Erik Simpson (December 9, 2010): Thank you for this fine post. As someone just getting into this field, I’m trying to understand the terrain: is eXist an alternative to XSLT, or do they have different functions?

    Joe Wicentowski (December 9, 2010): Great question.  The short answer is that, actually, eXist and XSLT are fundamentally complementary technologies, and it could even be argued that eXist enhances XSLT.  Why?  First, eXist natively supports XSLT, and does so in some very cool, unique ways.  The most straightforward way is that you can use eXist to apply XSLT stylesheets to your TEI (or your favorite flavor of XML) documents. But since all of your XML is stored in the database, you could easily apply a stylesheet to an entire document, a fragment of a document, or an entire collection of documents.  You could also apply the stylesheet to just the brief snippets of a fulltext search result. This is a different model than one in which you “transform all of your documents to HTML and upload them to the web server.”  eXist is the web server and the database, so it is very flexible about letting you keep your XML intact and pull out just the fragments of documents (or entire collections) that you want to work with for a given purpose, on the fly, without “generating the entire website” in advance or “shredding the document” up for a search engine.  This is one advantage of working with a “native XML database” like eXist.

    Another very cool thing about eXist’s XSLT support is that you can use it to create flexible “URL rewriting” rules and pipelines.  You could design a pipeline so that “book1.xml” (which consists of TEI divs with IDs of section1, section2, section3, etc.) is accessible on your website via a nice URL like http://mysite.com/book1/section1.html. This section1.html isn’t a file in a physical directory called book1.  Instead, eXist interprets this URL as asking for (a) the div inside of book1.xml with ID section1, with (b) the “html” XSLT stylesheet applied to the div.  In other words, you can custom craft your URLs to look very clean but in fact apply XSLT stylesheets in some sophisticated ways.  So in these ways, you keep all the power of XSLT (and can continue using XSLTs you already have), and extend it by virtue of the fact that you have an XML database.

    I, for one, began using eXist in just this way: I used the standard TEI community stylesheets that I had customized, and I used URL rewriting to interpret URL requests to retrieve just the desired XML for my needs.

    After some time, as I got more and more comfortable with XQuery (since it’s what you use for the logic of your web sites or web services in eXist), I found myself preferring to work in XQuery rather than XSLT.  So I decided to go 100% pure XQuery, and I re-wrote my XSLT stylesheets as XQuery.  There’s a growing community of folks who use XQuery for their XML transformations; see the last link below.  You might even hear people arguing that one or the other is superior; but in my experience, you can use whichever you’re most comfortable with.  eXist doesn’t really care which you use for your document transformations!

    Final note: the XQuery module used to invoke XSLT stylesheets and apply them to XML documents is the “transform” module.  See http://demo.exist-db.org/exist/functions/transform.  There’s also a good article on using XQuery for your document transformation, much in the way XSLT is used.  See http://en.wikibooks.org/wiki/XQuery/Typeswitch_Transformations.
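
    A minimal sketch of the kind of call described above, applying a stylesheet to a single div rather than a whole document. The document and stylesheet paths are hypothetical, and the sketch assumes TEI P5 markup in which each div carries an xml:id attribute; transform:transform() takes the nodes to transform, the stylesheet, and an optional parameters element (left empty here).

    ```xquery
    import module namespace transform = "http://exist-db.org/xquery/transform";

    declare namespace tei = "http://www.tei-c.org/ns/1.0";

    (: Hypothetical paths: adjust to wherever your files are stored :)
    let $section := doc('/db/mydata/book1.xml')//tei:div[@xml:id = 'section1']
    let $stylesheet := doc('/db/mydata/html.xsl')
    (: Empty sequence: no stylesheet parameters are passed :)
    return transform:transform($section, $stylesheet, ())
    ```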

    Erik Simpson (December 10, 2010): Thanks very much for your detailed and helpful reply. I look forward to doing more with this!

    cortezthekiller (July 12, 2012): I found your post very interesting. I know the basics of some of the XML technologies (XML Schema, XPath, XSLT, XSL-FO) and I would like to start using XQuery with XML databases. eXist looks like the right choice, but it seems to me quite a bit difficult thing to learn … Do you know about some reference books (O’Reilly, etc.) besides the eXist official website?

    [I haven’t been able to locate my response yet.]

    cortezthekiller (July 12, 2012): Thanks for all these references. I’m trying to use XSLT for transforming XML Docbooks into PDF/HTML/ePub. I thought that it would be a good thing to use a database and launch queries for indexing the documents. At first I was thinking to use PHP & MySQL to develop my own web application, but MySQL has very poor XML support. I’ll take it easy and start with eXist & XQuery after the summer.

    Joe Wicentowski (July 13, 2012): eXist-db is great for creating dynamic web applications for searching and transforming DocBook with XSLT. Let us know on exist-open if we can help.

    cortezthekiller (July 14, 2012): I wonder if the Higgs Boson is made of XML, too :) One last thing: do you recommend some IDE for eXist-db? I’ve seen there’s an Eclipse plugin and one IDE named eXide. Thanks a lot.