☯ joewiz.org

A place to call home

A preview of XQuery 3.1's JSON support in eXist

18 January 2015
Tags: exist, xquery, json, and api.

In my mid-2013 article, “Living in a JSON and OAuth World”, I discussed the key challenges of talking to web APIs like Twitter and Tumblr from eXist and XQuery, namely:

  1. Authenticating with OAuth 1.0a
  2. Talking to the increasingly JSON-centric world of web APIs

Well, in the time since that post, these tasks have gotten a lot easier. Let’s review what’s changed.

First, OAuth 2.0 did away with the need for cryptographic request signing and intricate parameter ordering (relying on SSL’s inherent security instead), drastically simplifying the task of talking to web APIs. Having only worked with OAuth 1.0a before, I kept thinking that I had to be missing something when I read GitHub’s OAuth API docs this weekend to implement an OAuth client in eXist; I’d authenticated in too few steps, and could submit requests without cryptographic signing? Easily a 50-75% reduction in code and complexity. My, what a boon.
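
To make the contrast concrete, here is roughly what GitHub’s entire token exchange looks like from XQuery. This is only a sketch, assuming the EXPath HTTP Client module is available in your eXist build; the client ID, client secret, and $code values are placeholders you would obtain from GitHub when registering and authorizing your app.

xquery version "3.0";

(: A rough sketch of GitHub's OAuth 2.0 token exchange. Assumes the EXPath
   HTTP Client module is available; the client ID, secret, and $code are
   placeholders. Note: no signature base string, no HMAC, just one POST. :)
import module namespace http="http://expath.org/ns/http-client";

declare variable $local:client-id := "YOUR-CLIENT-ID";
declare variable $local:client-secret := "YOUR-CLIENT-SECRET";

declare function local:exchange-code-for-token($code as xs:string) as item()* {
    http:send-request(
        <http:request method="post" href="https://github.com/login/oauth/access_token">
            <http:header name="Accept" value="application/json"/>
            <http:body media-type="application/x-www-form-urlencoded">{
                "client_id=" || $local:client-id ||
                "&amp;client_secret=" || $local:client-secret ||
                "&amp;code=" || $code
            }</http:body>
        </http:request>
    )
};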

Second, dealing with JSON is about to get far simpler. The xqjson library that I’ve been maintaining has been limping along, buoyed by community contributions and yet plagued by a bug and the limits of its memory-bound performance. It continues to work for the applications I originally selected it for—parsing Twitter and Tumblr posts as seen here—and I’ve invested some time in it, adding a comprehensive test suite. But feed it a JSON file like this 40 MB behemoth (titles and metadata for the 25,000 recently released early English books from the Text Creation Partnership), and on my system, at least, you’ll encounter a Java heap space error. It’s been great, but these limits made me feel uneasy about relying on it for the long term.

For some time I’ve been hearing that the W3C’s XQuery Working Group was working on a better solution to JSON in XQuery 3.1, currently a Candidate Recommendation, with accompanying functions and operators spec. Whereas the xqjson approach was to take all JSON as a string and convert it to XML first before any processing could take place, XQuery 3.1 can natively deal with JSON objects as JSON.

For example, instead of writing code like this to extract the text of tweets from JSON:

xquery version "3.0";

import module namespace xqjson="http://xqilla.sourceforge.net/lib/xqjson";

let $tweets := xqjson:parse-json(util:binary-to-string(util:binary-doc('/db/user_timeline.json')))/item
for $tweet in $tweets
return
    <tweet-text>{$tweet/pair[@name = 'text']/string()}</tweet-text>

… now we can simply write this:

xquery version "3.1";

let $tweets := json-doc('/db/user_timeline.json')
for $tweet in $tweets?*
return
    <tweet-text>{$tweet?text}</tweet-text>

No external module import, fewer function calls, more streamlined syntax, and direct access to JSON objects. XQuery 3.1 is actually chock-full of innovations and labor-saving features, but isn’t a full recommendation yet. What to do?

Luckily, Wolfgang Meier has begun work on a new branch of eXist with support for XQuery 3.1 in its current form. He’s selected the Jackson library to provide the underlying parsing and serialization of JSON. And he’s done a fantastic job adding XQuery 3.1’s JSON facilities to eXist. Remember that 40 MB JSON file that caused xqjson to run out of memory? Wolfgang’s XQuery 3.1 branch of eXist can parse it in half a second. In other tests where xqjson succeeds in parsing the JSON, Wolfgang’s branch beats it hands down.

I’ve posted some code snippets at GitHub illustrating how to put XQuery 3.1’s JSON support to use. The first example, parse-tweets.xq, shows how to parse JSON-encoded tweets, extract key fields, and transform them into XML or serialize the transformation as JSON. The second example, json-util.xqm is a library module containing utility functions for working with JSON—including pretty printing JSON and transforming JSON into the intermediate XML format defined in the XSLT 3.0 spec. With XQuery 3.1, you can take JSON, turn it into XML, create JSON from scratch, sort and manipulate it, and post it to an external API or JSON document store like MongoDB.
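
To give a flavor of what working natively with JSON feels like, here is a small sketch using standard XQuery 3.1 features (maps, arrays, the lookup operator, and map-based serialization parameters), assuming the branch implements them; the document path is the same one used in the example above. It reads a tweet archive, builds a new JSON object from scratch as a map, and serializes the result back out as JSON:

xquery version "3.1";

(: A sketch of round-tripping JSON with standard XQuery 3.1 features; the
   input path is the same one used in the example above. :)
let $tweets := json-doc('/db/user_timeline.json')
(: build a brand new JSON object as a map, with an array of tweet texts :)
let $summary := map {
    "count" : count($tweets?*),
    "texts" : array { $tweets?*?text }
}
return serialize($summary, map { "method" : "json", "indent" : true() })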

Speaking of MongoDB, we can actually add it to this preview of support for the JSON ecosystem in eXist and XQuery. Dannes Wessels’s brand new Mongrel project allows eXist to talk to MongoDB and GridFS. Dannes promises to adapt his implementation to XQuery 3.1 when it’s out. I’ve taken it for a spin, querying MongoDB using Dannes’s excellent Mongrel documentation. MongoDB, via Dannes’s Mongrel project, is a perfect complement to eXist’s XML database—allowing you to store, query, transform, and map-reduce JSON to your heart’s content. Furthermore, GridFS via Mongrel gives you the ability to store (or cache) static or binary files much as one would with Amazon S3, in a scalable, responsive way; this could correspondingly relieve eXist of the burden of storing and retrieving these files.

Both Wolfgang’s branch and Dannes’s project are in active development, and I look forward to seeing them formally released, put to the test, and used in the real world.


eXist: the indispensable guide

28 December 2014
Tags: exist and xquery.

eXist has powered history.state.gov, my office’s public website, since its launch. I discovered eXist seven years ago, in 2007, and I still use it every day in my work. I have watched the software grow both in power and ease of use, and I have taught others how to use it in the fields I work in: digital humanities, scholarly publishing, and government. But in both learning and teaching eXist, I have long lamented that eXist lacked a proper book of its own: a well-conceived and executed on-ramp for new users and a comprehensive guide for practitioners at all levels. Adam Retter and Erik Siegel have given us just that in eXist: A NoSQL Document Database and Application Platform, from O’Reilly.

eXist before eXist

In 2007, the stars were in alignment for beginners like me to learn and exploit eXist. That January, the XQuery programming language had achieved full 1.0 status as a W3C Recommendation, signaling that the language was ready for prime time. This high-level standard, built specifically for querying and manipulating XML, largely abstracted away the underpinnings of computer systems, allowing someone with an understanding of XML and a need to work with it to become proficient—even without the computer science background traditionally needed to make computers do one’s bidding. That June, the standard bearer of technology publishing, O’Reilly, published Priscilla Walmsley’s comprehensive introduction to XQuery. And XQuery was already available in a number of software packages. Among these, the oldest and most mature package, developed with the needs of the TEI and digital humanities communities in mind, was eXist.

With Walmsley’s manual on the XQuery language on my bookshelf, all I needed was a guide for developing XQuery-driven applications with eXist. I needed an on-ramp, a reference for true beginners like me. eXist came chock-full of technical documentation geared toward seasoned software developers, but it was sorely lacking in tutorials or guides for beginners. Besides an interactive tool for testing simple queries against an XML edition of several Shakespeare plays, the closest thing to a tutorial on creating a full website was the source code driving eXist’s own documentation site. But expecting a beginner to learn to program basic database-driven web apps by reading source code has about as much chance of success as a kindergartener learning arithmetic from a calculus textbook.

Despite lacking a book, I and many others have thrived with eXist, thanks to its vibrant and knowledgeable user community. Many of my days at work that first year ended in frustration and a desperate email to the mailing list inquiring about the roadblock I had hit, but by the next morning I had received a reply explaining the solution. Also, building upon our early successes, my colleagues and I secured resources to receive in-person instruction from experts in XML technologies, such as Dan McCreary and C.M. Sperberg-McQueen, as well as the assistance and guidance of eXist’s core developers, including creator Wolfgang Meier, co-founder of eXist Solutions consultancy, which supplies enterprise support and development services. Community members, myself now included, do their best to improve the project’s documentation, write tutorials, lecture, and answer questions on online forums. And eXist’s own documentation and facilities for learning have improved drastically since 2007.

Still, as helpful and vital as these resources were and are, the absence of a book has loomed large. Given a field (in this case, a technology stack) as powerful and potentially complex as eXist, even seasoned practitioners need a reference guide to areas outside their expertise. And beginners need a straightforward text upon which to build the foundation of their knowledge.

The book

Finally, with the publication of Retter and Siegel’s eXist, we have that on-ramp, that practical companion to Walmsley’s XQuery1, to guide you in applying XQuery and XML to develop real-world desktop or web applications, soup-to-nuts. Far from just a beginner’s guide, its ambitious, comprehensive, even encyclopedic coverage of core through advanced aspects of eXist will earn it a lasting space on your bookshelf.

The first chapters walk you through download and installation of the software, offering tips for every major platform eXist supports—Mac, PC, and Linux. They explain how to navigate the built-in documentation and resources, how to get data into and out of eXist, and how to connect eXist to popular tools for XML and XQuery work, such as oXygen. By the end of Chapter 3 (“eXist 101”), you’ll have built a searchable, browse-able website around a collection of Shakespeare plays encoded as XML. Readers can take these lessons and apply them to their own project as a simple proof of concept. On-ramp? Check.

In the remainder of the book, Retter and Siegel methodically survey all aspects of eXist, offering material of both immediate utility and long term reference value. Far from a dry technical catalog, the authors identify the best practices that have emerged from a broad consensus of eXist users. These chapters can be read out of order, as driven by the reader’s needs during a project’s lifecycle. Essential chapters cover how to use eXist’s various indexes to speed queries, how to craft queries for maximum efficiency, and how to configure the server and troubleshoot problems. Another chapter explains how to use eXist’s unix-inspired permissions system to control user access to resources and code, again with compelling examples based on a publishing workflow with disgruntled employees and semi-trusted external partners. Another provides a sober audit of eXist’s attack surfaces—aspects of the software that need to be given special consideration when moving eXist from a desktop system to a public server on the Internet. Throughout, the book provides better diagrams and more comprehensive descriptions of eXist’s internals than eXist’s own documentation, often filling in the gaps where no documentation existed in the first place. If some of these examples sound esoteric, rest assured that at some point when you are using eXist, you will need to use this information yourself or provide it to someone (e.g., a system administrator) who will. It’s all there, along with pointers to additional resources.

As much as I wish this book had been available when I first used eXist, the book I had hoped for in 2007 would certainly have needed a drastic revision by 2014. Many features in eXist have been dramatically improved in the past several years. These include the addition of: a free browser-based XQuery editor called eXide, an Apache Lucene-backed full-text and range indexing system, a tightening and comprehensive upgrade to the security and access control system, two URL rewriting and API design frameworks, the betterFORM and XSLTForms frameworks for building interactive forms, and a data replication system based on Apache ActiveMQ for spreading data across many servers instead of just one (addressing concerns about eXist’s scalability and “single point of failure”). At the same time, eXist has morphed from a rather bloated assortment of community contributions atop the database core, to a streamlined system with modular extensions and a packaging system for libraries and applications. The book treats these additions not as an appendix but as part of the comprehensive introduction and guide to eXist, circa 2014.

Speaking of progress, an important caveat for readers is that the book covers eXist version 2.1, but version 2.2 was released just last month. Unlike some software you may use, whose frequent updates are accompanied by whole number version upgrades, a point release upgrade like this 2.1 to 2.2 is a major event for eXist. The new version remains compatible with the old one, but the new one offers significant refinements and additions. For example, version 2.2 did away with the “old” web-based server administration tool, which the book refers to in several places for key functions, and replaced it with a new application called “Monex,” which offers these functions in addition to many new features and a slick new interface. One of Retter’s own additions to 2.2, support for SetUID and SetGID, overcomes a limitation of the security model in 2.1 and earlier mentioned in the book, but the new feature couldn’t be included in the book. The book is clear that such change is to be expected, but readers may be confused or, perhaps worse, unaware of some of the changes that the book could not take into account. In time, assuming the book is sufficiently successful in the marketplace, the authors will surely account for these changes in a new edition. But in the meantime, readers will have to monitor announcements from the eXist developers about these changes. Such is the price of progress in the open source community, where raw enthusiasm and itch, not just corporate priorities, drive rapid evolution.

eXist is a pleasure to read. The authors write in clear, plain English and employ humor judiciously. The book offers insights into how eXist works and how to get things done in eXist available nowhere else. Complete versions of the code introduced in the book are available for free download on GitHub. The code samples are compelling, not perfunctory, and worth downloading and exploring. The index is comprehensive. The electronic edition (DRM-free when purchased direct from O’Reilly) is exceptionally well-produced, with few typos that I’m sure will be fixed shortly—at least in the electronic editions. O’Reilly offers its typical bundle discount for buying the print and electronic editions together, or, if purchased separately, a discount for “upgrading” from one edition to get the other.

The eXist community owes Adam Retter and Erik Siegel, as well as the book’s contributors and reviewers, a huge debt of gratitude. Writing and producing it surely took thousands of hours of labor (not including the decades of combined experience that shaped the contributors’ perspectives), even though authors of technical publications know their efforts aren’t likely to yield direct returns close to the time invested. Perhaps this book will be different. The book should be assigned in classrooms and digital humanities workshops. It is essential for anyone looking to learn eXist. For many young software developers and scholars, eXist offers a unique and compelling set of capabilities, and this book will help them harness its power to build great things. I look forward to the next seven years of eXist.

  1. Retter and Siegel are clear that their book is not an introduction to XQuery. They recommend that all eXist users still buy Priscilla Walmsley’s XQuery as an introduction to the language—and I second this recommendation. A second edition of XQuery is scheduled to be released in 2015, with coverage of the 3.0 (hopefully even 3.1) version of the language.


A Google Chrome bookmarklet for fixing URLs with DNS errors

16 December 2014
Tags: chrome and bookmarklet.

Bookmarklets are great. They’re just like any bookmark you add to your web browser, except that each contains a snippet of Javascript that runs when you click on it. One of the first bookmarklets I used was for Instapaper; it passes the window’s current URL to the service so it can save the page for you to read later.

Today I had a burning need for a new bookmarklet. I received an email with a link to a URL redirect service that was located behind a corporate firewall, inaccessible from the public internet. The URL was something like this:

http://redirect.company.blah/?url=http://www.google.com/

If you try to open this link, your browser will naturally tell you that “This webpage is not available.” Fixing the URL is easy enough: just chop off everything before the ?url= bit. But doing this over and over struck me as a chore that a bookmarklet could take care of. I’d written a few convenience bookmarklets in the past to switch between local and remote versions of my website. The javascript in these bookmarklets looks something like this one, which replaces the server name in the window’s URL with localhost:8080:

javascript:(function(){window.open('http://localhost:8080'+window.location.pathname);})();

With the corporate intranet URL, though, I ran right into a roadblock using this approach: window.location.href was returning data:text/html,chromewebdata instead of the URL in the location bar. The problem was that Chrome’s DNS error page had usurped the window’s location! It seemed that the original URL was nowhere to be found in the DOM in window.* where I was used to looking.

But then I noticed that the text of Chrome’s error page contained the original URL. While View Source was of no help, I brought up Google’s Developer Tools (right-click on a web page, and select Inspect Element) and dug into the actual source code of the error page. I found the full original URL in a Javascript variable accessible via templateData.summary.failedUrl. Typing this phrase into the Developer Console returned the original URL.

templateData.summary.failedUrl

… returned:

http://redirect.company.blah/?url=http://www.google.com/

Bingo! Now I just needed to strip off everything before the ?url= parameter. Regex to the rescue!

templateData.summary.failedUrl.replace(/.*\?url=/,'')

… returned:

http://www.google.com/

With this in hand, I was able to create my bookmarklet:

javascript:(function(){window.location=templateData.summary.failedUrl.replace(/.*\?url=/,'');})()

If you’d like to use this bookmarklet too, just drag the following link into your bookmark bar:

de-redirect

(Note: Jekyll appears to be escaping some of the characters in this bookmarklet. It still works, but if you avoid %-encoded characters when possible as I do, just replace the bookmarklet’s URL with the sample code above.)


Filling PDF forms with PDFtk, XFDF, and XQuery

13 February 2014
Tags: XQuery, eXist-db, PDF, PDFtk, and XFDF.

It’s hard to believe that PDF is still used for forms these days; they seem so last decade, or even century, compared to what’s possible on the web. I dread the tedium and boredom of filling out a PDF form with tons of fields or creating multiple editions of a form with different information on each.

To illustrate the pain of a PDF form, imagine you have a single blank form—take NARA’s SF702 form. This is a good example, because it contains so many fields. Some of the fields appear to be intended to be filled out with a pen, but it would save time if you could pre-populate the form with the days of each month. But a year’s worth of forms would mean 365 fields. Tab through the form, noting how the cursor jumps through the fields out of order. Entering these forms with a program like Adobe Acrobat is an RSI-inducing nightmare. You’d be tempted to give up and use a pen, which rather defeats the purpose of an electronic form.

What if you could generate and organize your data first and then have a program do the tedious form filling for you? You might know how to generate dates with a spreadsheet or with a programming language like XQuery. But how can you get data onto the form? How can you automate the task of filling in the data? Here’s how.

The key resource that got me started was a very helpful article called “Fill Online PDF Forms” by Sid Steward. The article introduced Sid’s tool, PDFtk—a free command line tool for manipulating PDF files—and showed how to extract information about the fields in a PDF form and use this information to generate a Forms Data Format (FDF) file with your data and apply it to the PDF. I downloaded the installer for my Mac (it works on Windows and Linux too), installed it, and got to work extracting the fields from my PDF form, sf702.pdf, with this Terminal command:

$ pdftk sf702.pdf dump_data_fields > sf702-data-fields.txt

This command tells PDFtk to look for the data fields—the text boxes or other input widgets on the form—from the PDF and “dump” them into a new file, sf702-data-fields.txt.

If you look at the data fields file, you’ll see a series of entries like this, one per data field that PDFtk identified:

---
FieldType: Text
FieldName: form1[0].#subform[0].MONTHYEAR[0]
FieldNameAlt: MONTH/YEAR. Enter 2 digit month and 2 digit year.
FieldFlags: 8388608
FieldJustification: Left
---

Of these five entries per field, the key entry for our purposes is the second one, FieldName. You’ll use FieldName to tell PDFtk which field to insert each piece of your data into. (I’d suggest that you compare the order of the entries in the data fields file with the order of the fields in the original form by pressing the tab key in your PDF viewer to advance from field to field; the order should be the same, which is quite handy for identifying fields whose names aren’t descriptive.)

As Sid’s article explains, once you know the FieldName values for your form, you then need to generate a Forms Data Format (FDF) file that contains the values for each field you want to fill in on the form. An FDF file is just a text file that a program like PDFtk can use to fill your data into the correct field. So what does FDF look like? Sid provided the following example, with two fields (city and state) and two corresponding values to fill in (San Francisco and California):

%FDF-1.2
%aaIO
1 0 obj
<< /FDF << /Fields [ << /T (city)  /V (San Francisco) >>
                     << /T (state) /V (California) >> ] >>
>>
endobj
trailer << /Root 1 0 R >>
%%EOF

Yuck! A nasty, domain-specific format! Sid kindly provided a PHP-based forge_fdf script to ease the pain of creating an FDF file, but I didn’t want to muck with PHP; nothing against PHP, but I wanted to use XQuery, the programming language I happen to know best. I knew I could write a script to generate the calendar and other data, but the sample FDF looked awful, and the thought of generating FDF with XQuery nearly made me give up.

As one final effort, I searched Google for “FDF XML” and was excited to find an XML edition of FDF, called XFDF, the XML Forms Data Format. Instead of needing to generate something like the FDF sample above, I just needed a nice, clean XML document in the XFDF namespace, containing field elements and a simple name-value structure like this:

<xfdf xmlns="http://ns.adobe.com/xfdf/">
    <fields>
        <field name="city">
            <value>San Francisco</value>
        </field>
        <field name="state">
            <value>California</value>
        </field>
    </fields>
</xfdf>

Anyone who has used XML could easily code this by hand, but if you add a little XQuery knowledge, you could dynamically generate fields and values easily. And best of all, Sid’s PDFtk tool is just as happy to take XFDF and plain old FDF and apply it to your form.

So I set about creating my XFDF files. I fired up eXist-db and started its built-in XQuery editor, eXide (live demo). Knowing that I would need calendaring functions for my form, I used eXist-db’s Package Manager to install FunctX, a library of useful functions, including days-in-month(). Since I needed 12 forms, one for each month in the year, I wrote an XQuery that generated 12 separate XFDF documents. You can see my full XQuery for my specific form, but I present a slightly simplified version here:

xquery version "3.0";

import module namespace functx="http://www.functx.com";

let $year := '2014' (: a string, so it can be passed to string-join() and substring() below :)
    (: Generate the months of year as padded, two-digit values, e.g., January > 01 :)
let $months := (1 to 12) ! functx:pad-integer-to-length(., 2) 
for $month in $months
    (: Craft an xs:date for each month to calculate the days in the month :)
let $first-day := xs:date(string-join(($year, $month, '01'), '-')) 
let $days := (1 to functx:days-in-month($first-day)) ! functx:pad-integer-to-length(., 2)
let $month-year := $month || '/' || substring($year, 3, 2)
return
    <xfdf xmlns="http://ns.adobe.com/xfdf/" xml:space="preserve">
        <fields>
            <field name="form1[0].#subform[0].MONTHYEAR[0]">
                <value>{$month-year}</value>
            </field>
            { 
            for $day at $n in $days 
            let $field-name :=
                if ($n le 22) then 
                    'form1[0].#subform[0].Date' || $n || '[0]'
                else 
                    'form1[0].#subform[0].Date' || $n - 22 || '[1]'
            return
                <field name="{$field-name}">
                    <value>{$day}</value>
                </field>
            }
        </fields>
    </xfdf>

This script generates 12 XFDF elements, one for each month of the year, with fields for month-year (e.g., 01/14 for January 2014) and for each day of the month (e.g., 01-22 for the first column of fields containing 22 days of the month, and 23 forward for the second column containing the remaining days of the month). Notice also that I had to go to some effort to generate the correct field names (e.g., form1[0].#subform[0].Date1[0] for the field containing the first day of the month). These were the field names as I found them in the text file of field data that I generated above with PDFtk. Although the field names were complex, at least there was a pattern that I could use to generate them without typing each one.

Once I had my XQuery working correctly, I copied the resulting XFDF documents out of the database onto disk and crafted a bash script to apply the XFDF files onto the form with PDFtk:

$ for FDF in *.xml; do BASENAME=`basename -s .xml $FDF`; pdftk sf702.pdf fill_form "$FDF" output $BASENAME.pdf; done

This takes each XFDF file (e.g., sf702-01.xml with the January form), applies the file to the blank PDF form (sf702.pdf), and saves the filled form as a new PDF using the basename of the XFDF file (sf702-01) plus the .pdf extension to yield the sf702-01.pdf file for January.

It worked like a charm!

To illustrate some other capabilities of PDFtk, I merged the PDFs together to form a single file for convenience:

$ pdftk *.pdf cat output sf702-2014.pdf

And I “flattened” the form data to save space (over 60% in my case), which also makes the form no longer editable:

$ pdftk sf702-2014.pdf output sf702-2014-flattened.pdf flatten

Jekyll Blogging in the Browser with prose.io

20 August 2013
Tags: writing and jekyll.

In my continuing quest to find good tools for writing for this Jekyll- and Github Pages-based blog (see my last post about writing from the iPhone), let’s turn to the browser.

Imagine this scenario: You’re sitting at a computer and want to write a new post or edit an existing one. You could use GitHub’s web-based editor. It’s quite good. But one look at how GitHub previews the Markdown source of that same last post shows that its Markdown parser doesn’t cope well with Jekyll’s YAML metadata (which holds items like the post’s title, date, etc.). GitHub’s editor succeeds at letting you edit and commit your changes directly to GitHub, but falls short when it comes to Jekyll Markdown files.

Enter prose, a highly polished and beautifully designed web-based GitHub- and Jekyll-aware text editor. While you can grab the source and write an entire CMS around it (as prose.io developers Development Seed did rather radically for https://www.healthcare.gov/), you can also use the prose.io site itself to edit files in any of your GitHub repositories. In fact, right off the bat when visiting the site, you’re prompted to authenticate via GitHub’s OAuth interface. Once authenticated, you’ll see your list of repositories and can select a file to edit or create a new file.

As you would expect, Prose has a nice Markdown-aware editor and preview tool and lets you commit your changes directly to GitHub, but the Jekyll-aware functions lie dormant until you choose to edit a Jekyll post or create a new file in a Jekyll repository. A new icon appears in the editor’s sidebar: Raw Metadata. By default this is a dumb text area with YAML syntax highlighting. But it’s actually quite configurable to display just those fields you want to see, like Title, Tags, etc. When creating a new file, prose.io also prepopulates the filename field with the conventional Jekyll date prefix, yyyy-mm-dd. A nice touch.

I first heard about prose.io on episode 54 of The Web Ahead podcast. I discussed the episode in my first post about Jekyll, and it’s worth a listen to learn about Jekyll in general. But the guests on this episode also happened to be Dave Cole and Young Han from Development Seed, the crew that created prose.io. If you want to listen to this section of the episode, the discussion about prose.io starts at 45:20.

Were I to start my site again, I’d probably base it off the prose.io starter project. It includes prose-specific configuration - meaning you don’t have to add this configuration after the fact.

One caveat about prose: If you have started a new file, you will lose your work if you click on the prose.io icon to browse the About link or hit the browser’s back button. Ask me how I know. This is the second time I’ve typed this article in prose.io. So be sure to click on the Save icon. I’ve chimed in on an existing issue about this.


Mobile Blogging with Jekyll

18 August 2013
Tags: writing and jekyll.

One downside of moving to Jekyll (as I recently did with this blog) is that you leave the vibrant ecosystem of mobile blogging apps. Tumblr’s app, for example, let me get thoughts down with my one free hand and post them right away, or edit old posts with ease.

But mourn no more, Jekyll bloggers! Octopage is an iPhone app that lets you edit existing posts and create new ones on your GitHub Pages-based Jekyll site. (Despite the Octo- prefix, it works fine with plain old Jekyll and doesn’t presuppose Octopress.) Start the app, enter your GitHub credentials, select your GitHub Pages repository, and you’re ready to begin editing existing posts or creating a new one. Octopage even prepopulates new posts with the requisite YAML front matter for Jekyll posts and has several settings for code block style and autocapitalization that create the right environment for your editing. A built-in Markdown preview and syntax guide are nice touches for people like me who are still learning Markdown.

A few nits: I would prefer that Octopage use Github’s official OAuth API for authentication. Entering my username and password into an app feels so 2010. I encountered some display glitches that left me with only 2 lines visible. Making the app universal and thus pleasant to use on the iPad would be great.

Ultimately, I hope more apps add native Github Pages and Jekyll publishing abilities. We need a more vibrant ecosystem for Jekyll. I really hope that the fantastic new Editorial app for iPad will rise to the occasion and add these capabilities. Or perhaps Octopage’s creator will integrate directly with Editorial, something Editorial was built for.

But for now, Octopage gets the job done and does so quite nicely. To prove it, I wrote this entire post in Octopage — with my left thumb while my daughter napped in my lap.


Goodbye Tumblr, hello Github

23 July 2013
Tags: tumblr, github, jekyll, and writing.

I’ve moved my blog from Tumblr to Github Pages. While Tumblr is a capable blogging platform, I had become frustrated with it. I began longing for a way to write more simply and yet with more rigor — with simple, plain text.

The flip side of my appreciation for sophisticated markup vocabularies like TEI is my love of its plain text underpinnings: clean, precise, portable. The plain text-based Markdown format (explained nicely and briefly by Brett Terpstra) is perfect for writing, especially for the web. Github is hardly alone in supporting the increasingly pervasive Markdown, but Github’s Markdown support is solid. That said, there’s more to Github Pages than Markdown support.

As Konrad Lawson has written, Github is a promising platform for writing and collaborating on texts of all kinds, from prose to syllabi. I’m not new to Github but hadn’t considered this before reading Konrad’s article. Until now I’ve primarily used Github to house my own XQuery, eXist-db, and XML projects and code snippets (or, in Github terminology, gists) and to contribute to other projects. I’ve come to appreciate its rich version control tools and how it facilitates collaboration. Github Pages brings one more possibility to those Konrad mentioned: your work can be published as public web pages, in whatever form you choose: blog, tutorial, syllabus, book.

I’m particularly excited about using Github Pages for my own articles that involve XQuery or XML code, since it does a great job adding helpful “syntax highlighting” to code and allows me to embed my gist code snippets directly in a post. It was cumbersome and time-consuming to get code to look decent on Tumblr (or other sites I’ve tried to post code on before), meaning that I spent time on appearance that could’ve been spent on substance. As you’ll see in the old posts I’ve adapted on this site, Github Pages nails the challenge of XQuery syntax highlighting—quite a feat—and, best of all, required no extra work on my part. It’s no coincidence that I began to investigate Github Pages at the time I began to contemplate a future series of posts involving lots of code.

Before I took the plunge, I decided to search for podcasts to hear informed discussion about Github Pages and Jekyll, the static site generator that undergirds it. I was lucky to find a very recent episode of a show called The Web Ahead that focused on Github Pages and Jekyll.1 Convinced that the pros outweighed the cons, I spent a few hours over the course of a week tinkering with Jekyll, installing it locally, getting a feel for the system, and importing my old articles.

I’m still getting situated here, so please subscribe to this site’s feed with your feed reader of choice. I’ll also post links to new articles on Twitter (I’m @joewiz). Comments are always welcome.

Finally, I’ve posted the full source for this blog on Github, where you can also track its short history so far. You can even see the plain text source file for this article - written, of course, in Markdown.

  1. I guess I’m not the first to have the impression that the host, Jen Simmons, sounds strikingly like Terry Gross.


XQuery's missing third function

06 July 2013
Tags: xquery, search, text mining, and dh.

Even if you don’t know XQuery, if you’ve only heard of it, you know XQuery is a language built for querying, or searching, XML. And if you’ve started learning XQuery for a digital humanities project, you use XQuery because it helps you search the text that you’ve put inside your XML. Given your interest in searching text, it’s likely that the first function you learn in an XQuery class or tutorial is contains(). Take this simple XPath expression:

//book[contains(title, "arm")]

This economical little query searches through all the books in your XML for those with a title that contains the phrase “arm” — you know, all of your books about armadillo shwarma. Tasty, right?

Then in the next lesson you learn about the matches() function, which can do the same simple searches as contains() but can also handle patterns, expressed using a pattern matching language called regular expressions:

//book[matches(title, "[A-Za-z]\d{2}T!")]

This finds titles like “W00T!” and “L33T!” — an upper- or lowercase letter, two digits, a capital T, and an exclamation mark. Slick!

Then, naturally, you learn the highlight-matches() function, which highlights the phrase or pattern that you searched for:

//book[highlight-matches(title, "[PT]ickl[a-z]+")]

This highlights the matching portion of the book titles: “The Art of Pickling” and “How to Tickle the Ivories like a Pro.” Super!

But wait! The highlight-matches() function never appears in your lessons or class materials. It’s not in the XQuery spec. Not the 1.0 version, not the 3.0 version. Surely, this must be a mistake. No, your teacher says. You google for it. You click through the links. Stuff about indexes? Proprietary functions? The disappointment sets in. Really? No standard way to highlight the search results?

This was my experience, lasting several years, until today. I realized that I could combine two features of XQuery 3.0 — the analyze-string() function and higher order functions — to write a simple, implementation-independent highlight-matches() function, allowing us to write queries like this:

local:highlight-matches(
    //book/title,
    "[PT]ickl[a-z]+",
    function($word) { <b>{$word}</b> }
    )

To make this easier to read, I’ve split the expression onto several lines. Here’s what’s going on:

  1. This should look pretty familiar: we return all the pickling and tickling titles. But instead of applying contains() or matches(), we use local:highlight-matches(). And instead of putting the function inside a predicate (i.e., [inside square brackets]), we put the function outside. This is because our function doesn’t merely serve as a condition (i.e., return all books whose title matches); it actually creates an in-memory copy of the nodes that meet the condition with the highlight function applied.
  2. Whereas we gave contains() and matches() two parameters (title and the phrase/pattern), we pass local:highlight-matches() a third parameter: a function. You may not have ever used a function as a parameter, but this is a perfectly valid thing to do in XQuery 3.0. It’s the idea of “higher order functions” - or functions that can take other functions. The advantage of letting you define your own highlighting function is that you might not want to highlight with <b> tags. Rather, you might want to surround matches with a <person key="#smith_michael"> tag. In other words, you might use highlight-matches() to do more than “highlight.”
  3. Submit the query. The local:highlight-matches() function finds the matching books and works its magic, returning the titles with the properly bolded phrases <b>Pickling</b> and <b>Tickle</b>.

But wait! If this highlight-matches() function isn’t part of the XQuery specification, where can you get it?

I’ve posted the source code to highlight-matches() as a GitHub gist. Copy and paste the code into any XQuery 3.0-compliant engine. Need one? Try eXist-db, which has a handy online sandbox called eXide that you can access from any web browser.
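
If you just want the shape of the approach before diving in, here is a bare-bones sketch, simplified from (and not identical to) the posted code, showing how analyze-string() and a caller-supplied highlighting function fit together:

xquery version "3.0";

(: A bare-bones sketch of the idea, not the gist verbatim: analyze-string()
   splits each element's text into match and non-match segments, and the
   caller-supplied $highlight function wraps the matches. :)
declare function local:highlight-matches(
    $nodes as element()*,
    $pattern as xs:string,
    $highlight as function(xs:string) as item()*
) as element()* {
    for $node in $nodes
    return
        element { node-name($node) } {
            $node/@*,
            for $segment in analyze-string(string($node), $pattern)/*
            return
                if ($segment instance of element(fn:match))
                then $highlight(string($segment))
                else text { string($segment) }
        }
};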

Once you get the sample code working, try writing your own highlighting function to return underlined text or text with a yellow background—or find instances of people whose names you know appear in the text and tag them using <person> or proper TEI.

And enjoy!

(And if you’re one of the people who figured this out long ago, or as soon as XQuery 3.0 came along — which admittedly is still in draft form, but whose higher order functions and analyze-string() function, the features that made this possible, have been in place for some time now — please take a look at the code and add some comments or submit a pull request. Let’s ensure everyone learns this function right after contains() and matches(), okay?)


Living in an OAuth & JSON World

04 July 2013
Tags: xquery, oauth, apis, json, expath, gov20, twitter, socialmedia, existdb, and opensource.

Another day, another gist. Today’s was prompted by a question on the eXist-db mailing list about how to access OAuth-based services like the Google API with XQuery. I happened to have just been working on accessing the OAuth-based Twitter v1.1 API for the new social media section of my office’s homepage, so I posted the code and some pointers. Like the gist I posted yesterday, I hope others can use these bits of XQuery code.

But there’s a back story and, dare I say, some illustrative lessons, to this latest addition to my series of posts and gists on XQuery.

Until recently, writing a program to retrieve one’s latest tweets was as simple as going to the Twitter homepage: you just made a basic, unauthenticated HTTP request to Twitter’s servers to get the data you needed. But with version 1.1 of Twitter’s API, Twitter announced a new requirement - that all requests to its API be signed and authenticated using the OAuth 1.0a protocol. This complicated the task of getting data from Twitter exponentially. The OAuth protocol, while not rocket science, requires one to jump through a rather intricate sequence of steps to compose the parameters of your request, and then cryptographically sign the request with a hashing function. (I’m not complaining about the protocol; it does a great job providing an authentication layer to the web. I’m just saying that requiring OAuth to retrieve tweets imposes a pretty heavy burden on users and developers.) If that weren’t enough, Twitter also ended support for the XML-based Atom format, leaving JSON as the only format in which it returned results. That left me with two problems.

First, XQuery’s rich function library does not include the HMAC-SHA1 cryptographic hashing algorithm needed to sign OAuth requests. So I turned to Claudius Teodorescu, who applied his considerable Java skills to the task of creating an HMAC-SHA1 function for eXist-db, the XQuery-based server that powers history.state.gov. We took it a step further, releasing Claudius’s work to the EXPath community in the form of a specification: the EXPath Crypto Module. The EXPath community builds up common standards for XPath and XQuery implementations. Claudius also released his module as an EXPath package for eXist-db, which is now available in the eXist-db Public Package Repository for anyone to download and install (to do so, go to eXist-db’s Dashboard, click on Package Manager, and find “EXPath Cryptographic Module Implementation” in the list of packages). Look at the prolog of the OAuth module I posted in today’s gist, and you’ll see that it imports Claudius’s module.
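
To give a sense of just how fiddly that intricate sequence of steps is, here is a rough sketch (not my production module) of assembling the OAuth 1.0a signature base string, the string that then gets signed with the crypto module’s HMAC-SHA1 function:

xquery version "3.0";

(: A rough sketch, not my production module: OAuth 1.0a requires
   percent-encoding every parameter name and value, sorting the pairs,
   joining them, and then signing the resulting base string with HMAC-SHA1
   (via the EXPath Crypto Module in my case). Parameters are passed here as
   simple <param name="..." value="..."/> elements for illustration. :)
declare function local:signature-base-string(
    $method as xs:string,
    $url as xs:string,
    $params as element(param)*
) as xs:string {
    let $normalized :=
        string-join(
            for $param in $params
            order by string($param/@name)
            return encode-for-uri($param/@name) || '=' || encode-for-uri($param/@value),
            '&amp;')
    return
        string-join(
            (upper-case($method), encode-for-uri($url), encode-for-uri($normalized)),
            '&amp;')
};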

So I was able to check OAuth off my list of problems.

But besides handling OAuth, I also needed a way to deal with JSON. JSON is an increasingly ubiquitous data format in the world of APIs, but its data model is subtly incompatible with XML, and XML-based software like eXist-db has a difficult time ingesting or searching JSON data. Luckily, there were a number of XQuery libraries for me to choose from, and I decided to use one that John Snelson wrote for XQilla. With his permission, I updated it a bit, using some new features in XQuery 3.0 to make his library implementation-independent, and released the updated library on GitHub. Thanks to GitHub’s mechanisms for code contributions (“pull requests”), the library has already received several improvements from the community. The package is also available in eXist-db’s public app repository and the CXAN package repository. (I’m also eagerly following the JSONiq project, which is working on extending XQuery to deal natively with JSON, obviating the need to convert JSON to XML to deal with it.)

So I was able to check JSON off my list of problems.

This paved the way for yesterday’s addition of social media links to the homepage of history.state.gov, and coming soon, a complete, searchable social media archive.

All in all, a story — albeit not unique — of open source communities working together to build solutions to common challenges.

For more on social media archives in government — the ultimate objective beyond the immediate goal of displaying our latest tweets on our homepage — see NextGov’s aptly titled article, Saving Government Tweets is Tougher Than You Think.


Trimming text without cutting off words, using XQuery

03 July 2013
Tags: XQuery.

A new Github gist:

Helps trim phrases of arbitrary length to a maximum length, without cutting off words, as the substring() function would inevitably do, or ending on unwanted words

I wrote this to handle Tumblr photo posts, which have no explicit title, only a caption, and in the cases I’ve seen, “captions” are actually full-length posts.  I needed to trim these captions to a maximum length - e.g., 140 characters - without cutting off words or ending on unwanted words - e.g., the.
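
The heart of the gist boils down to something like the following sketch (the function name, signature, and stop-word list here are illustrative rather than the gist’s exact code): keep the longest whole-word prefix that fits within the limit, then drop a trailing unwanted word.

xquery version "3.0";

(: An illustrative sketch, not the gist verbatim: keep the longest whole-word
   prefix that fits within $max characters, then drop a trailing unwanted word. :)
declare function local:trim-phrase(
    $text as xs:string,
    $max as xs:integer,
    $unwanted as xs:string*
) as xs:string {
    let $words := tokenize(normalize-space($text), ' ')
    let $prefixes :=
        for $n in 1 to count($words)
        let $candidate := string-join(subsequence($words, 1, $n), ' ')
        where string-length($candidate) le $max
        return $candidate
    let $best := ($prefixes[last()], '')[1]
    let $best-words := tokenize($best, ' ')
    return
        if ($best-words[last()] = $unwanted)
        then string-join($best-words[position() lt last()], ' ')
        else $best
};

local:trim-phrase('The quick brown fox jumps over the lazy dog', 20, ('the', 'a', 'an'))
(: returns "The quick brown fox" :)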


One paragraph, many sentences

29 June 2013
Tags: xquery, nlp, and exist-db.

Where does each sentence in this post start and end? Given some schooling and well-punctuated text, our brains handle this task pretty easily, but it turns out that telling a computer how to split a text into sentences is a bit tricky. In modern English we have a general rule: sentences begin with a capitalized word and end with a period. But there are plenty of exceptions to account for in writing a program to isolate sentences: other words in the sentence might be capitalized, and abbreviations can contain or end with periods, whether they’re at the end of a sentence or not.

For some time, I’ve wondered about how to find the start and end of sentences, but couldn’t ever devise an approach that worked. Then, recently, inspired by my friend Josh’s comment during a course we were taking on XQuery (“Could we use XQuery to pull out all topic sentences in a manuscript to help ensure the narrative flows logically and smoothly?”), I decided to return to the challenge. After some research I found a hint in this post on stackoverflow, which unlocked a core insight: if you look at the individual words in the text, one at a time and in order, you can look for signs of a sentence break, and then apply logic against surrounding words and account for known exceptions to the rule, such as abbreviations or stock phrases. Proceeding this way through a text, you can isolate each sentence.

So, should you ever have the need for splitting text into sentences—perhaps for looking at all topic sentences in a chapter, or for counting sentences in a paragraph—check out https://gist.github.com/joewiz/5889711. It’s a pair of XQuery functions for analyzing a chunk of text and identifying the sentences within. It’s a naive approach (see my notes at the top of that page), but it does a pretty good job with newspaper articles and other edited English prose.

It takes text like this paragraph (from FRUS):

154613. You should arrange to deliver following note to North Vietnamese Embassy. If in your opinion it can be done without creating an issue, we would prefer that you ask North Vietnamese Charge to come to your Embassy to receive note. “The U.S. Government agrees with the statement of the Government of the DRV, in its note of April 27, that it is necessary for Hanoi and Washington to engage in conversations promptly. The U.S. Government notes that the DRV has now agreed that representatives of the two countries should hold private discussions for the sole purpose of agreeing on a location and date. The U.S. Government notes that the DRV did not respond to its suggestion of April 23 that we meet for this limited purpose in a ‘capital not previously considered by either side.’ The U.S. Government suggested the DRV might wish to indicate three appropriate locations suitable for this limited purpose. The U.S. Government does not consider that the suggestion of Warsaw is responsive or acceptable. The U.S. Government is prepared for these limited discussions on April 30 or several days thereafter. The U.S. Government would welcome the prompt response of the DRV to this suggestion.”

and returns this:

  1. 154613.
  2. You should arrange to deliver following note to North Vietnamese Embassy.
  3. If in your opinion it can be done without creating an issue, we would prefer that you ask North Vietnamese Charge to come to your Embassy to receive note.
  4. “The U.S. Government agrees with the statement of the Government of the DRV, in its note of April 27, that it is necessary for Hanoi and Washington to engage in conversations promptly.
  5. The U.S. Government notes that the DRV has now agreed that representatives of the two countries should hold private discussions for the sole purpose of agreeing on a location and date.
  6. The U.S. Government notes that the DRV did not respond to its suggestion of April 23 that we meet for this limited purpose in a ‘capital not previously considered by either side.’
  7. The U.S. Government suggested the DRV might wish to indicate three appropriate locations suitable for this limited purpose.
  8. The U.S. Government does not consider that the suggestion of Warsaw is responsive or acceptable.
  9. The U.S. Government is prepared for these limited discussions on April 30 or several days thereafter.
  10. The U.S. Government would welcome the prompt response of the DRV to this suggestion.”

I chose this text because of the many capitalized words and abbreviations throughout and the variations in punctuation. I also tested against several New York Times and Boston Globe articles, a tricky portion from Moby Dick that threw off some other utilities, and some made up sentences with edge cases.

If you want to give it a try with your own text, you can copy the entire gist and paste it into eXide, the XQuery sandbox for eXist-db; click “Run” to see the results. (Should work with any XQuery implementation though.)
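
For a sense of the mechanics, here is a heavily simplified sketch of the word-by-word approach; the real gist handles quotation marks, initials, and many more exceptions than the illustrative abbreviation list below:

xquery version "3.0";

(: A heavily simplified sketch of the word-by-word approach. The abbreviation
   list is illustrative only; the real gist covers far more cases. :)
declare variable $local:abbreviations := ('Mr.', 'Mrs.', 'Dr.', 'U.S.', 'etc.', 'e.g.', 'i.e.');

declare function local:split-sentences($text as xs:string) as xs:string* {
    local:walk(tokenize(normalize-space($text), ' '), (), ())
};

declare function local:walk(
    $remaining as xs:string*,
    $current as xs:string*,
    $sentences as xs:string*
) as xs:string* {
    if (empty($remaining)) then
        ($sentences, string-join($current, ' ')[. ne ''])
    else
        let $word := head($remaining)
        let $rest := tail($remaining)
        (: a sentence likely ends here if the word ends in terminal punctuation,
           is not a known abbreviation, and the next word begins a new sentence :)
        let $ends-sentence :=
            matches($word, '[.!?]["'']?$')
            and not($word = $local:abbreviations)
            and (empty($rest) or matches(head($rest), '^["'']?[A-Z0-9]'))
        return
            if ($ends-sentence)
            then local:walk($rest, (), ($sentences, string-join(($current, $word), ' ')))
            else local:walk($rest, ($current, $word), $sentences)
};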

Thanks for the inspiration, Josh! And thanks to Christine Schwartz for reminding me that GitHub gists are a great place to throw things up—things that may not deserve a full-blown repository of their own. But, since gists are repositories under the hood, pull requests are welcome. There’s surely room for improvement in this code.


Reflections on learning XQuery

27 April 2013
Tags: xquery, xml, learning, and dh.

Two weeks ago several of my colleagues and I were lucky enough to take a course on “XQuery for Documents” taught by Michael Sperberg-McQueen.

I’ve taken courses on XQuery before—all excellent—but this one was absolutely unique.

While Michael wasn’t directly involved in the working group that produced the XQuery language, he wasn’t far from it—given his involvement with the W3C and the creation of XML and XSD, not to mention the creation of TEI.  Thus, besides just giving an expert introduction to XQuery, he also shed light on the communities who came together, with different interests, personalities, politics, and intellectual frameworks, to create this remarkable language.

My colleagues and I—all historians, some who specialize in the history of technology—were fascinated by this aspect of the course.  We came away with a solid foundation in the language, and an appreciation for the milestone it marks in the history of programming languages and technology.

(The creation of XQuery, as well as the creation of TEI, would be two worthy dissertation topics.)

Inspired by Michael’s course, I’ve begun thinking about ways to both share my own appreciation for XQuery and related technologies, tools, and tips that I have discovered in my own work, and to give this a human dimension, rather than just a purely instructional one.  I’ve heard it said that the best camera is the one that you have with you.  In that spirit, I’ll start writing here, in a blog that I already have.

So, let’s begin with a brief introduction about how I came to learn XQuery.

In mid-2007, as a freshly minted history PhD, my new job presented me with the challenge of revamping a website for a group of venerable historical publications.  (I’ve written an article about the project.) After researching the formats in use for encoding books and historical documents, I decided to adopt TEI as the format for the publications.   TEI P5—the standard’s 5th major version—was released in October 2007, just in time for my project to adopt it and benefit from its many advances from the start.

This left the question: how to turn the huge volumes of TEI into a website that would allow historians and the public to view, browse, and search the publications?  In a superb stroke of luck, I met James Cummings at the TEI conference at U Maryland College Park in October 2007 and told him about my project.  James encouraged me to look into native XML databases.

James’ suggestion led me to eXist-db.  eXist-db’s Shakespeare demos impressed me with their speed and precision.  Moreover, it supported XSLT, which would allow me to use the stylesheets I had adapted from the TEI community for turning my XML into HTML for the web.

But besides XSLT, eXist-db also used XQuery for its search and scripting operations.  XQuery 1.0 had just achieved recommendation status in January 2007, and Priscilla Walmsley’s O’Reilly book on XQuery was published in June—the month I graduated.

As I taught myself XQuery with eXist-db (not to mention the indispensable oXygen XML Editor), I found myself ever more comfortable with XQuery.  Thanks to an invaluable hint in the right direction by David Sewell (using the same typeswitch-based approach he outlined in this mailing list posting of his), I migrated all of my XSLT routines to XQuery.  Now, rather than having to master two languages, I could focus on the one that did everything I needed: XQuery.

Ever since that moment, I’ve used XQuery almost exclusively, and together with TEI, eXist-db, and oXygen, it has been my gateway into the world of digital humanities and software development.  XQuery was challenging to learn but still very accessible.  The successes have been rewarding.  Even after five years I am still learning new aspects of the language and uncovering new ways to apply it.

After several years of being the sole user of XQuery in my office, I’m thrilled that so many of my colleagues are learning XQuery.  It makes perfect sense given the amount of TEI and XML that we now have created.  Thanks to Michael for getting us started, and linking us intellectually to the shared sense of purpose and possibilities that led to the creation of XML, TEI, XQuery and the other foundational standards that have enabled and empowered us to do our work.

I look forward to writing more (probably here, but if elsewhere, I’ll post a link)—for them, and anyone interested in following along.


John Horlivy Remembered

30 December 2012
Tags: writing.

On this sleepless night, after feeding my daughter, I began experimenting with Tumblr as a place to record some thoughts longer than 140 characters and came across the option to post “quotes.” The quote that first came to mind was that of an English teacher at my high school, John Horlivy. I never took any of John’s courses, but he became a mentor of sorts to me and several of my friends who were deeply interested in writing poetry. He was so supportive - both during the height of my writing in my sophomore and junior years, and as the poetic juices started to slow in college as my interests turned elsewhere. I remember returning to campus and talking with him about that, and he told me that it was completely normal to go through dry spells - but that the most important thing was to keep writing. “Keep writing.” Well, John, it’s fitting that I dedicate this post - the (only latest) attempt to heed your advice - to you.


Keep writing

30 December 2012
Tags: writing.

Keep writing

–John Horlivy


Upside Down

07 April 2012
Tags: xml, xquery, existdb, and dhoxss.

depalaeographia:

After the Digital.Humanities@Oxford Summer School 2011 my post-doc project went topsy-turvy. All those mighty Xs (XML, XHTML, XQuery, XPath, eXist, oXygen…) persuaded me to drop SQL and say goodbye to the command line.

Ah, wonderful memories of DHOXSS 2011 - I just came across this post about it from last year.  I wonder how depalaeographia fared?

Oh, and hello, world of tumblr!


An under-appreciated use for XQuery: wrangling plain text

06 February 2012
Tags: tei, xquery, and exist.

In my experience teaching colleagues and students how to use XQuery and eXist-db to create dynamic websites out of TEI documents, the “syllabus” usually starts in one of two forms. The first assumes that we already have well-formed TEI documents, and we can happily dive right into XML data structure manipulation with XPath and XQuery. The second starts with no XML: a PDF, a Word document, or plain text. Now, of course, thanks to OxGarage and oXygen, we have some good tools for deriving serviceable TEI documents out of other formats. But more often than not, the text needs work. Sometimes, the importer fails to capture the structure implicit in the original. For all of these cases, XQuery proves to be an indispensable tool. XQuery’s regular-expression functions (such as matches, replace, and tokenize), together with its excellent handling of sequences and recursive functions, provide all of the tools one could ever need to tackle everything from the simplest to the most challenging text wrangling tasks.
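
As a tiny illustration of those functions at work (not the full solution to the scenario below), here is one way to read each line of an indented outline and derive its nesting depth from its leading whitespace; the sample text and the assumption of four spaces per level are just for demonstration:

xquery version "1.0";

(: A toy illustration only: split raw text into lines, then derive each line's
   nesting depth from its leading whitespace (assumed here to be 4 spaces per level). :)
let $text :=
'-Health programs
    -Goals
    -Problems in present system'
for $line in tokenize($text, '\n')
let $stripped := replace($line, '^\s+', '')
let $depth := (string-length($line) - string-length($stripped)) idiv 4
return <line depth="{$depth}">{$stripped}</line>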

Let’s examine one challenging text wrangling scenario that XQuery makes quick work of: transforming an outline of subjects from a Nixon Tapes subject log into TEI (this one excerpted from February 1971’s Tape 47):

The President left at 8:48 am
    -Administration recommendations on Capitol Hill
    -Improvements
    -Richardson’s trip to New York
    -Health programs
            -Goals
            -Problems in present system
            -Approach
            -Emphasis on quality
    -Improvements in United States’ health care
            -Maternal deaths
                    -Rate
                    -Decline
                    -United States’ rate compared to other nations
                            -Reporting system
            -Data on health
                    -Differences in reporting system
            -Low-income people
                    -Whites
                    -Non-whites
            -Mortality rates
                    -Figures
    -Resource allocation
            -Rural areas
                    -Availability of care
            -Catastrophic care costs
            -Prevention
            -Problems

This plain-text outline of subjects discussed on tapes in the Nixon White House has a hierarchical structure that is clear to the human eye. But converting this text into a form of XML that captures this structure is challenging. Let’s take a look at how we would represent this text in TEI:

<list>
    <item>The President left at 8:48 am
        <list>
            <item>Administration recommendations on Capitol Hill</item>
            <item>Improvements</item>
            <item>Richardson’s trip to New York</item>
            <item>Health programs
                <list>
                    <item>Goals</item>
                    <item>Problems in present system</item>
                    <item>Approach</item>
                    <item>Emphasis on quality</item>
                </list>
            </item>
            <item>Improvements in United States’ health care
                <list>
                    <item>Maternal deaths<list>
                            <item>Rate</item>
                            <item>Decline</item>
                            <item>United States’ rate compared to other nations
                                <list>
                                    <item>Reporting system</item>
                                </list>
                            </item>
                        </list>
                    </item>
                    <item>Data on health
                        <list>
                            <item>Differences in reporting system</item>
                        </list>
                    </item>
                    <item>Low-income people
                        <list>
                            <item>Whites</item>
                            <item>Non-whites</item>
                        </list>
                    </item>
                    <item>Mortality rates
                        <list>
                            <item>Figures</item>
                        </list>
                    </item>
                </list>
            </item>
            <item>Resource allocation
                <list>
                    <item>Rural areas
                        <list>
                            <item>Availability of care</item>
                        </list>
                    </item>
                    <item>Catastrophic care costs</item>
                    <item>Prevention</item>
                    <item>Problems</item>
                </list>
            </item>
        </list>
    </item>
</list>

One approach to this challenge would be to use find and replace. One of my colleagues tried this, using a tool at hand—Microsoft Word—and it required a 15-step set of find and replace routines, with manual corrections at many steps. Here is how we can use XQuery to accomplish the conversion in a single step. We will write a series of functions that perform the following transformations on the original text:

  1. Take each line of text and turn it into a new XML element, line, which captures the original indent level in an attribute (0 for no indent, 1 for a single tab, 2 for two tabs, and so on).
  2. Place these line elements into group elements, and recursively nest the group elements according to the indent levels.
  3. Take the new group/line tree and transform it into a TEI list/item tree.

Here is the first function, text-to-lines:

declare function local:text-to-lines($text as xs:string) {
    let $lines := tokenize($text, '\n')
    for $line in $lines
    let $level := 
        if (matches($line, '^\s')) then 
            string-length(replace($line, '^(\s*).+$', '$1'))
        else 
            0
    let $content := replace($line, '^\s*(.+)$', '$1')
    return
        <line level="{$level}">{$content}</line>
};

This text-to-lines function uses the tokenize function to split the text file into a sequence of lines (\n is the newline character).  Then, for each line, we determine the “level” of indentation.  A line with no tabs is not indented, so we assign it a level of 0; for each tab, the level of indentation increases by one.  To check for the presence of tabs, we use the regular-expression-enhanced matches function: ^\s checks for a tab (or any whitespace character) at the beginning of the string.  If this test fails, we assign a level of 0.  If there are tabs, we need to isolate and count them.  We isolate the tabs with the regular-expression-enhanced replace function, then count the remaining characters (all tabs) with the string-length function.  We can’t forget the text content of each line, so we use the replace function again to isolate everything after the tabs (the leading hyphen stays with the text, as you can see in the output below).  Finally, we construct the new line element.
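
To see these regular expressions at work on a single line, here is a small worked illustration; the sample line (with two leading tabs, written below as &#9; character references) is assumed for the sake of the example:

(: A worked example of the level-detection logic on one line with two leading tabs. :)
let $line := "&#9;&#9;-Goals"
let $indent := replace($line, '^(\s*).+$', '$1')      (: the two tab characters :)
let $level := string-length($indent)                  (: 2 :)
let $content := replace($line, '^\s*(.+)$', '$1')     (: "-Goals" (the hyphen is kept) :)
return
    <line level="{$level}">{$content}</line>          (: <line level="2">-Goals</line> :)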

Passing our text to this function returns a new sequence of elements:

<line level="0">The President left at 8:48 am</line>
<line level="1">-Administration recommendations on Capitol Hill</line>
<line level="1">-Improvements</line>
<line level="1">-Richardson’s trip to New York</line>
<line level="1">-Health programs</line>
<line level="2">-Goals</line>
<line level="2">-Problems in present system</line>
<line level="2">-Approach</line>
<line level="2">-Emphasis on quality</line>
<line level="1">-Improvements in United States’ health care</line>
<line level="2">-Maternal deaths</line>
<line level="3">-Rate</line>
<line level="3">-Decline</line>
<line level="3">-United States’ rate compared to other nations</line>
<line level="4">-Reporting system</line>
<line level="2">-Data on health</line>
<line level="3">-Differences in reporting system</line>
<line level="2">-Low-income people</line>
<line level="3">-Whites</line>
<line level="3">-Non-whites</line>
<line level="2">-Mortality rates</line>
<line level="3">-Figures</line>
<line level="1">-Resource allocation</line>
<line level="2">-Rural areas</line>
<line level="3">-Availability of care</line>
<line level="2">-Catastrophic care costs</line>
<line level="2">-Prevention</line>
<line level="2">-Problems</line>

This XML structure now contains all of the information we need to take it from this “linear” form into a “nested” form. To do this, we start at the outermost level and work our way inward, grouping each level that contains inner levels and recursing until we reach the innermost items. Our group-lines function takes the sequence of line elements and groups them together this way:

declare function local:group-lines($lines as element(line)+) {
    let $first-line := $lines[1]
    let $level := $first-line/@level
    let $next-line-at-same-level := subsequence($lines, 2)[@level eq $level][1]
    let $group-of-lines-inside-this-level := 
        if ($next-line-at-same-level) then 
            subsequence(
                $lines, 
                1, 
                index-of($lines, $next-line-at-same-level) - 1
            )
        else 
            $lines
    return 
        (
        <group>{$group-of-lines-inside-this-level}</group>
        ,
        if ($next-line-at-same-level) then 
            local:group-lines(subsequence($lines, index-of($lines, $next-line-at-same-level)))
        else 
            ()
        )
};

This will process the first “level” of our lines into one or more group elements - in our case, it will result in a single group element, containing all of the original line elements. (If our source list had more than one 0-level line, this function would return as many group elements.) In effect, this function returns the outermost layer of our list.
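
To make the “more than one 0-level line” case concrete, here is a small, invented illustration (assuming the group-lines function above is in scope):

(: An invented miniature log with two top-level entries, each with one sub-item. :)
local:group-lines((
    <line level="0">First conversation</line>,
    <line level="1">-A sub-topic</line>,
    <line level="0">Second conversation</line>,
    <line level="1">-Another sub-topic</line>
))

(: returns two group elements:
   <group>
       <line level="0">First conversation</line>
       <line level="1">-A sub-topic</line>
   </group>
   <group>
       <line level="0">Second conversation</line>
       <line level="1">-Another sub-topic</line>
   </group>
:)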

To get the recursion started, we will pass this outer layer of group elements to our process-groups function, which in turn passes each group element to the apply-levels function:

declare function local:process-groups($groups as element(group)+) {
    if (count($groups) gt 1) then
        <group>{
            for $group in $groups
            return
                local:apply-levels($group)
        }</group>
    else
        local:apply-levels($groups)
};

declare function local:apply-levels($group as element(group)) {
    <group>
        {$group/line[1]}
        {
        if ($group/line[2]) then 
            if (count(subsequence($group/line, 2)) gt 1) then 
                <group>{
                    for $group in local:group-lines(subsequence($group/line, 2))
                    return
                        local:apply-levels($group)                    
                }</group>
            else
                local:group-lines(subsequence($group/line, 2))
        else ()
        }
    </group>
};

The apply-levels function triggers the real recursive processing of the lines. It takes each group of lines, deposits the first line at the new level, and then runs the remaining lines in the group back through the group-lines function. This time, group-lines groups the inner lines according to their levels. This leaves us with a nicely nested set of group and line elements, with the original level attributes intact:

<group>
    <line level="0">The President left at 8:48 am</line>
    <group>
        <group>
            <line level="1">-Administration recommendations on Capitol Hill</line>
        </group>
        <group>
            <line level="1">-Improvements</line>
        </group>
        <group>
            <line level="1">-Richardson’s trip to New York</line>
        </group>
        <group>
            <line level="1">-Health programs</line>
            <group>
                <group>
                    <line level="2">-Goals</line>
                </group>
                <group>
                    <line level="2">-Problems in present system</line>
                </group>
                <group>
                    <line level="2">-Approach</line>
                </group>
                <group>
                    <line level="2">-Emphasis on quality</line>
                </group>
            </group>
        </group>
        <group>
            <line level="1">-Improvements in United States’ health care</line>
            <group>
                <group>
                    <line level="2">-Maternal deaths</line>
                    <group>
                        <group>
                            <line level="3">-Rate</line>
                        </group>
                        <group>
                            <line level="3">-Decline</line>
                        </group>
                        <group>
                            <line level="3">-United States’ rate compared to other nations</line>
                            <group>
                                <line level="4">-Reporting system</line>
                            </group>
                        </group>
                    </group>
                </group>
                <group>
                    <line level="2">-Data on health</line>
                    <group>
                        <line level="3">-Differences in reporting system</line>
                    </group>
                </group>
                <group>
                    <line level="2">-Low-income people</line>
                    <group>
                        <group>
                            <line level="3">-Whites</line>
                        </group>
                        <group>
                            <line level="3">-Non-whites</line>
                        </group>
                    </group>
                </group>
                <group>
                    <line level="2">-Mortality rates</line>
                    <group>
                        <line level="3">-Figures</line>
                    </group>
                </group>
            </group>
        </group>
        <group>
            <line level="1">-Resource allocation</line>
            <group>
                <group>
                    <line level="2">-Rural areas</line>
                    <group>
                        <line level="3">-Availability of care</line>
                    </group>
                </group>
                <group>
                    <line level="2">-Catastrophic care costs</line>
                </group>
                <group>
                    <line level="2">-Prevention</line>
                </group>
                <group>
                    <line level="2">-Problems</line>
                </group>
            </group>
        </group>
    </group>
</group>

As you can see, while this respects the original indentation levels of the source text, this is not yet proper TEI. Also, we have some seemingly redundant group elements. The final step is to whittle this structure down to a proper TEI list/item format. For this, we will write a groups-to-list function, which starts the new list element and calls a helper function, inner-groups-to-list, for the remainder of the transformation:

declare function local:groups-to-list($group as element(group)) {
    <list>{local:inner-groups-to-list($group)}</list>
};

declare function local:inner-groups-to-list($group as element(group)) {
    if ($group/line) then
        for $item in $group/line
        return
            <item>{
                $item/text()
                ,
                if ($item/following-sibling::group) then
                    <list>{local:inner-groups-to-list($item/following-sibling::group)}</list>
                else 
                    ()
            }</item>
    else (: if ($group[not(line)]) then :)
        for $g in $group/group 
        return 
            local:inner-groups-to-list($g)
};

Finally, we have a nice TEI list:

<list>
    <item>The President left at 8:48 am<list>
        <item>-Administration recommendations on Capitol Hill</item>
        <item>-Improvements</item>
        <item>-Richardson’s trip to New York</item>
        <item>-Health programs<list>
            <item>-Goals</item>
            <item>-Problems in present system</item>
            <item>-Approach</item>
            <item>-Emphasis on quality</item>
        </list>
        </item>
        <item>-Improvements in United States’ health care<list>
            <item>-Maternal deaths<list>
                <item>-Rate</item>
                <item>-Decline</item>
                <item>-United States’ rate compared to other nations<list>
                    <item>-Reporting system</item>
                </list>
                </item>
            </list>
            </item>
            <item>-Data on health<list>
                <item>-Differences in reporting system</item>
            </list>
            </item>
            <item>-Low-income people<list>
                <item>-Whites</item>
                <item>-Non-whites</item>
            </list>
            </item>
            <item>-Mortality rates<list>
                <item>-Figures</item>
            </list>
            </item>
        </list>
        </item>
        <item>-Resource allocation<list>
            <item>-Rural areas<list>
                <item>-Availability of care</item>
            </list>
            </item>
            <item>-Catastrophic care costs</item>
            <item>-Prevention</item>
            <item>-Problems</item>
        </list>
        </item>
    </list>
    </item>
</list>

We can run this entire transformation in one step (here $text is assumed to hold the original plain text of the subject log, e.g. read from a text file stored in the database):

let $lines := local:text-to-lines($text)
let $groups := local:group-lines($lines)
let $processed-group := local:process-groups($groups)
let $list := local:groups-to-list($processed-group)
return
    $list

From this step, a transformation to HTML for presentation is trivial, and using this new structure to drive search applications is a matter of loading the final document into eXist-db.  We’ve taken our list from a flat text file and have given it structure, enabling a whole range of uses.
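
As one hedged sketch of that presentation step, here is a small recursive function that turns the finished TEI list into nested HTML lists; the function name, the typeswitch approach, and the choice to strip the leading hyphens for display are illustrative, not part of the transformation above:

declare function local:list-to-html($node as node()) {
    typeswitch ($node)
        (: a TEI list becomes an HTML ul, with each item processed recursively :)
        case element(list) return
            <ul>{ for $item in $node/item return local:list-to-html($item) }</ul>
        (: a TEI item becomes an li: its own text (minus the leading hyphen), then any nested list :)
        case element(item) return
            <li>{
                replace($node/text()[1], '^-', ''),
                for $nested in $node/list return local:list-to-html($nested)
            }</li>
        default return $node
};

local:list-to-html($list)

In practice, you would simply swap the final return $list in the query above for a call to this function.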


Update: This post was migrated from my old posterous.com blog in Dec. 2014, thanks to the Wayback Machine’s copy.

Also, for posterity, I’ve adapted a few comments I was able to retrieve:

  • Chris Wallace (February 6, 2012): Hi Joe - nice example of the power of XQuery - I was reminded of the approach I used in the JSON-to-XML function I wrote, so I wrote up my use of util:parse for this problem. (I still have problems putting code in Posterous - how do you do it?)

  • Aaron Marrs (February 6, 2012): Fascinating stuff, Joe!

  • Joe Wicentowski (February 7, 2012): Thanks for your comments, Chris and Aaron! Chris - Your idea to use util:parse() is great. Thanks for writing it up and posting code. It shows how, with XQuery, you can even use text to fold flat text before bringing it into the XML dimension. (Indeed, working on this challenge and the post has got me thinking how bringing documents into XML is analogous to folding a one-dimensional object - linear text - and bringing it into a world of two dimensions - nested nodes or concentric circles.) Everyone: Chris’s alternative approach is at http://kitwallace.posterous.com/converting-an-indented-list-to-a-tree-the-pow. Chris - to answer your question about how I posted the code, I used oXygen to escape the angle brackets in my XML and XQuery, and then pasted the code into Posterous’s HTML view, surrounded by a pre and code set of tags. This was painful. I am thinking of looking into github gists - apparently posterous will automatically expand them. See http://blog.posterous.com/posterous-now-supports-traileraddict-embeds-a.


Why eXist Should Be in Every Digital Humanist's Toolkit

22 November 2010
Tags: tei, xml, xquery, dh, and exist.

Chances are that if you’re in the digital humanities, you either use TEI or some other flavor of XML to store all of your data, or your project uses XML in some key areas. If you use XML, then eXist should be in your toolkit. Why? Well, as you already know, XML is a fantastic way to encode and annotate scholarly data and metadata, but without a database to store it, a web server to publish it, or a search engine to analyze it, your project may fall short of its potential. eXist does all of the above: It’s a fast web server, a powerful database, and a full-featured search engine. (To contrast it with other tools used in digital humanities work, eXist isn’t a content management system like Drupal or Omeka, or a digital object repository like Fedora; it’s more of a database and an application server that can be adapted to your project’s needs.) It’s free, built on open standards, and continually improved by the open source community. It runs on Macs, PCs, and Linux and is easy to install; you can install it anywhere from your netbook or laptop to a desktop computer or a dedicated server.

What does eXist really do with your XML? At its core is the following process: You give it your XML files, and eXist happily stores and indexes them; the files immediately become available for search and retrieval. Then you use “queries” to search within the documents, organize them into collections, and analyze, transform, and publish your data. You can limit eXist to being an XML storage facility that your existing web server draws content from, or you can store your entire web application in eXist (CSS, Javascript, images, and all), and make eXist your project’s website.
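
For a flavor of what such a query looks like, here is a minimal sketch; the collection path and the TEI element names are placeholders for whatever your own data uses:

(: List the titles of all TEI documents stored in a collection;
   the collection path and element names are placeholders. :)
declare namespace tei = "http://www.tei-c.org/ns/1.0";

for $doc in collection('/db/mydata')/tei:TEI
return
    $doc//tei:titleStmt/tei:title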

While nothing this powerful could be trivial to learn and use, eXist is entirely feasible to dabble in (or even master) for someone with a humanities background. You or your colleagues will need to learn a language called XQuery, designed expressly for working with XML. But fear not: XQuery is a high level language that abstracts most of the programming away, and lets you focus on extracting the information you need from your XML. (See below for how to try live examples.) There are excellent resources for learning eXist and XQuery, including a vibrant community of users, many of whom work on humanities applications. In fact, eXist is so flexible and well-suited to the work of the digital humanist that XQuery could be the first and last computer language you’ll ever need to learn. For all these reasons, digital humanists should see eXist as an absolutely essential tool.


One of the most direct ways to get a sense of what functionality and power eXist offers digital humanities projects is to visit eXist’s homepage and browse to eXist’s XQuery Sandbox. The Sandbox contains sample texts (Hamlet, Macbeth, and Romeo & Juliet) and canned queries that you can try, alter, and play with. Find the “Paste Example” drop-down menu, and select the first item: “Simple full text query on the Shakespeare plays.” You’ll see that the query window will populate with the following:

//SPEECH[ft:query(., 'love')]

This query instructs eXist to show all speeches (SPEECH elements) that contain the word “love”—but for now let’s set aside the semantics of the query, and get to the results. Click on the “Send” button. Watch the results of the query stream back to you in the bottom results window. Notice how the word “love” is highlighted in the results to help you see the matching text. (Here’s what the syntax means: //SPEECH asks for all speech elements, and the square-bracketed expression filters or restricts the results to just those that have a match in eXist’s fulltext index for the word “love”. It’s okay not to understand every query now; it’s time to play and experiment.)

Let’s experiment! Try changing the word from “love” to another word (say “cold”), and hit “Send” again. Change the word to bird*, and notice how the search now returns hits with “bird,” “birds,” and “bird’s”—the asterisk is a wildcard for the ft:query() function. Now try each of the next few options in the drop-down menu. By the time you see the 4th option, “Show the context of a match,” the real power of XQuery becomes evident: We’re still searching speeches, but now the results of your search show each speech’s scene, act, and play. This is possible because eXist understands the hierarchical structure of XML, and can use that structure to enhance your search results. You can try as many of the queries as you like. Don’t worry, you can’t do anything wrong here, and even if you did, the eXist homepage resets itself every several hours.
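
For the curious, a query along these lines would produce that kind of result; this is only a sketch, which assumes the PLAY/ACT/SCENE/TITLE element names of the sample Shakespeare markup, and the canned Sandbox query may differ in its details:

for $speech in //SPEECH[ft:query(., 'love')]
return
    <hit>
        <play>{ $speech/ancestor::PLAY/TITLE/text() }</play>
        <act>{ $speech/ancestor::ACT/TITLE/text() }</act>
        <scene>{ $speech/ancestor::SCENE/TITLE/text() }</scene>
        { $speech }
    </hit>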


If this demonstration piques your interest and strikes you as having potential for your project, here are 5 steps you can follow to download and install eXist onto your own computer and get working with your own data.

  1. Download eXist: Go to the eXist homepage, click on the big “Download,” and look for the section entitled “Stable Release.” If you are running Windows, download the version ending in “.exe.” Otherwise, if you’re running Mac or Linux, download the version ending in “.jar.” The file that downloads is the eXist installer.  (Note to Windows or Linux users: Before you can install eXist, you need to download and install the Java JDK.)

  2. Install eXist: Once the file is downloaded, double-click on it to start the eXist installer. Follow the prompts to select an installation directory on your hard drive, and choose a password (or leave the password blank for now). The default choices that the installer provides you with are all acceptable. (Once you’ve finished installing eXist, if you navigate to the folder where you installed eXist, you’ll see about 50 files and folders. Keep them all for now, and you can mostly ignore them.)

  3. Start eXist: eXist is different from many applications on your computer, and starting eXist is your first indication of this. When you start eXist, you’ll notice that it’s actually more like a service that runs quietly in the background than an application with its own windows and graphical interface; in fact, you usually interact with eXist through other programs, like your web browser. So let’s get it started. Starting eXist on Windows is pretty straightforward: you’ll find an icon on your desktop called “Start eXist”; double-clicking on this icon will launch a command line window and display a cryptic log of eXist’s startup routine; just keep this window open. On Linux or Mac, though, you’ll need to open a command line: On Mac, go to Applications > Utilities, and start Terminal. Then use the “cd” command to navigate into the folder where you installed eXist, and type “bin/startup.sh”. You’ll see the log of eXist’s cryptic startup routine, and again, just keep this window open. The contents of this log aren’t important for now, but you should see it advance pretty quickly, until it halts with a message like, “Server has started on port 8080.” If you see that, you’re golden.

  4. Take eXist for a spin: Now that eXist is running, you can begin interacting with it through your web browser. Open your web browser to http://localhost:8080/exist/, and you’ll see a page very much like eXist’s homepage. (Note: This link only works when your eXist is running. The “localhost” bit means your own computer, and the 8080 bit is a “port” that eXist runs on by default; if this bothers you, don’t worry, since it’s not hard to change eXist’s configuration so you don’t need to type 8080. For now we’ll stick with 8080.) In fact, it is identical to eXist’s homepage, since eXist’s homepage is run, naturally enough, on eXist. Now that eXist is running on your own computer, you don’t have to be on the internet to explore eXist. (You’ll never be bored on a train or plane again.) I’d suggest clicking around a bit to get acquainted with eXist: from the homepage, you’ll find a link to the “Main Documentation,” the “Feature Sheet,” and the all-important “Admin” page. The Admin page will ask you for your username (“admin”) and the password you chose during the installation process, and from here you can perform many useful tasks. For example, you can install the example Shakespeare files and the sample Sandbox by clicking on “Examples Setup” and then “Import Files.” If you want to search eXist’s documentation, you can install it by clicking on “Install Documentation” and then “Generate.” Once you’ve installed the examples and the documentation, it’s instructive to click on the “Browse Collections” panel to see the data you’ve just added to the database: the Shakespeare data is in the “shakespeare” collection, and the Sandbox example queries are in the “example.xml” file. The root collection is called “db,” so the full path to this file is “/db/example.xml.”

  5. Add your own data: eXist really starts to shine when you add your own data to the database and begin writing queries on your data. There are several ways to upload files to the database, but we’ll start with one simple way. From the Admin page (see step 4), click on “Browse Collections.” Let’s create a new collection for your data. In the “New collection:” field near the bottom of the page, enter “mydata”, and click “Create Collection.” Notice that the new “mydata” collection appears in the listing. Click on the “mydata” collection. It’s empty, so let’s add an XML file. Click on “Choose File,” browse to one of your XML files (if you need one, download more Shakespeare), and click on “Upload.” Notice that your file (say, “myfile.xml”) is now in the list of files. You can even upload non-XML files; while they’re not searchable like XML, eXist happily stores them. Now that your data is in eXist, you can return to the Sandbox and begin querying it. It’s unlikely that your data matches the structure of the Shakespeare data, so you’ll need to experiment with your own queries. (Note that the ft:query() function used in the first Sandbox queries above may not work on your data until you’ve added full text indexes; instead, try contains(). To browse all of the functions built into eXist, see the Function Library on eXist’s homepage or in your local copy of eXist.) If you’re ready to turn your Sandbox query into a webpage with its own URL, save the text of your query to a file ending in “.xq” (e.g. “myquery.xq”) and upload it to your collection; then enter, for example, http://localhost:8080/exist/rest/db/mydata/myquery.xq (a minimal example query appears just after this list). If you hit a roadblock, don’t despair. This is a good time to explore online resources for learning XQuery, like the XQuery Wikibook. Priscilla Walmsley’s XQuery (O’Reilly 2007) is a great reference book too. Remember too that you’ve got all of the eXist documentation in your browser, browsable and searchable. Now is a good time to join the eXist-open mailing list (search or subscribe) for answers to your questions about eXist, and the XQuery-talk mailing list (search or subscribe) for answers to your questions about basic XQuery.
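
To make that last step concrete, here is a minimal sketch of what a query file like “myquery.xq” might contain; the collection path is a placeholder for your own collection:

xquery version "1.0";

(: A minimal example query: list each document in the collection and count its elements.
   The collection path is a placeholder. :)
for $doc in collection('/db/mydata')
return
    <document uri="{document-uri($doc)}" elements="{count($doc//*)}"/>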

I hope this helps give you a taste of what eXist could offer your digital humanities project, and whets your appetite for more. Questions? Comments?

(This post was inspired by coffee break and hallway conversations I had at the Chicago Colloquium on Digital Humanities and Computer Sciences 2010 meeting. See the tweets:  #dhcs10.)


Update: This post was migrated from my old posterous.com blog in Dec. 2014, thanks to the Wayback Machine’s copy.

Also, for posterity, I’ve adapted a few comments I was able to retrieve:

  • Erik Simpson (December 9, 2010): Thank you for this fine post. As someone just getting into this field, I’m trying to understand the terrain: is eXist an alternative to XSLT, or do they have different functions?

  • Joe Wicentowski (December 9, 2010): Great question. The short answer is that, actually, eXist and XSLT are fundamentally complementary technologies, and it could even be argued that eXist enhances XSLT. Why? First, eXist natively supports XSLT, and does so in some very cool, unique ways. The most straightforward way is that you can use eXist to apply XSLT stylesheets to your TEI (or your favorite flavor of XML) documents. But since all of your XML is stored in the database, you could easily apply a stylesheet to an entire document, a fragment of a document, or an entire collection of documents. You could also apply the stylesheet to just the brief snippets of a fulltext search result. This is a different model than one in which you “transform all of your documents to HTML and upload them to the web server.” eXist is the web server and the database, so it is very flexible about letting you keep your XML intact and pull out just the fragments of documents (or entire collections) that you want to work with for a given purpose, on the fly, without “generating the entire website” in advance or “shredding the document” up for a search engine. This is one advantage of working with a “native XML database” like eXist.

    Another very cool thing about eXist’s XSLT support is that you can use it to create flexible “URL rewriting” rules and pipelines. You could design a pipeline so that “book1.xml” (which consists of TEI divs with IDs of section1, section2, section3, etc.) is accessible on your website via a nice URL like http://mysite.com/book1/section1.html. This section1.html isn’t a file in a physical directory called book1. Instead, eXist interprets this URL as asking for (a) the div inside of book1.xml with ID section1, with (b) the “html” XSLT stylesheet applied to the div. In other words, you can custom craft your URLs to look very clean but in fact apply XSLT stylesheets in some sophisticated ways. So in these ways, you keep all the power of XSLT (and can continue using XSLTs you already have), and extend it by virtue of the fact that you have an XML database.

    I, for one, began using eXist in just this way: I used the standard TEI community stylesheets that I had customized, and I used URL rewriting to interpret URL requests to retrieve just the desired XML for my needs.

    After some time, as I got more and more comfortable with XQuery (since it’s what you use for the logic of your web sites or web services in eXist), I found myself preferring to work in XQuery rather than XSLT. So I decided to go 100% pure XQuery, and I re-wrote my XSLT stylesheets as XQuery. There’s a growing community of folks who use XQuery for their XML transformations; see the last link below. You might even hear people arguing that one or the other is superior; but in my experience, you can use whichever you’re most comfortable with. eXist doesn’t really care which you use for your document transformations!

    Final note: the XQuery module used to invoke XSLT stylesheets and apply them to XML documents is the “transform” module. See http://demo.exist-db.org/exist/functions/transform. There’s also a good article on using XQuery for your document transformation, much in the way XSLT is used. See http://en.wikibooks.org/wiki/XQuery/Typeswitch_Transformations.

  • Erik Simpson (December 10, 2010): Thanks very much for your detailed and helpful reply. I look forward to doing more with this!

  • cortezthekiller (July 12, 2012): I found your post very interesting. I know the basics of some of the XML technologies (XML Schema, XPath, XSLT, XSL-FO) and I would like to start using XQuery with XML databases. eXist looks like the right choice, but it seems to me quite a difficult thing to learn … Do you know about some reference books (O’Reilly, etc.) besides the eXist official website?

  • [I haven’t been able to locate my response yet.]

  • cortezthekiller (July 12, 2012): Thanks for all these references. I’m trying to use XSLT for transforming XML Docbooks into PDF/HTML/ePub. I thought that it would be a good thing to use a database and launch queries for indexing the documents. At first I was thinking to use PHP & MySQL to develop my own web application, but MySQL has very poor XML support. I’ll take it easy and start with eXist & XQuery after the summer.

  • Joe Wicentowski (July 13, 2012): eXist-db is great for creating dynamic web applications for searching and transforming DocBook with XSLT. Let us know on exist-open if we can help.

  • cortezthekiller (July 14, 2012): I wonder if the Higgs Boson is made of XML, too :) One last thing: do you recommend some IDE for eXist-db? I’ve seen there’s an Eclipse plugin and one IDE named eXide. Thanks a lot.