☯ joewiz.org

Where Joe Wicentowski writes, when 140 characters doesn't suffice.

A Google Chrome bookmarklet for fixing URLs with DNS errors

16 December 2014
Tags: chrome and bookmarklet.

Bookmarklets are great. They’re just like any bookmark you add to your web browser, except they contain a snippet of JavaScript that runs when you click them. One of the first bookmarklets I used was Instapaper’s; it passes the window’s current URL to the service so it can save the page for you to read later.

Today I had a burning need for a new bookmarklet. I received an email with a link to a URL redirect service that was located behind a corporate firewall, inaccessible from the public internet. The URL was something like this (an illustrative stand-in):

http://intranet.example.gov/redirect?url=http://example.com/article
If you try to open this link, your browser will naturally tell you that “This webpage is not available.” Fixing the URL is easy enough: just chop off everything before the ?url= bit. But doing this over and over struck me as a chore that a bookmarklet could take care of. I’d written a few convenience bookmarklets in the past to switch between local and remote versions of my website. The JavaScript in these bookmarklets looks something like this one, which replaces the server name in the window’s URL with localhost:8080:

javascript:window.location.href=window.location.href.replace('joewiz.org','localhost:8080');
With the corporate intranet URL, though, I ran right into a roadblock using this approach: window.location.href was returning data:text/html,chromewebdata instead of the URL in the location bar. The problem was that Chrome’s DNS error page had usurped the window’s location! The original URL seemed to be nowhere in the DOM or anywhere in window.* where I was used to looking.

But then I noticed that the text of Chrome’s error page contained the original URL. While View Source was of no help, I brought up Chrome’s Developer Tools (right-click on a web page, and select Inspect Element) and dug into the actual source code of the error page. I found the full original URL in a JavaScript variable accessible via templateData.summary.failedUrl. Typing this phrase into the Developer Console:

templateData.summary.failedUrl

… returned the original URL:

"http://intranet.example.gov/redirect?url=http://example.com/article"
Bingo! Now I just needed to strip off everything before the ?url= parameter. Regex to the rescue! Entering:

templateData.summary.failedUrl.replace(/.*\?url=/, '')

… returned:

"http://example.com/article"
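That stripping step is easy to try outside the browser, too. Here is the same logic pulled out as a standalone function (a sketch of the approach, with an illustrative stand-in URL, not the bookmarklet itself):

```javascript
// Strip everything up to and including "?url=" to recover the target URL.
// (The URL below is an illustrative stand-in, not a real intranet address.)
function extractTargetUrl(brokenUrl) {
  return brokenUrl.replace(/.*\?url=/, '');
}

extractTargetUrl('http://intranet.example.gov/redirect?url=http://example.com/article');
// → 'http://example.com/article'
```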
With this in hand, I was able to create my bookmarklet:

javascript:window.location.href=templateData.summary.failedUrl.replace(/.*\?url=/,'');
If you’d like to use this bookmarklet too, just drag the following link into your bookmark bar:


(Note: Jekyll appears to escape some of the characters in this bookmarklet. It still works, but if you, like me, prefer to avoid %-encoded characters when possible, just replace the bookmarklet’s URL with the sample code above.)

Filling PDF forms with PDFtk, XFDF, and XQuery

13 February 2014
Tags: XQuery, eXist-db, PDF, PDFtk, and XFDF.

It’s hard to believe that PDF is still used for forms these days; they seem so last decade, or even century, compared to what’s possible on the web. I dread the tedium and boredom of filling out a PDF form with tons of fields or creating multiple editions of a form with different information on each.

To illustrate the pain of a PDF form, imagine you have a single blank form—take NARA’s SF702 form. It’s a good example because it contains so many fields. Some of the fields appear to be intended to be filled out with a pen, but it would save time if you could pre-populate the form with the days of each month. A year’s worth of forms, though, would mean 365 fields. Tab through the form, noting how the cursor jumps through the fields out of order. Filling in these forms with a program like Adobe Acrobat is an RSI-inducing nightmare. You’d be tempted to give up and use a pen, which rather defeats the purpose of an electronic form.

What if you could generate and organize your data first and then have a program do the tedious form filling for you? You might know how to generate dates with a spreadsheet or with a programming language like XQuery. But how can you get data onto the form? How can you automate the task of filling in the data? Here’s how.

The key resource that got me started was a very helpful article called “Fill Online PDF Forms” by Sid Steward. The article introduced Sid’s tool, PDFtk—a free command line tool for manipulating PDF files—and showed how to extract information about the fields in a PDF form and use this information to generate a Forms Data Format (FDF) file with your data and apply it to the PDF. I downloaded the installer for my Mac (it works on Windows and Linux too), installed it, and got to work extracting the fields from my PDF form, sf702.pdf, with this Terminal command:

$ pdftk sf702.pdf dump_data_fields > sf702-data-fields.txt

This command tells PDFtk to look for the data fields—the text boxes and other input widgets—in the PDF and “dump” information about them into a new file, sf702-data-fields.txt.

If you look at the data fields file, you’ll see a series of entries like this, one per data field that PDFtk identified:

FieldType: Text
FieldName: form1[0].#subform[0].MONTHYEAR[0]
FieldNameAlt: MONTH/YEAR. Enter 2 digit month and 2 digit year.
FieldFlags: 8388608
FieldJustification: Left

Of these five entries per field, the key one for our purposes is the second, FieldName. You’ll use FieldName to tell PDFtk which field to insert each piece of your data into. (I’d suggest comparing the order of the entries in the data fields file with the order of the fields in the original form by pressing the tab key in your PDF viewer to advance from field to field; the order should be the same, which is quite handy when the field names themselves are unhelpful.)
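For a quick inventory of just the field names, a grep/sed one-liner over the dump file works well. Here it is run against a short stand-in sample, since the real sf702-data-fields.txt is long:

```shell
# Create a stand-in sample of a PDFtk data-field dump
cat > /tmp/sample-data-fields.txt <<'EOF'
FieldType: Text
FieldName: form1[0].#subform[0].MONTHYEAR[0]
FieldNameAlt: MONTH/YEAR. Enter 2 digit month and 2 digit year.
FieldFlags: 8388608
FieldJustification: Left
EOF

# Pull out just the FieldName values
grep '^FieldName:' /tmp/sample-data-fields.txt | sed 's/^FieldName: //'
```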

As Sid’s article explains, once you know the FieldName values for your form, you then need to generate a Forms Data Format (FDF) file that contains the values for each field you want to fill in on the form. An FDF file is just a text file that a program like PDFtk can use to fill your data into the correct field. So what does FDF look like? Sid provided the following example, with two fields (city and state) and two corresponding values to fill in (San Francisco and California):

%FDF-1.2
1 0 obj
<< /FDF << /Fields [ << /T (city)  /V (San Francisco) >>
                     << /T (state) /V (California) >> ] >> >>
endobj
trailer << /Root 1 0 R >>
%%EOF

Yuck! A nasty, domain-specific format! Sid kindly provided a PHP-based forge_fdf script to ease the pain of creating an FDF file, but I didn’t want to muck with PHP; nothing against PHP, but I wanted to use XQuery, the programming language I happen to know best. I knew I could write a script to generate the calendar and other data, but the sample FDF looked awful, and the thought of generating FDF with XQuery nearly made me give up.

As one final effort, I searched Google for “FDF XML” and was excited to find an XML edition of FDF, called XFDF, the XML Forms Data Format. Instead of needing to generate something like the FDF sample above, I just needed a nice, clean XML document in the XFDF namespace, containing field elements and a simple name-value structure like this:

<xfdf xmlns="http://ns.adobe.com/xfdf/">
    <field name="city">
        <value>San Francisco</value>
    </field>
    <field name="state">
        <value>California</value>
    </field>
</xfdf>
Anyone who has used XML could easily code this by hand, but with a little XQuery knowledge, you can dynamically generate the fields and values. And best of all, Sid’s PDFtk tool is just as happy to take XFDF as plain old FDF and apply it to your form.

So I set about creating my XFDF files. I fired up eXist-db and started its built-in XQuery editor, eXide (live demo). Knowing that I would need calendaring functions for my form, I used eXist-db’s Package Manager to install the FunctX library, a collection of useful functions that includes days-in-month(). Since I needed 12 forms, one for each month of the year, I wrote an XQuery that generated 12 separate XFDF documents. You can see my full XQuery for my specific form, but I present a slightly simplified version here:

xquery version "3.0";

import module namespace functx="http://www.functx.com";

let $year := '2014' (: as a string, for string-join() and substring() below :)
    (: Generate the months of year as padded, two-digit values, e.g., January > 01 :)
let $months := (1 to 12) ! functx:pad-integer-to-length(., 2)
for $month in $months
    (: Craft an xs:date for each month to calculate the days in the month :)
let $first-day := xs:date(string-join(($year, $month, '01'), '-'))
let $days := (1 to functx:days-in-month($first-day)) ! functx:pad-integer-to-length(., 2)
let $month-year := $month || '/' || substring($year, 3, 2)
return
    <xfdf xmlns="http://ns.adobe.com/xfdf/" xml:space="preserve">
        <field name="form1[0].#subform[0].MONTHYEAR[0]">
            <value>{$month-year}</value>
        </field>
        {
            for $day at $n in $days
            let $field-name :=
                if ($n le 22) then
                    'form1[0].#subform[0].Date' || $n || '[0]'
                else
                    'form1[0].#subform[0].Date' || $n - 22 || '[1]'
            return
                <field name="{$field-name}">
                    <value>{$day}</value>
                </field>
        }
    </xfdf>

This script generates 12 XFDF documents, one for each month of the year, with fields for the month-year (e.g., 01/14 for January 2014) and for each day of the month (01-22 for the first column of fields, which holds the first 22 days of the month, and 23 onward for the second column, which holds the rest). Notice also that I had to go to some effort to generate the correct field names (e.g., form1[0].#subform[0].Date1[0] for the field containing the first day of the month). These were the field names as I found them in the text file of field data that I generated above with PDFtk. Although the field names were complex, at least there was a pattern I could use to generate them without typing each one.

Once I had my XQuery working correctly, I copied the resulting XFDF documents out of the database onto disk and crafted a bash script to apply the XFDF files onto the form with PDFtk:

$ for FDF in *.xml; do BASENAME=`basename -s .xml $FDF`; pdftk sf702.pdf fill_form "$FDF" output $BASENAME.pdf; done

This takes each XFDF file (e.g., sf702-01.xml for the January form), applies it to the blank PDF form (sf702.pdf), and saves the filled form as a new PDF named after the XFDF file’s basename (sf702-01) plus the .pdf extension, yielding sf702-01.pdf for January.

It worked like a charm!

To illustrate some other capabilities of PDFtk, I merged the PDFs together to form a single file for convenience:

$ pdftk *.pdf cat output sf702-2014.pdf

And I “flattened” the form data to save space (over 60% in my case), which also makes the form no longer editable:

$ pdftk sf702-2014.pdf output sf702-2014-flattened.pdf flatten

Jekyll Blogging in the Browser with prose.io

20 August 2013
Tags: writing and jekyll.

In my continuing quest to find good tools for writing for this Jekyll- and Github Pages-based blog (see my last post about writing from the iPhone), let’s turn to the browser.

Imagine this scenario: You’re sitting at a computer and want to write a new post or edit an existing one. You could use GitHub’s web-based editor. It’s quite good. But one look at how GitHub previews the Markdown source of that same last post shows that GitHub’s Markdown parser doesn’t cope well with Jekyll’s YAML front matter (which holds items like the post’s title, date, etc.). GitHub’s editor succeeds at letting you edit and commit your changes directly to GitHub, but falls short when it comes to Jekyll Markdown files.

Enter prose, a highly polished and beautifully designed web-based GitHub- and Jekyll-aware text editor. While you can grab the source and write an entire CMS around it (as prose.io’s developers, Development Seed, did rather radically for https://www.healthcare.gov/), you can also use the prose.io site itself to edit files in any of your GitHub repositories. In fact, right off the bat when visiting the site, you’re prompted to authenticate via GitHub’s OAuth interface. Once authenticated, you’ll see your list of repositories and can select a file to edit or create a new file.

As you would expect, Prose has a nice Markdown-aware editor and preview tool and lets you commit your changes directly to GitHub, but its Jekyll-aware functions lie dormant until you edit a Jekyll post or create a new file in a Jekyll repository. Then a new icon appears in the editor’s sidebar: Raw Metadata. By default this is a plain text area with YAML syntax highlighting, but it’s actually quite configurable, displaying just the fields you want to see, like Title, Tags, etc. When creating a new file, prose.io also prepopulates the filename field with the conventional Jekyll date prefix, yyyy-mm-dd. A nice touch.

I first heard about prose.io on episode 54 of The Web Ahead podcast. I discussed the episode in my first post about Jekyll, and it’s worth a listen to learn about Jekyll in general. But the guests on this episode also happened to be Dave Cole and Young Han from Development Seed, the crew that created prose.io. If you want to listen to this section of the episode, the discussion about prose.io starts at 45:20.

Were I to start my site again, I’d probably base it off the prose.io starter project. It includes prose-specific configuration - meaning you don’t have to add this configuration after the fact.

One caveat about prose: If you have started a new file, you will lose your work if you click on the prose.io icon to browse the About link or hit the browser’s back button. Ask me how I know. This is the second time I’ve typed this article in prose.io. So be sure to click on the Save icon. I’ve chimed in on an existing issue about this.

Mobile Blogging with Jekyll

18 August 2013
Tags: writing and jekyll.

One downside of moving to Jekyll (as I recently did with this blog) is that you leave the vibrant ecosystem of mobile blogging apps. Tumblr’s app, for example, let me get thoughts down with my one free hand and post them right away, or edit old posts with ease.

But mourn no more, Jekyll bloggers! Octopage is an iPhone app that lets you edit and create new posts to your GitHub Pages-based Jekyll site. (Despite the Octo- prefix, it works fine with plain old Jekyll and doesn’t presuppose Octopress.) Start the app, enter your GitHub credentials, select your GitHub Pages repository, and you’re ready to begin editing existing posts or creating a new one. Octopage even prepopulates new posts with the requisite YAML front matter for Jekyll posts and has several settings for code block style and autocapitalization that create the right environment for your editing. Built-in Markdown preview and a syntax guide are nice touches for people like me who are still learning Markdown.

A few nits: I would prefer that Octopage use Github’s official OAuth API for authentication. Entering my username and password into an app feels so 2010. I encountered some display glitches that left me with only 2 lines visible. Making the app universal and thus pleasant to use on the iPad would be great.

Ultimately, I hope more apps add native Github Pages and Jekyll publishing abilities. We need a more vibrant ecosystem for Jekyll. I really hope that the fantastic new Editorial app for iPad will rise to the occasion and add these capabilities. Or perhaps Octopage’s creator will integrate directly with Editorial, something Editorial was built for.

But for now, Octopage gets the job done and does so quite nicely. To prove it, I wrote this entire post in Octopage — with my left thumb while my daughter napped in my lap.

Goodbye Tumblr, hello Github

23 July 2013
Tags: tumblr, github, jekyll, and writing.

I’ve moved my blog from Tumblr to Github Pages. While Tumblr is a capable blogging platform, I had become frustrated with it. I began longing for a way to write more simply and yet with more rigor — with simple, plain text.

The flip side of my appreciation for sophisticated markup vocabularies like TEI is my love of its plain text underpinnings: clean, precise, portable. The plain text-based Markdown format (explained nicely and briefly by Brett Terpstra) is perfect for writing, especially for the web. Github is hardly alone in supporting the increasingly pervasive Markdown, but Github’s Markdown support is solid. That said, there’s more to Github Pages than Markdown support.

As Konrad Lawson has written, Github is a promising platform for writing and collaborating on texts of all kinds, from prose to syllabi. I’m not new to Github but hadn’t considered this before reading Konrad’s article. Until now I’ve primarily used Github to house my own XQuery, eXist-db, and XML projects and code snippets (or, in Github terminology, gists) and to contribute to other projects. I’ve come to appreciate its rich version control tools and how it facilitates collaboration. Github Pages brings one more possibility to those Konrad mentioned: your work can be published as public web pages, in whatever form you choose: blog, tutorial, syllabus, book.

I’m particularly excited about using Github Pages for my own articles that involve XQuery or XML code, since it does a great job adding helpful “syntax highlighting” to code and allows me to embed my gist code snippets directly in a post. It was cumbersome and time-consuming to get code to look decent on Tumblr (or other sites I’ve tried to post code on), meaning that I spent time on appearance that could’ve been spent on substance. As you’ll see in the old posts I’ve adapted on this site, Github Pages nails the challenge of XQuery syntax highlighting—quite a feat—and, best of all, required no extra work on my part. It’s no coincidence that I began to investigate Github Pages just as I began to contemplate a future series of posts involving lots of code.

Before I took the plunge, I decided to search for podcasts to hear informed discussion about Github Pages and Jekyll, the static site generator that undergirds it. I was lucky to find a very recent episode of a show called The Web Ahead that focused on Github Pages and Jekyll.1 Convinced that the pros outweighed the cons, I spent a few hours over the course of a week tinkering with Jekyll, installing it locally, getting a feel for the system, and importing my old articles.

I’m still getting situated here, so please subscribe to this site’s feed with your feed reader of choice. I’ll also post links to new articles on Twitter (I’m @joewiz). Comments are always welcome.

Finally, I’ve posted the full source for this blog on Github, where you can also track its short history so far. You can even see the plain text source file for this article - written, of course, in Markdown.

  1. I guess I’m not the first to have the impression that the host, Jen Simmons, sounds strikingly like Terry Gross.

XQuery's missing third function

06 July 2013
Tags: xquery, search, text mining, and dh.

Even if you don’t know XQuery, if you’ve only heard of it, you know XQuery is a language built for querying, or searching, XML. And if you’ve started learning XQuery for a digital humanities project, you use XQuery because it helps you search the text that you’ve put inside your XML. Given your interest in searching text, it’s likely that the first function you learn in an XQuery class or tutorial is contains(). Take this simple XPath expression:

//book[contains(title, "arm")]

This economical little query searches through all the books in your XML for those with a title that contains the phrase “arm” — you know, all of your books about armadillo shwarma. Tasty, right?

Then in the next lesson you learn about the matches() function, which can do the same simple searches as contains() but can also handle patterns, expressed using a pattern matching language called regular expressions:

//book[matches(title, "[A-Za-z]\d{2}T!")]

This finds titles like “W00T!” and “L33T!” — an upper- or lowercase letter, two digits, a capital T, and an exclamation mark. Slick!

Then, naturally, you learn the highlight-matches() function, which highlights the phrase or pattern that you searched for:

//book[highlight-matches(title, "[PT]ickl[a-z]+")]

This highlights the matching portion of the book titles: “The Art of Pickling” and “How to Tickle the Ivories like a Pro.” Super!

But wait! The highlight-matches() function never appears in your lessons or class materials. It’s not in the XQuery spec. Not the 1.0 version, not the 3.0 version. Surely, this must be a mistake. No, your teacher says. You google for it. You click through the links. Stuff about indexes? Proprietary functions? The disappointment sets in. Really? No standard way to highlight the search results?

This was my experience, lasting several years, until today. I realized that I could combine two features of XQuery 3.0 — the analyze-string() function and higher order functions — to write a simple, implementation-independent highlight-matches() function, allowing us to write queries like this:

local:highlight-matches(
    //book/title,
    "[PT]ickl[a-z]+",
    function($word) { <b>{$word}</b> }
)

To make this easier to read, I’ve split the expression onto several lines. Here’s what’s going on:

  1. This should look pretty familiar: we return all the pickling and tickling titles. But instead of applying contains() or matches(), we use local:highlight-matches(). And instead of putting the function inside a predicate (i.e., [inside square brackets]), we put the function outside. This is because our function doesn’t merely serve as a condition (i.e., return all books whose title matches); it actually creates an in-memory copy of the nodes that meet the condition with the highlight function applied.
  2. Whereas we gave contains() and matches() two parameters (title and the phrase/pattern), we pass local:highlight-matches() a third parameter: a function. You may not have ever used a function as a parameter, but this is a perfectly valid thing to do in XQuery 3.0. It’s the idea of “higher order functions” - or functions that can take other functions. The advantage of letting you define your own highlighting function is that you might not want to highlight with <b> tags. Rather, you might want to surround matches with a <person key="#smith_michael"> tag. In other words, you might use highlight-matches() to do more than “highlight.”
  3. Submit the query. The local:highlight-matches() function finds the matching books and works its magic, returning the titles with the properly bolded phrases <b>Pickling</b> and <b>Tickle</b>.

But wait! If this highlight-matches() function isn’t part of the XQuery specification, where can you get it?

I’ve posted the source code to highlight-matches() as a GitHub gist. Copy and paste the code into any XQuery 3.0-compliant engine. Need one? Try eXist-db, which has a handy online sandbox called eXide that you can access from any web browser.
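In outline, the function combines analyze-string() with the caller-supplied highlighting function, along these lines (a simplified sketch of the approach, not the gist’s exact code):

```xquery
declare function local:highlight-matches(
    $nodes as node()*,
    $pattern as xs:string,
    $highlight as function(*)
) {
    (: keep only the nodes that actually contain a match... :)
    for $node in $nodes[matches(., $pattern)]
    return
        element { node-name($node) } {
            (: ...and rebuild each one from the match/non-match segments
               that analyze-string() identifies :)
            for $segment in analyze-string($node, $pattern)/*
            return
                if (local-name($segment) eq 'match')
                then $highlight(string($segment))
                else text { $segment }
        }
};
```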

Once you get the sample code working, try writing your own highlighting function to return underlined text or text with a yellow background—or find instances of people whose names you know appear in the text and tag them using <person> or proper TEI.

And enjoy!

(And if you’re one of the people who figured this out long ago, or as soon as XQuery 3.0 came along — which admittedly is still in draft form, though its higher order functions and analyze-string() function, which made this all possible, have been in place for some time now — please take a look at the code and add some comments or submit a pull request. Let’s ensure everyone learns this function right after contains() and matches(), okay?)

Living in an OAuth & JSON World

04 July 2013
Tags: xquery, oauth, apis, json, expath, gov20, twitter, socialmedia, existdb, and opensource.

Another day, another gist. Today’s was prompted by a question on the eXist-db mailing list about how to access OAuth-based services like the Google API with XQuery. I happened to have just been working on accessing the OAuth-based Twitter v1.1 API for the new social media section of my office’s homepage, so I posted the code and some pointers. Like the gist I posted yesterday, I hope others can use these bits of XQuery code.

But there’s a back story and, dare I say, some illustrative lessons, to this latest addition to my series of posts and gists on XQuery.

Until recently, writing a program to retrieve one’s latest tweets was as simple as visiting the Twitter homepage: you just made a basic, unauthenticated HTTP request to Twitter’s servers to get the data you needed. But with version 1.1 of Twitter’s API, Twitter announced a new requirement: that all requests to its API be signed and authenticated using the OAuth 1.0 protocol. This complicated the task of getting data from Twitter enormously. The OAuth protocol, while not rocket science, requires one to jump through a rather intricate sequence of steps to compose the parameters of a request and then cryptographically sign the request with a hashing function. (I’m not complaining about the protocol; it does a great job providing an authentication layer for the web. I’m just saying that requiring OAuth to retrieve tweets imposes a pretty heavy burden on users and developers.) If that weren’t enough, Twitter also ended support for the XML-based Atom format, leaving JSON as the only format for results. That left me with two problems.

First, XQuery’s rich function library does not include the HMAC-SHA1 cryptographic hashing algorithm needed to sign OAuth requests. So I turned to Claudius Teodorescu, who applied his considerable Java skills to the task of creating an HMAC-SHA1 function for eXist-db, the XQuery-based server that powers history.state.gov. We took it a step further, releasing Claudius’s work to the EXPath community in the form of a specification: the EXPath Crypto Module. (The EXPath community builds common standards for XPath and XQuery implementations.) Claudius also released his module as an EXPath package for eXist-db, which is now available in the eXist-db Public Package Repository for anyone to download and install (to do so, go to eXist-db’s Dashboard, click on Package Manager, and find “EXPath Cryptographic Module Implementation” in the list of packages). Look at the prolog of the OAuth module I posted in today’s gist, and you’ll see that it imports Claudius’s module.

So I was able to check OAuth off my list of problems.

But besides handling OAuth, I also needed a way to deal with JSON. JSON is an increasingly ubiquitous data format in the world of APIs, but its data model is subtly incompatible with XML’s, and XML-based software like eXist-db has a difficult time ingesting or searching JSON data. Luckily, there were a number of XQuery libraries to choose from, and I decided to use one that John Snelson wrote for XQilla. With his permission, I updated it a bit, using some new features in XQuery 3.0 to make the library implementation-independent, and released the updated library on GitHub. Thanks to GitHub’s mechanism for code contributions (“pull requests”), the library has already received several improvements from the community. The package is also available in eXist-db’s public app repository and the CXAN package repository. (I’m also eagerly following the JSONiq project, which is working on extending XQuery to deal natively with JSON, obviating the need to convert JSON to XML at all.)

So I was able to check JSON off my list of problems.

This paved the way for yesterday’s addition of social media links to the homepage of history.state.gov, and coming soon, a complete, searchable social media archive.

All in all, a story — albeit not unique — of open source communities working together to build solutions to common challenges.

For more on social media archives in government — the ultimate objective beyond the immediate goal of displaying our latest tweets on our homepage — see NextGov’s aptly titled article, Saving Government Tweets is Tougher Than You Think.

Trimming text without cutting off words, using XQuery

03 July 2013
Tags: XQuery.

A new Github gist:

Helps trim phrases of arbitrary length to a maximum length, without cutting off words, as the substring() function would inevitably do, or ending on unwanted words

I wrote this to handle Tumblr photo posts, which have no explicit title, only a caption, and in the cases I’ve seen, “captions” are actually full-length posts.  I needed to trim these captions to a maximum length - e.g., 140 characters - without cutting off words or ending on unwanted words - e.g., the.
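The core idea translates to any language: cut at the limit, back up to the last whole word, then drop unwanted trailing words. A quick sketch in JavaScript (the function name and stop-word list here are illustrative; the real implementation is the XQuery gist above):

```javascript
// Trim text to maxLen characters without cutting a word in half,
// then drop unwanted trailing words like "the".
function trimToWord(text, maxLen, stopWords = ['the', 'a', 'an', 'of']) {
  if (text.length <= maxLen) return text;
  const words = text.slice(0, maxLen).split(/\s+/);
  // If the cut landed mid-word, discard the partial word
  if (!/\s/.test(text.charAt(maxLen))) words.pop();
  // Don't end on an unwanted word
  while (words.length && stopWords.includes(words[words.length - 1].toLowerCase())) {
    words.pop();
  }
  return words.join(' ');
}

trimToWord('jump over the lazy dog', 14); // → 'jump over'
```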

One paragraph, many sentences

29 June 2013
Tags: xquery, nlp, and exist-db.

Where does each sentence in this post start and end? Given some schooling and well-punctuated text, our brains handle this task pretty easily, but it turns out that telling a computer how to split a text into sentences is a bit tricky. In modern English we have a general rule: sentences begin with a capitalized word and end with a period. But a program to isolate sentences has to account for plenty of exceptions: other words in the sentence might be capitalized, and abbreviations can contain and end with periods, whether or not they fall at the end of a sentence.

For some time, I’ve wondered about how to find the start and end of sentences, but couldn’t ever devise an approach that worked. Then, recently, inspired by my friend Josh’s comment during a course we were taking on XQuery (“Could we use XQuery to pull out all topic sentences in a manuscript to help ensure the narrative flows logically and smoothly?”), I decided to return to the challenge. After some research I found a hint in this post on stackoverflow, which unlocked a core insight: if you look at the individual words in the text, one at a time and in order, you can look for signs of a sentence break, and then apply logic against surrounding words and account for known exceptions to the rule, such as abbreviations or stock phrases. Proceeding this way through a text, you can isolate each sentence.
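The word-at-a-time idea can be shown in miniature. Here is a toy JavaScript sketch of the approach (far simpler than the XQuery gist, with a hard-coded abbreviation list purely for illustration):

```javascript
// Walk through the words in order; end a sentence when a word ends in
// . / ! / ?, is not a known abbreviation, and the next word (if any)
// begins with a capital letter or an opening quote.
function splitSentences(text, abbreviations = ['Mr.', 'Mrs.', 'Dr.', 'U.S.']) {
  const words = text.split(/\s+/);
  const sentences = [];
  let current = [];
  words.forEach((word, i) => {
    current.push(word);
    const endsInStop = /[.!?]["'”’]?$/.test(word);
    const isAbbrev = abbreviations.includes(word);
    const next = words[i + 1];
    if (endsInStop && !isAbbrev && (!next || /^["“'‘]?[A-Z]/.test(next))) {
      sentences.push(current.join(' '));
      current = [];
    }
  });
  if (current.length) sentences.push(current.join(' '));
  return sentences;
}

splitSentences('Dr. Smith arrived. He sat down.');
// → ['Dr. Smith arrived.', 'He sat down.']
```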

So, should you ever have the need for splitting text into sentences—perhaps for looking at all topic sentences in a chapter, or for counting sentences in a paragraph—check out https://gist.github.com/joewiz/5889711. It’s a pair of XQuery functions for analyzing a chunk of text and identifying the sentences within. It’s a naive approach (see my notes at the top of that page), but it does a pretty good job with newspaper articles and other edited English prose.

It takes text like this paragraph (from FRUS):

154613. You should arrange to deliver following note to North Vietnamese Embassy. If in your opinion it can be done without creating an issue, we would prefer that you ask North Vietnamese Charge to come to your Embassy to receive note. “The U.S. Government agrees with the statement of the Government of the DRV, in its note of April 27, that it is necessary for Hanoi and Washington to engage in conversations promptly. The U.S. Government notes that the DRV has now agreed that representatives of the two countries should hold private discussions for the sole purpose of agreeing on a location and date. The U.S. Government notes that the DRV did not respond to its suggestion of April 23 that we meet for this limited purpose in a ‘capital not previously considered by either side.’ The U.S. Government suggested the DRV might wish to indicate three appropriate locations suitable for this limited purpose. The U.S. Government does not consider that the suggestion of Warsaw is responsive or acceptable. The U.S. Government is prepared for these limited discussions on April 30 or several days thereafter. The U.S. Government would welcome the prompt response of the DRV to this suggestion.”

and returns this:

  1. 154613.
  2. You should arrange to deliver following note to North Vietnamese Embassy.
  3. If in your opinion it can be done without creating an issue, we would prefer that you ask North Vietnamese Charge to come to your Embassy to receive note.
  4. “The U.S. Government agrees with the statement of the Government of the DRV, in its note of April 27, that it is necessary for Hanoi and Washington to engage in conversations promptly.
  5. The U.S. Government notes that the DRV has now agreed that representatives of the two countries should hold private discussions for the sole purpose of agreeing on a location and date.
  6. The U.S. Government notes that the DRV did not respond to its suggestion of April 23 that we meet for this limited purpose in a ‘capital not previously considered by either side.’
  7. The U.S. Government suggested the DRV might wish to indicate three appropriate locations suitable for this limited purpose.
  8. The U.S. Government does not consider that the suggestion of Warsaw is responsive or acceptable.
  9. The U.S. Government is prepared for these limited discussions on April 30 or several days thereafter.
  10. The U.S. Government would welcome the prompt response of the DRV to this suggestion.”

I chose this text because of its many capitalized words and abbreviations and its variations in punctuation. I also tested against several New York Times and Boston Globe articles, a tricky portion of Moby-Dick that threw off some other utilities, and some made-up sentences with edge cases.
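The core of such a naive splitter can be sketched in a few lines of XQuery. This is an illustrative sketch of my own, not the gist’s actual code (which also copes with abbreviations and quotation marks): it breaks the text wherever a sentence-ending punctuation mark, optionally followed by a closing quote, precedes whitespace and a capital letter.

```xquery
(: Illustrative sketch only, not the gist's code. Mark a break after
   ., ?, or ! (optionally followed by a closing quote) when the next
   non-space character is upper-case, then split on the marks. :)
declare function local:sentences($text as xs:string) as xs:string* {
  let $marked := replace($text, "([.?!][""'’”]?)\s+(\p{Lu})", "$1&#10;$2")
  for $sentence in tokenize($marked, "&#10;")
  return normalize-space($sentence)
};
```

Note that this bare rule would also break after the “U.S.” abbreviations in the example above; handling cases like that is exactly the refinement the gist adds.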

If you want to give it a try with your own text, you can copy the entire gist and paste it into eXide, the XQuery sandbox for eXist-db; click “Run” to see the results. (It should work with any XQuery implementation, though.)

Thanks for the inspiration, Josh! And thanks to Christine Schwartz for reminding me that GitHub gists are a great place to throw things up—things that may not deserve a full-blown repository of their own. But since gists are repositories under the hood, pull requests are welcome. There’s surely room for improvement in this code.

Reflections on learning XQuery

27 April 2013
Tags: xquery, xml, learning, and dh.

Two weeks ago several of my colleagues and I were lucky enough to take a course on “XQuery for Documents” taught by Michael Sperberg-McQueen.

I’ve taken courses on XQuery before—all excellent—but this one was absolutely unique.

While Michael wasn’t directly involved in the working group that produced the XQuery language, he wasn’t far from it, given his involvement with the W3C and the creation of XML and XSD, not to mention TEI.  Thus, besides giving an expert introduction to XQuery, he shed light on the communities who came together, with different interests, personalities, politics, and intellectual frameworks, to create this remarkable language.

My colleagues and I—all historians, some of whom specialize in the history of technology—were fascinated by this aspect of the course.  We came away with a solid foundation in the language, and an appreciation for the milestone it marks in the history of programming languages and technology.

(The creation of XQuery, as well as the creation of TEI, would be two worthy dissertation topics.)

Inspired by Michael’s course, I’ve begun thinking about ways to both share my own appreciation for XQuery and related technologies, tools, and tips that I have discovered in my own work, and to give this a human dimension, rather than just a purely instructional one.  I’ve heard it said that the best camera is the one that you have with you.  In that spirit, I’ll start writing here, in a blog that I already have.

So, let’s begin with a brief introduction about how I came to learn XQuery.

In mid-2007, as a freshly minted history PhD, my new job presented me with the challenge of revamping a website for a group of venerable historical publications.  (I’ve written an article about the project.) After researching the formats in use for encoding books and historical documents, I decided to adopt TEI as the format for the publications.   TEI P5—the standard’s 5th major version—was released in October 2007, just in time for my project to adopt it and benefit from its many advances from the start.

This left the question: how to turn the huge volumes of TEI into a website that would allow historians and the public to view, browse, and search the publications?  In a superb stroke of luck, I met James Cummings at the TEI conference at U Maryland College Park in October 2007 and told him about my project.  James encouraged me to look into native XML databases.

James’ suggestion led me to eXist-db, whose Shakespeare demos impressed me with their speed and precision.  Moreover, it supported XSLT, which would allow me to use the stylesheets I had adapted from the TEI community for turning my XML into HTML for the web.

But besides XSLT, eXist-db also used XQuery for its search and scripting operations.  XQuery 1.0 had just achieved recommendation status in January 2007, and Priscilla Walmsley’s O’Reilly book on XQuery was published in June—the month I graduated.

As I taught myself XQuery with eXist-db (not to mention the indispensable oXygen XML Editor), I grew ever more comfortable with the language.  Thanks to an invaluable hint in the right direction from David Sewell (using the typeswitch-based approach he outlined in a mailing list posting), I migrated all of my XSLT routines to XQuery.  Now, rather than having to master two languages, I could focus on the one that did everything I needed: XQuery.
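For the curious, the typeswitch idiom looks roughly like this: a single recursive function dispatches on each node’s type and name, doing the job that a family of XSLT templates would otherwise do. The element mappings below are illustrative examples of my own, not the actual conversion code.

```xquery
declare namespace tei = "http://www.tei-c.org/ns/1.0";

(: Illustrative sketch of the typeswitch approach: recurse over the
   node tree, emitting HTML for the TEI elements we recognize and
   descending through any we do not. :)
declare function local:tei2html($nodes as node()*) as item()* {
  for $node in $nodes
  return
    typeswitch ($node)
      case text() return $node
      case element(tei:p) return
        <p>{ local:tei2html($node/node()) }</p>
      case element(tei:hi) return
        <em>{ local:tei2html($node/node()) }</em>
      case element() return
        local:tei2html($node/node())
      default return ()
};
```

Adding support for a new element means adding one more case, rather than another template.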

Ever since that moment, I’ve used XQuery almost exclusively, and together with TEI, eXist-db, and oXygen, it has been my gateway into the world of digital humanities and software development.  XQuery was challenging to learn but still very accessible.  The successes have been rewarding.  Even after five years I am still learning new aspects of the language and uncovering new ways to apply it.

After several years of being the sole user of XQuery in my office, I’m thrilled that so many of my colleagues are learning XQuery.  It makes perfect sense given the amount of TEI and XML that we have now created.  Thanks to Michael for getting us started, and for linking us intellectually to the shared sense of purpose and possibilities that led to the creation of XML, TEI, XQuery, and the other foundational standards that have enabled and empowered us to do our work.

I look forward to writing more (probably here, but if elsewhere, I’ll post a link)—for them, and anyone interested in following along.

John Horlivy Remembered

30 December 2012
Tags: writing.

On this sleepless night, after feeding my daughter, I began experimenting with Tumblr as a place to record some thoughts longer than 140 characters and came across the option to post “quotes.” The quote that first came to mind was that of an English teacher at my high school, John Horlivy. I never took any of John’s courses, but he became a mentor of sorts to me and several of my friends who were deeply interested in writing poetry. He was so supportive - both during the height of my writing in my sophomore and junior years, and as the poetic juices started to slow in college as my interests turned elsewhere. I remember returning to campus and talking with him about that, and he told me that it was completely normal to go through dry spells - but that the most important thing was to keep writing. “Keep writing.” Well, John, it’s fitting that I dedicate this post - only the latest attempt to heed your advice - to you.

Keep writing

30 December 2012
Tags: writing.

Keep writing

–John Horlivy

Upside Down

07 April 2012
Tags: xml, xquery, existdb, and dhoxss.


After the Digital.Humanities@Oxford Summer School 2011 my post-doc project went topsy-turvy. All those mighty Xs (XML, XHTML, XQuery, XPath, eXist, oXygen…) persuaded me to drop SQL and say goodbye to the command line.

Ah, wonderful memories of DHOXSS 2011 - I just came across this post about it from last year.  I wonder how depalaeographia fared?

Oh, and hello, world of tumblr!