XML beyond the hype

I was at a workshop today in York, run by the Archaeology Data Service, entitled “XML for archaeologists: Beyond the Hype”. I went along because I felt that I didn’t really understand how XML works, and I have to say I was very pleasantly surprised: I absolutely loved the workshop and came away feeling that I’d made the conceptual link I needed in order to understand XML.

My problem was that I was confused by the idea of a “language”. I understood that in XML you defined your own markup tags, and that somewhere there was a schema that explained what those tags actually meant. My sticking point was that I couldn’t figure out how you explained what something was without relying on a whole bunch of other things that you’d need to explain as well. How would you explain to a computer what an elephant, or should I say an <elephant>, is?

What I now see, or what makes sense to me (i.e. not necessarily the truth, but good enough for me to work with), is that it’s actually more like a grammar than a language. It defines objects, but only in terms of rules and relationships. In a spoken or written language, that would be like knowing how to conjugate a verb or use it in a sentence without ever knowing what that verb meant. In XML, you don’t have to explain what an <elephant> is, but you can define that it has a <trunk>, four <leg>s, two <ear>s and so on. The PC doesn’t have to understand what any of those things actually are, but it understands the relationship between them, because you’ve defined that in your schema. Once I made that connection (sorry to those who think I’m terribly slow) I really enjoyed the workshop.
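To make the “grammar, not language” idea concrete, here is a minimal sketch in Python. It is not a real schema language such as a DTD or XML Schema, just a hand-rolled stand-in, and the element names (elephant, trunk, leg, ear) and the RULES table are my own hypothetical choices for illustration. The point is that the program checks the relationships without knowing what any of the words mean.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical "grammar": each element name maps to the exact number of
# each child element it must contain. Nothing here says what a trunk *is*.
RULES = {
    "elephant": {"trunk": 1, "leg": 4, "ear": 2},
}

def conforms(xml_text, rules=RULES):
    """Return True if the root element has exactly the children the rules require."""
    root = ET.fromstring(xml_text)
    expected = rules.get(root.tag)
    if expected is None:
        return False  # the grammar has no rule for this element
    found = Counter(child.tag for child in root)
    return found == Counter(expected)

good = "<elephant><trunk/><leg/><leg/><leg/><leg/><ear/><ear/></elephant>"
bad = "<elephant><trunk/><leg/><leg/><leg/><ear/><ear/></elephant>"  # only three legs

print(conforms(good))  # True
print(conforms(bad))   # False
```

A real schema does the same job far more thoroughly (ordering, attributes, data types), but the principle is the one above: rules about structure, with no understanding of meaning required.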

What also became clear is that archaeologists, and in fact anyone who has to classify objects and assign behaviours to them, fundamentally understand XML even if they don’t realise that they do. The key is to persuade people to codify those objects and their behaviours, and to get them all to use the same language/schema to describe them.

There are some schemas available in archaeology: in the UK the best known is MIDAS XML, but there’s also ArchaeoML. Some people have argued that they don’t go far enough, partly because there isn’t enough formalisation of terms of reference, for colours for example. To paraphrase a colleague’s analogy, the colour “black” has a completely different meaning depending on whether you are in greyscale, monochrome or full colour. However, I can’t believe that these issues haven’t been considered before, and in fact most can be avoided by ensuring that everyone uses and understands a common vocabulary. MIDAS XML, which is the schema I am most familiar with, comes with a set of thesauri and word lists that should help get around this issue, and if in doubt there are much larger models that we can call upon.
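As a rough illustration of how a shared word list helps, the sketch below checks the colour terms in a record against an agreed vocabulary. Everything in it is an assumption made for the example: the element names (finds, find, type, colour) and the four-term COLOUR_TERMS list are hypothetical and are not taken from MIDAS XML or its thesauri.

```python
import xml.etree.ElementTree as ET

# Hypothetical word list; a real project would load the terms from the
# thesaurus that accompanies its chosen schema.
COLOUR_TERMS = {"black", "grey", "red", "brown"}

def unknown_colours(xml_text, allowed=COLOUR_TERMS):
    """List the <colour> values in a record that are not in the agreed word list."""
    root = ET.fromstring(xml_text)
    values = [(el.text or "").strip().lower() for el in root.iter("colour")]
    return [v for v in values if v not in allowed]

record = """
<finds>
  <find><type>sherd</type><colour>Black</colour></find>
  <find><type>sherd</type><colour>blackish</colour></find>
</finds>
"""

print(unknown_colours(record))  # ['blackish']
```

The check knows nothing about colour itself; it only knows which words everyone has agreed to use, which is all that is needed to keep records comparable.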

One of the most interesting presentations that I attended today was about the TEI, or Text Encoding Initiative, which aims to provide a toolkit for encoding literary texts and other information for online use. Pretty much everybody present at the talk could immediately see a use for this in dealing with archaeological grey literature: the huge mass of unpublished reports that commercial archaeological units produce each year. It is a notoriously difficult resource to quantify, let alone search, as most units have neither the time nor the money to make this information available in any sensible form.

Currently the job of trying to quantify this resource falls to the Archaeological Investigations Project, based at Bournemouth University. I can quite understand why the AIP needs to exist, but currently its main method of data collection is to send people to visit units, read all of their reports for a year and type the data into a database. They need to do this so that they have a consistent method of recording the pertinent details about a site and what was found, separately from the remit of the unit, which is to fulfil the terms of the archaeological brief. If units could be persuaded to adopt a common schema and methodology for marking up their reports, as the TEI has demonstrated, then surely this need to actually go and visit every unit in the country could be avoided? After all, broadband is a lot cheaper than train tickets! Marking up content does take time and would add to the cost of each project, but (presumably) this could be offset somehow by the cost savings on the AIP project for English Heritage.
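Here is a sketch of what that could look like in practice, assuming reports were marked up inline with an agreed set of tags. The tag names (report, p, site, period, find) are invented for this example and are not real TEI or MIDAS XML elements, and the site name is made up. Once the markup is there, pulling out the details that someone currently has to read and retype becomes a few lines of code.

```python
import xml.etree.ElementTree as ET

def summarise(report_xml):
    """Pull the marked-up site, period and finds out of a report's running text."""
    root = ET.fromstring(report_xml)
    return {
        "site": root.findtext(".//site", default="").strip(),
        "period": root.findtext(".//period", default="").strip(),
        "finds": [f.text.strip() for f in root.iter("find") if f.text],
    }

# A hypothetical one-sentence report with inline markup.
report = (
    "<report><p>Excavation at <site>Example Farm</site> revealed "
    "<period>Roman</period> ditches containing <find>pottery</find> "
    "and <find>animal bone</find>.</p></report>"
)

print(summarise(report))
# {'site': 'Example Farm', 'period': 'Roman', 'finds': ['pottery', 'animal bone']}
```

The report still reads as ordinary prose to a human, but the same file can be harvested remotely and loaded straight into a database.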

None of this is rocket science, I’m well aware of that; it’s just a matter of getting enough people on board and coming up with a way forward that we’re all happy with…