Text searching in the CELT database

Research note

Peter Flynn

University College Cork Computer Centre

June 2000

Abstract

The CELT database is a growing corpus of literary, historical, social, and political texts relating to Ireland and Irish culture. The text is encoded using the TEI DTD, and is available online in SGML, HTML, and plaintext format.
A simple search facility exists using an old version of PAT, but considerably more complex searching is possible, and likely to be demanded by the community.
This paper outlines the possibilities for complex searching, and invites comments and suggestions as to how this can be made useful and usable.

Background

The CELT project was launched in June 1997 to create an online corpus of text and other material of Irish interest. The texts are encoded in SGML according to the Document Type Definition (DTD) of the Text Encoding Initiative (TEI), and are available from the project web server (http://www.ucc.ie/celt/) in several formats:

HTML, for online reading and printing;
The original SGML, for analysis and research;
Plain unmarked text, for use in non-SGML systems.

The project has been fortunate in being able to make use of the texts made available in Cork by an earlier project in a similar field (CURIA, which finished in early 1997). As of the end of August 1997 the CELT project had just under a million words available online.

Work began on developing search software for the CURIA project, and has continued in the CELT project, using the PAT software from OpenText Corporation. It has now reached the stage where decisions on the nature of the interface require input from the potential user communities.

Text markup

The texts are marked at two levels:

structural (natural or man-made divisions such as chapters of books, years of annals, poems within anthologies, paragraphs, verses, lines, pages, folia, lists, items etc);
descriptive (personal names, placenames, dates, events, artefacts, numbers, etc).

All texts have the first type of markup applied, but work is not yet complete on some texts in respect of the descriptive markup. An example of a printed text, scanned into computer form, and the resulting marked text and Web version can be seen in Figures 1 to 4.

*Figure 1. The printed source edition*

Documents are scanned from printed editions (Figure 1) into plaintext files (Figure 2). Computers are unable to disambiguate the many and varied typographical nuances of printed texts unaided, so it is more reliable to add them back in as markup than to retain undifferentiated uses of italics, bold, underlining, etc.

Figure 2. The document scanned into a plaintext file


The Annals of Ulster

	  

431  Kl. Ienair                    . Anno ab Incarnatione Domini  H16ra

     .cccc.xxx.i. Palladius ad Scotos a Celestino urbis Romae 

     episcopo ordinatus episcopus, Etio & Ualerio consulibus, 

     primus mittitur in Hiberniam ut Christum credere potuissent, 

     anno Teodosi uiii.

The markup in SGML uses the specifications of the TEI, which defines a set of named elements suitable for encoding literary, historical, and other academic texts. The TEI scheme is the established standard for this, and is used by almost all similar textbase projects worldwide.

The principle is to enclose the relevant text in 'tags' made up of a name for identification, surrounded by angled brackets, with a slash distinguishing the end-tag from the start-tag: <date>27 August 1763</date>. In some cases below (marked with an asterisk) the CELT project has abbreviated the standard TEI names to ease the editorial process.
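As a rough illustration of the tagging principle, the following Python sketch splits a marked-up string into start-tags, end-tags, and text. The regular expression and the function name `tokenize` are assumptions made for this note; it is not a conforming SGML parser and it ignores attributes.

```python
import re

# Matches a start-tag or end-tag such as <date> or </date>.
# Attributes are matched but ignored in this sketch.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.]*)([^>]*)>")

def tokenize(sgml):
    """Yield ('start' | 'end' | 'text', value) pairs for a marked-up string."""
    pos = 0
    for m in TAG.finditer(sgml):
        if m.start() > pos:
            yield ("text", sgml[pos:m.start()])
        yield ("end" if m.group(1) else "start", m.group(2))
        pos = m.end()
    if pos < len(sgml):
        yield ("text", sgml[pos:])

tokens = list(tokenize("<date>27 August 1763</date>"))
print(tokens)
```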

The TEI scheme provides for the division of the text into its component sections, which can be labelled according to usage (Figure 3), for example:

<DIV0>, the outermost 'container', holding (in this example) an entire Annales;
<HEAD> to hold headings;
<DIV1>, one year's Annals;
<DIV2>, a single entry within a year;
*<PB> for pagebreaks;
*<MLS> for other milestones;
<P> for paragraphs.

Within paragraphs, the TEI provides for a large number of elements to identify significant words or phrases, of which a few are shown in Figure 3:

<DATE> for dates;
*<EX> for expansions;
<GAP> for gaps in the source;
*<FRN> for foreign words (i.e. those not in the base language of the document);
*<PS> for personal names;
*<FN> for forenames within personal names;
*<PN> for placenames;
<TERM> for other terms such as occupations;
<NUM> for numbers.

There are many more: for details see the documentation on the TEI.

Figure 3. The file marked up in SGML in TEI format

...

<DIV0 TYPE="Annals">

<HEAD>The Annals of Ulster</HEAD>

  <PB N="38">

  <MLS N="16ra" UNIT="folio/column">

  <DIV1 N="U431" TYPE="Annal">

    <DIV2 N="U431.0" TYPE="Entry">

      <P><DATE VALUE="0431-01-01">Kl. Ien<EX>air</EX>.<GAP><FRN LANG="la">Anno 

         ab Incarnatione Domini .cccc.xxx.i.</FRN></DATE></P>

    </DIV2>

    <DIV2 N="U431.1">

      <P><FRN LANG="LA"><PS><FN>Palladius</FN></PS> ad <ON

         TYPE="people:Irish">Scotos</ON> a <PS><FN>Celestino</FN></PS> 

         urbis <PN TYPE="city">Romae</PN> <TERM TYPE="bishop">episcopo</TERM> 

         ordinatus <TERM TYPE="bishop">episcopus</TERM>, 

         <PS><FN>&Eogon;tio</FN></PS> &ampersir; <PS><FN>Ualerio</FN></PS>

         consulibus, primus mittitur in 

         <PN TYPE="country:Ireland">Hiberniam</PN> ut Christum credere 

         potuissent, anno <PS><FN>Teodosi</FN></PS> 

         <NUM VALUE="8">uiii</NUM></FRN>.</P>

    </DIV2>

  </DIV1>

...

The TEI/SGML markup is ideal for making a permanent record of an electronic text, preserving whatever features of the original are deemed necessary, in a format that is protected by an International Standard (ISO 8879:1986) so it is not subject to arbitrary changes in manufacturers' proprietary software.

The World-Wide Web also uses SGML, but currently only a single Document Type Definition, HTML, the HyperText Markup Language. For the moment, therefore, the TEI files must be converted to HTML for viewing in an ordinary Web browser. Although the text is preserved in conversion, some of the information in the markup cannot be acted upon, as HTML is extremely simple and does not possess the descriptive power of the TEI DTD. Users of real full SGML browsers like Panorama or MultiDoc Pro or SGMLC can of course use the original TEI files without loss of information.

*Figure 4. The text converted to HTML for use in the Web*

The Annals of Ulster (Author: Unknown) p.38 {folio & column 16ra} Annal U431 Kl. Ienair. [ ] Anno ab Incarnatione Domini .cccc.xxx.i. Palladius ad Scotos a Celestino urbis Romae episcopo ordinatus episcopus, Etio & Ualerio consulibus, primus mittitur in Hiberniam ut Christum credere potuissent, anno Teodosi uiii.

Searching

The project acquired the PAT text indexing and search program at an early stage. This allows any file of text to be searched at very high speed, and it understands the concept of SGML markup, dividing the text into regions delimited by tags in angled brackets.

PAT is a text-mode terminal program, so for use in the Web a 'front-end' is needed. This is a UNIX shell script which the Web server can run: it takes information supplied by a user from a Web page form, feeds it to PAT as a search request while it is reading a file, and interprets the search results back into HTML for display.
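The last step of such a front-end, turning PAT's printed results into HTML, might look something like the sketch below. It is written in Python for illustration only (the project's script is a UNIX shell script), and it assumes the single-line result format visible in the session figures in this note: a location number, a comma, and a context string wrapped in two dots at each end. The function name `hits_to_html` is hypothetical.

```python
import html
import re

# One line of `pr` output: a location number, a comma, then context wrapped in "..".
HIT = re.compile(r"^\s*(\d+),\s\.\.(.*)\.\.\s*$")

def hits_to_html(pat_output):
    """Turn single-line PAT `pr` results into an HTML list, escaping the raw markup."""
    items = []
    for line in pat_output.splitlines():
        m = HIT.match(line)
        if m:  # prompts, match counts, and blank lines are skipped
            location, context = m.group(1), m.group(2)
            items.append(f"<li>{location}: <code>{html.escape(context)}</code></li>")
    return "<ul>\n" + "\n".join(items) + "\n</ul>"

sample_output = (
    "  1: 4 matches\n"
    "\n"
    '    24727, ..ON> a <PS><FN>Celestino</FN></PS> urbis <PN TYPE="city">Romae</P..\n'
)
page = hits_to_html(sample_output)
print(page)
```

Escaping matters here because the context strings contain fragments of SGML markup which a browser would otherwise try to interpret. Results that span several lines, such as whole paragraphs, are not handled by this sketch.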

Figure 5. Using PAT in a terminal window (commands typed by the user are in bold type)


imbolc% pat G100001.dd



       Pat Text Database System, Release 4.1

    Copyright 1987-1993 by Open Text Corporation



>> celes

  1: 4 matches



>> pr

   451616, ..</PN>, regnum celeste adipisci meruit post poenitentiam</FRN>.</..

    24727, ..ON> a <PS><FN>Celestino</FN></PS> urbis <PN TYPE="city">Romae</P..

   252512, ..la ad speciem celestis arcus <NUM VALUE="4">.iiii.</NUM> uigilia..

   571837, ..NG="la">Ignis celestis percusit uirum in <TERM TYPE="oratory">or..



>> [24727]

  2: one match

 

>> region P incl %

  3: one match

 

>> pr.region.P

    24632, ..<P><FRN LANG="la"><PS><FN>Palladius</FN></PS> ad <ON

TYPE="people:Irish">Scotos</ON> a <PS><FN>Celestino</FN></PS> urbis 

<PN TYPE="city">Romae</PN> <TERM TYPE="bishop">episcopo</TERM> ordinatus 

<TERM TYPE="bishop">episcopus</TERM>, <PS><FN>&Eogon;tio</FN></PS> &ampersir; 

<PS><FN>Ualerio</FN></PS> consulibus, primus mittitur in <PN 

TYPE="country:Ireland">Hiberniam</PN> ut Christum credere potuissent, anno 

<PS><FN>Teodosi</FN></PS> <NUM VALUE="8">uiii</NUM></FRN>.</P>..



>> quit

imbolc%

Figure 5 shows a PAT session in a normal terminal window:

the user types the pat command and the name of the document file for the Annals of Ulster;
the search term is (let us assume) 'celes';
PAT says there are four occurrences;
the user asks for them to be printed (displayed) with the pr command;
the second of them, starting at location 24727, is the one she is interested in, so this number is typed, using square brackets to mean 'find this one only';
she wants the whole paragraph which contains it; this is done with the 'region' command for the P element, set to include the result of the most recent search (symbolised by the percent sign);
this is then printed out with the pr.region.P command;
and the run is terminated with the quit command.
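The sequence of commands in this session is regular enough to be generated by a program, which is essentially what a Web front-end has to do. The Python sketch below builds the command lines for one search; it uses only the PAT syntax that appears in Figure 5, and the function name and parameters are assumptions made for illustration.

```python
def pat_commands(term, region=None, pick=None):
    """Build the PAT command lines for one search, following Figure 5.

    term   -- the search string, e.g. "celes"
    pick   -- optional location number selecting a single hit, e.g. 24727
    region -- optional element name whose whole content should be shown, e.g. "P"
    """
    commands = [term]
    if pick is not None:
        commands.append(f"[{pick}]")
    if region is None:
        commands.append("pr")
    else:
        commands.append(f"region {region} incl %")
        commands.append(f"pr.region.{region}")
    commands.append("quit")
    return commands

script = pat_commands("celes", region="P", pick=24727)
print("\n".join(script))
```

A real front-end would also need to validate what the user typed before passing it to PAT; that step is omitted here.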

While this is not difficult, the technology belongs to the late 1950s and requires the user to be able to read a manual written in fairly technical terms, and learn the commands, as well as have an awareness of the principles of markup and the specifics of the TEI DTD. Many scholars do indeed possess these skills, but the interface offered by the World-Wide Web is the one expected by most computer users in the late 1990s.

The construction of a Web interface needs to be based upon the requirements of the user, with as few constraints as possible placed by the technology (some are unavoidable). So far, nine axes have been identified, along which users may wish to set the limits of their searching:

By single document, group, or corpus
Searching a single document is easy to implement for a small number of documents, in that you can create the searchable PAT index very easily. However, as the number of documents grows, the overhead in keeping everything up to date also grows.

By the same token, setting up an entire corpus is fairly straightforward, because there is, in effect, just a single file to index (although it is very large).

Searching groups of files, however, is problematic because it is impossible to predict what files each user would like to see constituting a meaningful 'group'. This is especially true in the field of history, where there can be significant disagreement between scholars about dating, provenance, genre, or authorship.

It would be possible to dynamically form filegroups on a per-user basis, deleting the temporary copies afterwards, but this requires astonishingly large amounts of disk space (upwards of ten times the size of the original files to keep the index, and up to ten times that for the temporary files); it takes a long time to index; and there is a difficulty involved on the Web in 'maintaining state' (knowing when the user has finished: they could simply turn off their computer without notifying the project's system).

The current solution is to keep only the corpus, but develop a way of allowing the user to specify arbitrary filegroups for each search, and using the facilities of PAT to restrict the search to those files only.
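One way to picture this is as a filter on the location numbers that PAT reports. The Python sketch below is not the PAT facility itself: it assumes a hypothetical table recording where each document starts within the single corpus file, and discards hits that fall outside the documents the user selected. The offsets, and the identifiers other than G100001, are invented for illustration.

```python
from bisect import bisect_right

# Hypothetical layout: (start offset in the corpus file, document identifier),
# sorted by offset. Offsets and most identifiers are invented for illustration.
CORPUS_LAYOUT = [(0, "G100001"), (600000, "G100002"), (1500000, "G100003")]
STARTS = [start for start, _ in CORPUS_LAYOUT]

def document_at(offset):
    """Return the identifier of the document containing a corpus offset."""
    return CORPUS_LAYOUT[bisect_right(STARTS, offset) - 1][1]

def restrict_to_group(hit_offsets, group):
    """Keep only the hits that fall inside documents chosen by the user."""
    return [off for off in hit_offsets if document_at(off) in group]

hits = [24727, 451616, 1160011, 2667564]
selected = restrict_to_group(hits, {"G100001", "G100003"})
print(selected)
```

Because the group is just a set of identifiers supplied with each request, nothing has to be stored or re-indexed per user.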

Parts of the text
A TEI file is composed of two major parts: the bibliographic and documentary header, and the text body. PAT, like other SGML search software, can distinguish between the two and search them separately or together.

Within both sections, information is either stored as markup, with its embedded attributes, defining the structure; or as text data (words enclosed in markup). This is shown diagrammatically in Figure 6.

The current assumption is that users will want to search principally the text in the body of the file, with some reference to the markup, so the four areas, ranged in order of importance, are (numbered in Figure 6):
1. text data in the body of the file;
2. markup in the body of the file;
3. text data in the header;
4. markup in the header.
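
The four areas can be separated mechanically. The Python sketch below is a minimal illustration, assuming that the header is closed by a `</TEIHEADER>` end-tag (the element name is not given in this note and is taken from the TEI scheme) and that anything between angled brackets is markup. The function name `four_areas` and the sample document are made up for the example.

```python
import re

TAG = re.compile(r"<[^>]+>")
HEADER_END = re.compile(r"</TEIHEADER\s*>", re.IGNORECASE)  # assumed header end-tag

def four_areas(document):
    """Split a TEI document into the four numbered areas listed above."""
    m = HEADER_END.search(document)
    if m is None:
        raise ValueError("no </TEIHEADER> end-tag found")
    header, body = document[:m.end()], document[m.end():]

    def split(part):
        markup = TAG.findall(part)
        text = [chunk.strip() for chunk in TAG.split(part) if chunk.strip()]
        return markup, text

    header_markup, header_text = split(header)
    body_markup, body_text = split(body)
    return {
        1: body_text,      # text data in the body of the file
        2: body_markup,    # markup in the body of the file
        3: header_text,    # text data in the header
        4: header_markup,  # markup in the header
    }

sample = (
    "<TEI.2><TEIHEADER><TITLE>The Annals of Ulster</TITLE></TEIHEADER>"
    "<TEXT><P>Kl. Ienair.</P></TEXT></TEI.2>"
)
areas = four_areas(sample)
print(areas[1], areas[3])
```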

Figure 6. Areas within a TEI file

Markup in the TEI Header (area 4):

<TITLESTMT>

  <TITLE>                    </TITLE>

  <AUTHOR>       </AUTHOR>

  <RESPSTMT>

    <RESP>           </RESP>

    <NAME>                        </NAME>

  </RESPSTMT>

</TITLESTMT>

Markup in the Text Body (area 2):

<DIV2 N="U431.0" TYPE="Entry">

  <P><DATE VALUE="0431-01-01">   <EX>   </EX>.<GAP><FRN LANG="la">                        </FRN></DATE></P>

</DIV2>

Character data in the TEI Header (area 3):

The Annals of Ulster

Unknown

compiled by Donnchadh &Oacute; Corr&aacute;in and Mavis Cournane

Character data in the Text Body (area 1):

Kl. Ien air. Anno ab Incarnatione Domini .cccc.xxx.i.

Markup elements
Any SGML search system offers the possibility of restricting the user's search to specific elements of the text, as defined by the markup. Thus you can use PAT to search only names, or places, or verses, or list items, or dates, etc.

The problem confronting users of the CELT texts is the large number of elements involved, and the concomitant problem for the interface of how to present them most meaningfully and usably.

As conventionally described, there are two classes of element:
1. Structural: divisions of the text (chapters, sections, annal entries, etc.), paragraphs, verses, notes, bibliographic entries, etc.;
2. Inline: names, dates, places, events, artefacts, occupations, etc.
The assumption (as yet untested) is that the scholar may need to be able to use any of these to modify the search. In addition, there is a perceived need to be able to restrict or extend the context returned as a result of the search. This is probably best expressed here for illustration by couching the search request in plain English:

Find me all occurrences of the surname Domhnaill, but only where they occur in lines of poetry; but in your results, show me the whole verse where the name occurred.

(Systems have existed for many years to handle requests in this manner, but they are beyond the means of most users.) In raw terminal-mode PAT searching, this would be represented as shown in Figure 7. Further recursion would be needed to identify the textual division within which the stanzas occurred, and the text within which the divisions occurred, in order to identify the text itself and generate the canonical reference. This aspect is explained in more detail in the discussion on output style.

Figure 7. Markup-directed search in PAT

>> domhnaill within region L

  1: 3 matches



>> pr

  1160011, ..gon;n <PS><FN>Domhnaill Midhigh</FN></PS>, </L><L>ba dirsan do s..

  2667564, ..ceann <PS><FN>Domhnaill</FN></PS> í <PS><FN>Neill</FN></P..

  4566625, ..l n="106">Tri Domhnaill, Lochlain<ex>n</ex>, Aodh Bán, <l..



>> region LG incl %

  2: 3 matches



>> pr.region.LG

  1159821, ..<LG N="1" TYPE="quatrain">

<L>Aidhes <PS><FN>Brain</FN></PS>, olc fri taidi,</L>

<L>i <PN TYPE="church">Cill Chúile Dumhai</PN>,</L>

<L><PS><FN>Eithne</FN></PS>, inghęn <PS><FN>Domhnaill Midhigh</FN></PS>,</L>

<L>ba dirsan do suidhiu.</L>

</LG>..

  2667286, ..<LG N="13">

<L N="1"><APP><RDG WIT="H">Saoth</RDG><RDG WIT="Eg">Gaoth</RDG></APP> liom an ceann an bhúr laim.</L>

<L N="2">ceann <PS><FN>Spioláin</FN></PS> í <PS><FN>Shuiliobhain</FN></PS></L>

<L N="3">ni doilge leam ceann eile.</L>

<L N="4">ceann <PS><FN>Domhnaill</FN></PS> í <PS><FN>Neill</FN></PS> bhúighe </L>

</LG>..

  4566550, ..<lg n="27"> 

<l n="105">Amhlaibh, is da Tadg gan tár, 

<l n="106">Tri Domhnaill, Lochlain<ex>n</ex>, Aodh Bán, 

<l n="107">Rúaid<ex>r</ex>i, Art, tuirbhim gach tan, 

<l n="108">Muircert<ex>ach</ex>, is da Cathal; 

</lg>..



>>
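
The logic of Figure 7 (find a term inside one element, then show the whole of a wider enclosing element) can be imitated outside PAT. The Python sketch below is a simplified illustration using regular expressions; it assumes that every line and line-group has an explicit end-tag, which is not true of all the texts (the third hit in Figure 7 omits them). The function name `search_within` is hypothetical, and the second line-group in the sample is invented.

```python
import re

def search_within(text, term, inner, outer):
    """Find `term` inside <inner> elements; return each enclosing <outer> element."""
    outer_re = re.compile(rf"<{outer}\b[^>]*>.*?</{outer}>", re.IGNORECASE | re.DOTALL)
    inner_re = re.compile(rf"<{inner}\b[^>]*>(.*?)</{inner}>", re.IGNORECASE | re.DOTALL)
    results = []
    for block in outer_re.finditer(text):
        lines = inner_re.findall(block.group(0))
        if any(term.lower() in line.lower() for line in lines):
            results.append(block.group(0))
    return results

sample = (
    '<LG N="1" TYPE="quatrain">'
    "<L><PS><FN>Eithne</FN></PS>, inghęn <PS><FN>Domhnaill Midhigh</FN></PS>,</L>"
    "<L>ba dirsan do suidhiu.</L></LG>"
    '<LG N="2"><L>ni doilge leam ceann eile.</L></LG>'  # invented second stanza
)
stanzas = search_within(sample, "domhnaill", inner="L", outer="LG")
print(stanzas)
```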

Recognising or suppressing diacritics
Diacritic characters are represented in TEI markup by the standard ISO character entities such as &aacute; for á. This enables the text encoding to mean the same thing across all computing platforms, but requires SGML-aware software to display the correct symbol on the screen. In the case of the ISO Latin-1 characters, this is built into Web browsers. For most other characters, fonts are not commonly available for such use, especially for macrons, long-tailed e (e-caudata), the insular ampersand, etc.

A search engine like PAT requires the character to be entered literally as it exists in the text: &aacute;. A Web browser interface can accept the symbol á from the user and the search script can convert it to entity form for use, but research has shown that not all browsers honour characters typed in this way into form fields.

However, in many cases it is expected that users will want to type the unadorned character for search purposes, so that retrieval will represent all possible orthographic forms, and not just those bearing the exact diacritic. On the other hand, there will be users who explicitly do want to retrieve only the exact diacritic form, so both requirements need to be handled.

On the searching side, it remains to be determined if it is possible to construct a suitable set of commands to PAT which will correctly represent all variants of diacritic that are possible, when a user types a search term without any. This would mean being able to convert a search term such as 'ben' into

ben unadorned

b&eacute;n bén

b&ecirc;n bên

b&egrave;n bèn

b&euml;n bën

b&etilde;n with tilde

b&ecaud;n with long e

b&emacr;n with macron

b&eogon;n with ogonek

b&eunderdot;n with underdot

for every occurrence of a vowel (and some consonants) in both upper- and lower-case. The impracticality of this approach is the foundation for the virtual necessity of using regularisation in searching complex markup.
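
The scale of the problem is easy to demonstrate. The Python sketch below expands a search term into every combination of plain and entity-encoded vowels, using the marks in the list above with the standard ISO names for the first four. It is deliberately crude: it applies every mark to every vowel whether or not that combination occurs in the texts, and it ignores consonants. The function name `variants` is an assumption for illustration.

```python
from itertools import product

# Entity suffixes following the list above; "" stands for the unadorned letter.
MARKS = ["", "acute", "circ", "grave", "uml", "tilde", "caud", "macr", "ogon", "underdot"]
VOWELS = "aeiouAEIOU"

def variants(term):
    """Return every spelling of `term` with each vowel plain or as an entity."""
    choices = []
    for ch in term:
        if ch in VOWELS:
            choices.append([ch if not mark else f"&{ch}{mark};" for mark in MARKS])
        else:
            choices.append([ch])
    return ["".join(combo) for combo in product(*choices)]

print(variants("ben"))
print(len(variants("bean")))  # 100
```

With ten forms per vowel, a term containing two vowels already yields 100 search strings, and one with three vowels yields 1,000.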

For diacritics, the necessity is not absolute: modern search engines exist which can handle them, because they are more SGML-aware than the version of PAT at the disposal of this project, but the investment required is significant. For spellings, the necessity is absolute.

Single word, partial words, or multi-word phrases
PAT's default is to assume a simple search term should be sought at the start of a "word", which is assumed to be a string of characters delimited by white-space, markup, or punctuation. This is effective for English but not always useful for other languages, especially Irish, which displays significant initial mutation.

By indexing every letter, PAT can search everywhere, both within words and for whole words, and the user can direct the distinction by prepending or appending a space to the term to show which end (or both) is to be treated as delimited:
```
>> cool

  1: 4 matches

 

>> pr

  8829462, .. been made to cool the emigration fever by painting the fortunes..

  9258994, ..127">Sinead McCoole, Hazel: a Life of Lady Lavery, 1880--1935 (D..

  5694362, ..des vaches de Cooley (version du Lebor na hUidre, Ogam 15 (1963)..

  8299634, ..l n="360">The cooling brook, the grassy-vested green, <l n="361"..



>> " cool "

  2: one match

 

>> pr

  8829462, .. been made to cool the emigration fever by painting the fortunes..



>> 
```
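The effect of the delimiting spaces can be reproduced with a regular expression. The Python sketch below is an approximation of the behaviour seen in the session above, not a description of how PAT works internally: a leading or trailing space in the term is translated into a word-boundary test at that end. The function name `pat_like_search` and the shortened sample text are assumptions.

```python
import re

def pat_like_search(term, text):
    """Return the start offsets of `term` in `text`, ignoring case.

    A leading or trailing space in `term` means that end must fall on a
    word boundary, mimicking the behaviour shown in the session above.
    """
    core = re.escape(term.strip())
    left = r"(?<!\w)" if term.startswith(" ") else ""
    right = r"(?!\w)" if term.endswith(" ") else ""
    pattern = re.compile(left + core + right, re.IGNORECASE)
    return [m.start() for m in pattern.finditer(text)]

text = "made to cool the fever; Sinead McCoole; vaches de Cooley; The cooling brook"
print(len(pat_like_search("cool", text)))    # 4
print(len(pat_like_search(" cool ", text)))  # 1
```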
Multi-word searches are handled by PAT's proximity detector. Two search terms can be separated by FBY (followed by) or NEAR (on either side), and the closeness can also be configured (the default is 80 characters including markup). These words can be prefixed by NOT for 'not followed by' and 'not near':
```
>> grass fby green

  3: one match

 

>> pr

  8299653, ..ng brook, the grassy-vested green, <l n="361">The breezy covert ..



>> 
```
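The two proximity operators can be sketched in Python as follows. This is an illustration of the idea only: it assumes the distance is measured between the starting positions of the two terms, which may not be how PAT counts it, and the function names are hypothetical.

```python
import re

def positions(text, term):
    """All start offsets of `term` in `text`, ignoring case."""
    return [m.start() for m in re.finditer(re.escape(term), text, re.IGNORECASE)]

def fby(text, first, second, window=80):
    """Offsets of `first` that are followed by `second` within `window` characters."""
    later = positions(text, second)
    return [p for p in positions(text, first)
            if any(0 < q - p <= window for q in later)]

def near(text, first, second, window=80):
    """Offsets of `first` with `second` within `window` characters on either side."""
    other = positions(text, second)
    return [p for p in positions(text, first)
            if any(0 < abs(q - p) <= window for q in other)]

line = 'The cooling brook, the grassy-vested green, <l n="361">The breezy covert'
print(fby(line, "grass", "green"))   # [23]
print(fby(line, "green", "grass"))   # []
print(near(line, "green", "grass"))  # [37]
```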
Deeper searches (one word followed by another followed by another etc) are also possible, but it may on occasion be preferable to save the result of one search and then re-search only that portion (see the item on retention of previous search results).

Regularised spellings
The TEI markup allows the specification of a regularised form of spelling for terms marked in running text. It is essential to note that this is not 'normalisation', which implies that the marked form is corrupt and a lemmatic form is being given: 'regularisation' makes no such claim, and is merely a means of providing a convenient single form for the recognition of variants as a group.

Regularisation such as <name reg="Patrick">patraic</name> lets a search engine accept the search term 'patrick' and retrieve all existing variants, regardless of their spelling in the source. Careful markup of implied references such as <name reg="Patrick">apstal Erenn</name> improves the accuracy of identification, but the implementation of full regularisation involves a very substantial expenditure of manual effort (although with careful pattern-match retrieval some of this can be automated).
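
A minimal Python sketch of a search on the regularised form is given below. It assumes the markup looks exactly like the two examples above (a `name` element with a single `reg` attribute); the function name and the third name in the sample are invented for illustration.

```python
import re

NAME = re.compile(r'<name\s+reg="([^"]*)"\s*>(.*?)</name>', re.IGNORECASE | re.DOTALL)

def find_by_regularised(text, wanted):
    """Return the source spellings of every <name> whose reg value matches `wanted`."""
    return [m.group(2) for m in NAME.finditer(text)
            if m.group(1).lower() == wanted.lower()]

sample = (
    '<name reg="Patrick">patraic</name> ... '
    '<name reg="Patrick">apstal Erenn</name> ... '
    '<name reg="Brigit">brigit</name>'  # invented example of a different name
)
found = find_by_regularised(sample, "patrick")
print(found)
```

The user types one form and receives every variant, including the implied reference, without needing to know how any of them is spelt in the source.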

As the potential user community for the CELT texts extends beyond the Celtic Studies, History, and Mediaeval communities, the availability of regularisation should be regarded as essential.

Output style: KWIC, formatted, with or without context or references
The default output from PAT is the KWIC (Key Word In Context) format familiar to users of many retrieval systems, where the term sought is aligned vertically with the short context spreading off to each side (see output of the pr command, Figure 5 and others).

This is simple to read and provides a useful 'first shot', but it gives no deeper context than the surrounding text and (possibly fragmentary) other markup nearby. It is possible to extend the scope or span of the context provided, but only on the basis of displaying more characters.

For searches bounded by logic such as given in Figure 7, it is possible to display the whole of the enclosing element, and this will typically include the outer enclosing markup together with its numbering system (in Figure 7 this is the N value immediately following the <LG identifier).

In a Web-based search, once the enclosing markup has been identified, it is possible to convert it on-the-fly into HTML and display the result in formatted form (Figure 4). As HTML lacks the descriptive richness of TEI, much of the source markup has to be represented as typographic variation (bold, italics, indenting, etc).

Successive recursion of the search with ever wider element scope lets a search engine find a term, move 'outwards' to its enclosing element, outwards again to the element enclosing that, and ever more outwards until it reaches the outermost enclosure for the document, the <TEI.2> element which identifies the entire text. This method lets the search engine gather the relevant numeric or coded values on the way, and thus provide a formal canonical reference.

This is the method implemented in the experimental current search for the CELT texts. It has the disadvantage that PAT can only perform this referential search for each 'hit' individually: experiments to get it to perform the same process for a group of hits led to a processing time and disk space requirement beyond the scope of the current equipment.
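
The outward walk can be illustrated without PAT by keeping a stack of the elements that are open at the location of a hit. The Python sketch below does this for a cut-down version of the markup in Figure 3. It is an assumption-laden illustration, not the project's script: it treats PB, MLS, and GAP as elements without end-tags, reads only the N attribute, and ignores the page and folio milestones which a full canonical reference would also include.

```python
import re

TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.]*)([^>]*)>")
N_ATTR = re.compile(r'\bN="([^"]*)"', re.IGNORECASE)
EMPTY = {"PB", "MLS", "GAP"}  # elements with no end-tag in the sample markup

def enclosing_numbers(sgml, offset):
    """List (element, N value) for every element open at `offset`, outermost first."""
    stack = []
    for m in TAG.finditer(sgml):
        if m.start() >= offset:
            break
        name = m.group(2).upper()
        if m.group(1):  # end-tag
            names = [entry[0] for entry in stack]
            if name in names:
                # drop the element and anything left open inside it
                del stack[len(names) - 1 - names[::-1].index(name):]
        elif name not in EMPTY:  # start-tag: remember it and its N value
            n = N_ATTR.search(m.group(3))
            stack.append((name, n.group(1) if n else None))
    return [(name, n) for name, n in stack if n is not None]

sample = (
    '<DIV0 TYPE="Annals"><HEAD>The Annals of Ulster</HEAD>'
    '<PB N="38"><MLS N="16ra" UNIT="folio/column">'
    '<DIV1 N="U431" TYPE="Annal">'
    '<DIV2 N="U431.0" TYPE="Entry"><P>Kl. Ienair.</P></DIV2>'
    '<DIV2 N="U431.1"><P><PS><FN>Palladius</FN></PS> ad Scotos</P></DIV2>'
    "</DIV1></DIV0>"
)
reference = enclosing_numbers(sample, sample.index("Palladius"))
print(reference)
```

A single pass over the text can serve any number of hits, which is the step that proved too expensive when done hit by hit inside PAT.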

Retention of previous search results for further refining
It is possible within PAT to store the results of a search on disk as a separate file, and keep them until a later date or subsequent search session, to avoid the need to perform the initial search all over again. While this is simple to do when logged in using a plain terminal interface, its use via the Web is more problematic for two major reasons:
1. Web browser access is inherently anonymous. Password identification systems can easily be installed, but they go against the spirit of openness implicit in a public research project, and imply that there is something secretive involved. They do of course have the benefit of enabling the project to limit access in the case where the server computer is becoming overloaded with (possibly trivial) requests to the detriment of serious research, and that possibility must be borne in mind.
2. The HTTP protocol used on the Web is 'stateless': it does not of itself permit one request to 'know' that it follows as a direct result of a previous request to the same machine. This can be overcome by careful programming of the search pages and the script which drives PAT, but the delay times still found on the Internet, especially from remote locations, do not permit a server to know if one request is being regarded by a user as an extension of a previous one, or as an entirely new search.
Retaining search files is therefore not a practical proposition, unless limitless disk space were available, or unless a strict timeout were imposed so that such files would be deleted automatically after a fixed period (an hour, a day, a week).
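
The timeout option is simple to implement. The Python sketch below deletes saved result files whose modification time is older than a given limit; the directory, the file names, and the one-hour limit in the demonstration are all hypothetical.

```python
import os
import tempfile
import time

def purge_old_results(directory, max_age_seconds, now=None):
    """Delete saved result files older than the timeout; return the names removed."""
    now = time.time() if now is None else now
    removed = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and now - os.path.getmtime(path) > max_age_seconds:
            os.remove(path)
            removed.append(name)
    return removed

# Demonstration in a throwaway directory: one stale file, one fresh file.
with tempfile.TemporaryDirectory() as spool:
    stale = os.path.join(spool, "search-0001.pat")  # hypothetical file names
    fresh = os.path.join(spool, "search-0002.pat")
    for path in (stale, fresh):
        open(path, "w").close()
    two_hours_ago = time.time() - 7200
    os.utime(stale, (two_hours_ago, two_hours_ago))
    removed = purge_old_results(spool, max_age_seconds=3600)
    remaining = os.listdir(spool)
print(removed, remaining)
```

Run at regular intervals, a routine of this kind would bound the disk space used, at the cost of occasionally deleting results a user still wanted.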

Overnight processing of long tasks, with results sent on by email
One solution, for users willing to learn the PAT syntax, is to grant login terminal access to the computer where PAT runs. This was originally envisaged as the prime solution for the researcher who needed long-term in-depth access, and it is still possible: although for security reasons some proof of the user's status in the field is required, this should not be a hindrance for bona fide scholars.

The alternative is to enable users to submit a file of PAT commands from a Web browser or email system, and have the computer do the search in its own time rather than there and then, and return the results in an email message.

This still requires the user to learn the PAT search syntax, but as is evident from the figures in this report, this is not difficult for the experienced computer user.

Conclusions

It is clear that there are many questions needing answers before a suitable search system can be introduced. Experiments with interfaces will continue, but suggestions and comments from potential users are welcomed.

If you feel you can contribute, please see the contact page.
