Text Segmentation Tool | Exon Domesday

Please note that the Text Segmentation Tool causes direct changes to the database content and so is not available to users outside the project team. The documentation here is provided to show how the tool works for interest and for transparency about project methodology.

General principles

This tool was developed to help extract patterns from entries. You can define expressions for the patterns you would like to extract from the entries. Each pattern has a title, a key and an expression. The expression looks like the portion of text you want to find in the entry but with some special syntax to deal with variations. The syntax is described below.

All patterns are directly saved to the database and shared by the team.

Each time you modify a pattern, the tool parses all the entries, normalises the text (more on that below) and searches for each pattern within each entry. The process is very quick (less than 2 seconds), so it is easy to keep making small incremental improvements to the pattern expression and to check how well it matches the text until it covers enough variants.

Pattern Expression Syntax

The pattern expression looks like the text you want to match. Some characters have a special meaning that allows you to match occurrences that vary slightly from your expression.

Construct	Meaning	Example pattern	Matching occurrences
(term1|term2)	alternative terms	before (some|words) after	before some after; before words after; ~~before after~~; ~~before some words after~~
(terms)?	the bracketed expression is optional	before (some words)? after	before after; before some words after; ~~before some after~~
%	any sequence of characters (except spaces)	w%	w; word; wonderful; ~~before word~~; ~~word after~~
<key>	a reference to another pattern, either yours or one internal to the tool	before <number> after	before ix after; before aliam after
<number>	a number	<number>	ix dimidiam
<hides>	measurement of an area	<hides>	iii hidas et ii carrucas
<peasants>
<livestocks>
<moneys>

	More advanced syntax
c?	the preceding character is optional	wo?rd	word; wrd; ~~rd~~; ~~wooord~~
c*	the preceding character can appear any number of times (including none at all)	wo*rd	word; wrd; wooord
c+	the preceding character can appear one or more times	wo+rd	word; ~~wrd~~; wooord
.	a single character (including space)	w.rd	word; w rd; ~~woord~~; ~~wrd~~
^	the beginning of the entry	^word
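
As a rough illustration of how the constructs above relate to ordinary regular expressions, the following Python sketch translates a pattern expression into a regex. It is not the tool's actual implementation: the names `to_regex`, `matches` and `PATTERNS`, and the stand-in definition of `<number>`, are assumptions made for this example, and only simple (non-nested) optional groups are handled.

```python
import re

# Hypothetical stand-in for the tool's internal <number> pattern.
# The real definition is not documented on this page.
PATTERNS = {
    "number": "([ivxlcdm]+|dimidiam|aliam)",
}


def to_regex(expression, patterns=PATTERNS):
    """Translate a pattern expression into a Python regex string (sketch)."""
    # An optional bracketed group absorbs the space that follows it, so that
    # "before (some words)? after" can also match "before after".
    regex = re.sub(r"\(([^()]*)\)\? ", r"(?:(?:\1) )?", expression)

    # Expand <key> references, recursively, so patterns can reuse patterns.
    def expand(match):
        return "(?:" + to_regex(patterns[match.group(1)], patterns) + ")"

    regex = re.sub(r"<([\w-]+)>", expand, regex)

    # % stands for any run of non-space characters (possibly empty).
    regex = regex.replace("%", r"\S*")

    # ?, *, +, . and ^ already mean the same thing in Python regexes.
    return regex


def matches(expression, text):
    """True if the whole of `text` is an occurrence of `expression`."""
    return re.fullmatch(to_regex(expression), text) is not None
```

With these definitions, `matches("before (some words)? after", "before after")` is true and `matches("wo+rd", "wrd")` is false, mirroring the examples in the table.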

Text Normalisation

The text of the entries is normalised to reduce variations that would complicate the pattern expressions.

Transforms:

v is converted to u
7 is converted to et
all punctuation signs are removed
?interlineation
?marginalia
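
The three confirmed transforms can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `normalise` is hypothetical, punctuation is taken to mean ASCII punctuation, and the collapsing of repeated spaces is an extra step assumed here so that removed punctuation does not leave gaps. The two items marked with a question mark are not covered.

```python
import string


def normalise(text):
    """Apply the documented transforms to the text of one entry (sketch)."""
    # v is converted to u
    text = text.replace("v", "u")
    # 7 is converted to et
    text = text.replace("7", "et")
    # All punctuation signs are removed (ASCII punctuation only: assumption).
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Assumption: collapse runs of whitespace left behind by the removals.
    return " ".join(text.split())
```

For instance, the made-up string `"iii virgas, 7 ii carrucas."` becomes `"iii uirgas et ii carrucas"`.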

Possible methodology

One way to construct the patterns is to work on one formula at a time. Copy a frequent instance of the formula from an entry and paste it into a new pattern definition. Then click 'Update' to save the pattern definition and search for occurrences. Look at how many occurrences were not found (the number next to the pattern key in the list at the top). At first this number may be quite high because the pattern expression is exact. The objective is to broaden the pattern iteratively so that it accepts variants, while making sure it does not become too broad. Change the pattern condition to 'Must not have' to bring up entries that may contain variants not yet captured by your expression. Look at one such entry and use the pattern syntax to broaden the expression so that it includes the formula found in that entry. Search again and check that the number not matched is going down. Repeat the process until you are confident your expression captures most variants of your formula.

You'll end up with one-off variants such as the odd term appearing once in your formula and it's up to you if you want to continue modifying the pattern expression to include them or keep them out. It might be worth comparing those exceptions with the image of the entry to see if the difference was introduced by us.

Then move on to the next formula using the same approach.
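
The broadening loop described above can be imitated outside the tool with plain Python regular expressions. In this sketch the entries are invented toy strings (not quotations from the survey), `count_unmatched` is a hypothetical helper, and `\S*` plays the role of `%` in the tool's syntax.

```python
import re

# Invented toy strings for illustration only.
entries = [
    "habet i hidam",
    "habet ii hidas",
    "habet dimidiam hidam",
    "tenet i uirgam",
]


def count_unmatched(regex, entries):
    """Number of entries in which the pattern is not found."""
    return sum(1 for entry in entries if re.search(regex, entry) is None)


# Step 1: start from an exact instance copied from one entry.
exact = r"habet i hidam"
# Step 2: broaden it to accept any quantity and both case endings.
broader = r"habet \S* hida(m|s)"

print(count_unmatched(exact, entries))    # 3
print(count_unmatched(broader, entries))  # 1
```

The one entry still unmatched after broadening is a different formula, which is the signal to stop broadening this pattern rather than stretch it further.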

User Interface

Pattern list

The patterns are listed at the top of the web page. To select a pattern, click its name. It will be highlighted in yellow and an editing form will be displayed below.

To create a new pattern, click 'new-pattern', change its key to something other than 'new-pattern' and provide a pattern expression.

Pattern editing form

The form displays the properties of the selected pattern:

Key: a concise label for the pattern. The key is mainly used to reference the pattern easily from other places, so it is best not to change the key once it has been used elsewhere.
Title: the name of the formula. You can change this as much as you like.
Condition:
- May have: the tool will try to find this pattern in the entries
- Must have: the search results will only contain entries which contain this pattern
- Must not have: the search results will only contain entries which do not contain this pattern
- Ignore: the tool will not try to find this pattern in the entries, the search result is not influenced at all by this pattern
Pattern: an expression that looks like the piece of text you would like to extract from the entries
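
The effect of the four conditions on the search results can be sketched as a simple filter. This is an assumption-laden illustration rather than the tool's code: `select_entries` is a hypothetical name, patterns are given as ordinary Python regexes, and the entries are invented toy strings.

```python
import re


def select_entries(entries, patterns):
    """Keep the entries that satisfy every pattern condition.

    `patterns` is a list of (regex, condition) pairs, where condition is
    'May have', 'Must have', 'Must not have' or 'Ignore'.
    """
    selected = []
    for entry in entries:
        keep = True
        for regex, condition in patterns:
            if condition == "Ignore":
                continue  # the pattern is not searched for at all
            found = re.search(regex, entry) is not None
            if condition == "Must have" and not found:
                keep = False
            elif condition == "Must not have" and found:
                keep = False
            # 'May have': searched for, but never excludes an entry
        if keep:
            selected.append(entry)
    return selected


toy_entries = ["habet i hidam", "tenet i uirgam", "habet ii hidas et i uirgam"]
result = select_entries(
    toy_entries,
    [(r"hida(m|s)", "Must have"), (r"uirga", "Must not have")],
)
```

Here `result` contains only `"habet i hidam"`: the second entry lacks the required pattern and the third contains the excluded one.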

You need to click the 'Update' button for your changes to be saved to the database and the search results regenerated from the new values.

Search parameters

Text Range: limits the search to a range of entries, e.g. 25a1-30a2;50a1
Show first: a number N. The search result will show only the first N entries.

The tool searches only the selected range among the Fiefs entries. 'Show first' only affects the number of entries displayed on the page, not the number of entries the tool searches.
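
One possible reading of the Text Range syntax is sketched below. Everything in it is an assumption inferred from the single example `25a1-30a2;50a1`: that `;` separates alternatives, that `-` joins the inclusive start and end of a range, and that an entry number is made of a number, a letter and a number. The function names are hypothetical.

```python
import re


def entry_key(entry_number):
    """Split an entry number such as '25a1' into a sortable tuple."""
    m = re.fullmatch(r"(\d+)([a-z])(\d+)", entry_number.strip())
    if m is None:
        raise ValueError(f"unrecognised entry number: {entry_number!r}")
    return (int(m.group(1)), m.group(2), int(m.group(3)))


def parse_text_range(text_range):
    """Turn '25a1-30a2;50a1' into a list of (start, end) key pairs."""
    ranges = []
    for part in text_range.split(";"):
        part = part.strip()
        if not part:
            continue
        start, _, end = part.partition("-")
        # A single entry number is treated as a range of one.
        ranges.append((entry_key(start), entry_key(end or start)))
    return ranges


def in_text_range(entry_number, ranges):
    """True if the entry falls inside any of the parsed ranges."""
    key = entry_key(entry_number)
    return any(start <= key <= end for start, end in ranges)


ranges = parse_text_range("25a1-30a2;50a1")
```

Under these assumptions, `in_text_range("27b3", ranges)` is true and `in_text_range("30a3", ranges)` is false.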

Search statistics

E.g. 2216 units found among 2519 in the selected range (87%).

This means that there are 2519 entries in the selected range and that 2216 of them match the conditions you have defined for each pattern.
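
The arithmetic behind the statistics line is simple, as this sketch shows. The function name is hypothetical, and the rounding rule is an assumption: 2216 out of 2519 is about 87.97%, so the 87% in the example suggests the figure is truncated rather than rounded to the nearest whole number.

```python
def search_statistics(found, total):
    """Format the statistics line for `found` matching entries out of `total`."""
    # Integer division truncates the percentage (assumed from the example).
    percent = found * 100 // total if total else 0
    return f"{found} units found among {total} in the selected range ({percent}%)."


line = search_statistics(2216, 2519)
```

With the figures from the example, `line` reads "2216 units found among 2519 in the selected range (87%)."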

Search results

Lists the first N entries among the Fiefs entries that match your Text Range and your pattern conditions.

For each entry:

The entry number
The entry text: this is the normalised text. Any matching pattern is highlighted in green.
The patterns extracted from the entry. If a pattern was not found in that entry, its expression is shown in orange and its key in red.