The XML DAT can be used to parse arbitrary XML and SGML/HTML-formatted data. Once the data is parsed, selected sections of the text can be output for further processing.
One approach to parsing with the XML DAT is to read the XML with a Text DAT or a Web DAT and pass it to a default XML DAT. You can then refine the selection by changing the match-all pattern (the "*") to strings that narrow down the elements included in the output.
XML and HTML Background
XML and HTML data has a tree-like structure made up of elements. Each element is either a tagged element or arbitrary text. Elements may be nested. Tagged elements begin with an opening section and are usually terminated with a closing section.
<greeting a="1" b="2" c="3"> Hello there. </greeting>
The example above contains two elements. The first is a tag element named greeting with attributes a, b and c. The second is a text element consisting of "Hello there."
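As a rough illustration outside the XML DAT, the same snippet can be inspected with Python's standard xml.etree.ElementTree module. This is only an analogy: ElementTree exposes the text as a property of the tag rather than as a separate text element, and the variable names below are chosen for this sketch.

```python
import xml.etree.ElementTree as ET

SOURCE = '<greeting a="1" b="2" c="3"> Hello there. </greeting>'

element = ET.fromstring(SOURCE)

tag_name = element.tag              # name of the tag element
attributes = dict(element.attrib)   # attribute names mapped to their values
text = element.text.strip()         # contents of the nested text, whitespace trimmed

print(tag_name)
print(attributes)
print(text)
```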
XML DAT Operation
The XML DAT begins by parsing its input, creating an internal tree of elements.
The Element Scope parameters are then used to filter out unwanted elements. The remaining elements are used to create the output. The format of the output is determined by the Format parameters. The Output parameters can then be used to further limit the information displayed for each scoped element.
Each parsed element contains a number of details:
Label - Each element is given an arbitrary label named n0, n1, n2 etc. All elements are children of the reserved element labelled 'root'.
Type - Elements are mainly of type 'tag' or 'text', though tag types can be further classified into 'doctype', 'declaration', 'comment' or 'entity'.
Text - The text of an element is the tag name of a tag element, or the arbitrary text contents of a text element. In the above example, the first element would be of type 'tag' and contain text of 'greeting'. The second element would be of type 'text' and contain text of 'Hello there.'
Level - This describes how deeply nested an element is. For example, the single root element always has a level of 0.
Parent - Each element has exactly one parent. The root element does not have a parent.
Children - Each element can have an arbitrary number of child elements.
Attributes - Each tagged element can have an arbitrary number of attributes. Each attribute consists of a name and a value. In the above example, the greeting tag would contain 3 attributes (with names a, b and c and values 1, 2, and 3 respectively).
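The details listed above can be pictured as one record per element. The following is a minimal sketch using Python's standard xml.etree.ElementTree, not the XML DAT's own implementation. The function name flatten, the record layout, the 'root' type string, the order in which labels are assigned, and the choice to nest text elements under their tag are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def flatten(xml_text):
    """Return a list of element records loosely modelled on the details above."""
    # Reserved root record (its type string is an assumption of this sketch).
    records = [{"label": "root", "type": "root", "text": "", "level": 0,
                "parent": None, "children": [], "attributes": {}}]

    def add(parent, kind, text, attributes):
        # Labels n0, n1, ... are handed out in document order (assumed).
        record = {"label": f"n{len(records) - 1}", "type": kind, "text": text,
                  "level": parent["level"] + 1, "parent": parent["label"],
                  "children": [], "attributes": attributes}
        parent["children"].append(record["label"])
        records.append(record)
        return record

    def visit(node, parent):
        tag = add(parent, "tag", node.tag, dict(node.attrib))
        if node.text and node.text.strip():
            add(tag, "text", node.text.strip(), {})
        for child in node:
            visit(child, tag)
            # Text that follows a nested tag still belongs to the enclosing tag.
            if child.tail and child.tail.strip():
                add(tag, "text", child.tail.strip(), {})

    visit(ET.fromstring(xml_text), records[0])
    return records

records = flatten('<greeting a="1" b="2" c="3"> Hello there. </greeting>')
for record in records:
    print(record["label"], record["type"], record["level"], repr(record["text"]))
```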
Parameters - Format Page
sgml - If enabled, the input should be in SGML/HTML format. This includes form data. If disabled, XML format is assumed.
merge - ⊞ -
- Before Element
- After Element
- Inside Element
- Replace Element
mlabel - Merge and label can be used to combine two inputs of data. The second input must be XML formatted, and not SGML/HTML. These two parameters control where and how the second input is merged.
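The four merge positions can be pictured with a small sketch in Python's standard xml.etree.ElementTree. This illustrates the general idea only and is not the XML DAT's implementation. The function name merge, the mode strings, and the exact placement semantics are assumptions inferred from the menu names, and how the DAT itself picks the target element is not modelled here.

```python
import xml.etree.ElementTree as ET

def merge(root, target, incoming, mode):
    """Place `incoming` relative to `target` inside the tree under `root`."""
    if mode == "inside":
        target.append(incoming)          # incoming becomes the last child of target
        return
    # The other modes need the parent of the target element.
    parent = next((p for p in root.iter() if target in list(p)), None)
    if parent is None:
        raise ValueError("target has no parent in this tree")
    index = list(parent).index(target)
    if mode == "before":
        parent.insert(index, incoming)
    elif mode == "after":
        parent.insert(index + 1, incoming)
    elif mode == "replace":
        parent.remove(target)
        parent.insert(index, incoming)
    else:
        raise ValueError(f"unknown mode: {mode}")

for mode in ("before", "after", "inside", "replace"):
    first = ET.fromstring("<doc><a/><b/></doc>")   # stands in for the first input
    second = ET.fromstring("<x/>")                 # stands in for the second input
    merge(first, first.find("b"), second, mode)
    print(mode, ET.tostring(first, encoding="unicode"))
```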
Parameters - Element Scope Page
This section of parameters controls which elements are selected for output. By default all elements are selected.
label - Element labels must match this parameter.
type - Element types must match this parameter.
text - Element text must match this parameter.
name - If an element contains attributes, at least one must have a name matching this parameter.
value - If an element contains attributes, at least one must have a value matching this parameter.
plabel - Elements must have a parent whose label matches this parameter.
ptype - Elements must have a parent whose type matches this parameter.
ptext - Elements must have a parent whose text matches this parameter.
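The scope rules above amount to a set of pattern tests that every selected element must pass. The sketch below models them in plain Python with shell-style wildcards from the standard fnmatch module. It is an illustration under stated assumptions: the XML DAT's actual pattern syntax may differ, the record layout and the function name select are hypothetical, and skipping the root element is a choice made for this sketch.

```python
from fnmatch import fnmatchcase

# Hypothetical element records, including the reserved root.
ELEMENTS = [
    {"label": "root", "type": "root", "text": "", "attributes": {}, "parent": None},
    {"label": "n0", "type": "tag", "text": "greeting",
     "attributes": {"a": "1", "b": "2"}, "parent": "root"},
    {"label": "n1", "type": "text", "text": "Hello there.",
     "attributes": {}, "parent": "n0"},
    {"label": "n2", "type": "tag", "text": "farewell",
     "attributes": {"lang": "en"}, "parent": "root"},
]

# Every pattern defaults to match-all, so all elements are selected by default.
DEFAULTS = {"label": "*", "type": "*", "text": "*", "name": "*", "value": "*",
            "plabel": "*", "ptype": "*", "ptext": "*"}

def select(elements, **patterns):
    """Return the labels of the elements that pass every scope pattern."""
    p = {**DEFAULTS, **patterns}
    by_label = {e["label"]: e for e in elements}
    selected = []
    for e in elements:
        parent = by_label.get(e["parent"])
        if parent is None:
            continue  # the root itself is not output in this sketch
        attrs = e["attributes"]
        if (fnmatchcase(e["label"], p["label"])
                and fnmatchcase(e["type"], p["type"])
                and fnmatchcase(e["text"], p["text"])
                # Attribute tests only apply to elements that have attributes.
                and (not attrs or any(fnmatchcase(n, p["name"]) for n in attrs))
                and (not attrs or any(fnmatchcase(v, p["value"]) for v in attrs.values()))
                and fnmatchcase(parent["label"], p["plabel"])
                and fnmatchcase(parent["type"], p["ptype"])
                and fnmatchcase(parent["text"], p["ptext"])):
            selected.append(e["label"])
    return selected

print(select(ELEMENTS))
print(select(ELEMENTS, type="tag"))
print(select(ELEMENTS, ptext="greeting"))
```

Because all conditions are combined with a logical AND, tightening any single pattern can only shrink the selection.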
Parameters - Output Page
Once elements have been selected for output, what is shown for each of them can be further refined.
oaname - Only output attributes whose name matches this parameter.
oavalue - Only output attributes whose value matches this parameter.
oclabel - Only output children whose label matches this parameter.
show - ⊞ - This controls how the selected elements are presented.
- Summary Table
- Summary Tree
sumtree- This output selection is similar to the summary table, except it outputs an indented ASCII representation of the tree. It can be used to quickly identify areas of interest while picking appropriate parameters.
xml- This outputs an XML-compliant tree of the selected elements. It can then be fed into another XML DAT for further processing.
- Attributes per Row
attribs- This outputs a table of all attributes for the selected elements. Each element attribute is output on a separate row.
- Attributes per Column
attribscol- This outputs a table of all attributes for the selected elements. Each element is output on a single row, where each column represents one attribute.
children- This outputs a table of all children for the selected elements.
text- This outputs all text contents from all elements of type 'text'.
lprefix - This determines whether or not the element label is prefixed when outputting tables of attributes or children.
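The difference between the two attribute table layouts can be sketched in plain Python. This is an illustration only: the header rows, the column order, the empty string used for missing attributes, and the function names are assumptions of this sketch rather than the XML DAT's documented output.

```python
# Hypothetical selected elements with their attributes.
SELECTED = [
    {"label": "n0", "attributes": {"a": "1", "b": "2"}},
    {"label": "n2", "attributes": {"b": "5", "lang": "en"}},
]

def attributes_per_row(elements):
    """One row per element attribute."""
    rows = [["label", "name", "value"]]
    for e in elements:
        for name, value in e["attributes"].items():
            rows.append([e["label"], name, value])
    return rows

def attributes_per_column(elements):
    """One row per element, one column per attribute name."""
    names = []
    for e in elements:
        for name in e["attributes"]:
            if name not in names:
                names.append(name)   # keep first-seen order of attribute names
    rows = [["label"] + names]
    for e in elements:
        rows.append([e["label"]] + [e["attributes"].get(n, "") for n in names])
    return rows

for row in attributes_per_row(SELECTED):
    print(row)
for row in attributes_per_column(SELECTED):
    print(row)
```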
Parameters - Common Page
language - ⊞ - Select how the DAT decides which script language to operate on.
input- The DAT uses the input's script language.
node- The DAT uses its own script language.
extension - ⊞ - Select the file extension this DAT should expose to external editors.
dat- various common file extensions.
- From Language
languageext- Pick the extension from the DAT's script language.
- Custom Extension
customext- Specify a custom extension.
customext - Specify the custom extension.
wordwrap - ⊞ - Enable Word Wrap for Node Display.
input- The DAT uses the input's setting.
on- Turn on Word Wrap.
off- Turn off Word Wrap.
- Input 0 -
- Input 1 -