
TACT: Text Analysis Computing Tools

About TACT 2.1.4

TACT (Text Analysis Computing Tools), a system of 16 programs for MS-DOS, is designed to do text-retrieval and analysis on literary works. Typically, researchers use TACT to retrieve occurrences of a word, word pattern, or word combination. Output takes the form of a concordance, a list, or a table. Programs also can do simple kinds of analysis, such as sorted frequencies of letters, words or phrases, type-token statistics, or ranking of collocates to a word by their strength of association.

TACT is intended for individual literary texts, or small to mid-size groups of such texts, such as Chaucer's poetry, Francis Bacon's Essays, Shakespeare's plays, Jane Austen's Pride and Prejudice, John Irving's The Cider House Rules, similar works in French, German, Italian, Spanish, Latin, and other modern European languages or languages using a roman alphabet, and classical Greek. Using TACT for large corpora can raise problems best handled by software like ICAME Lexa or Open Text Systems Pat.

Note that on Windows 2000/XP, the streamlined menu system (DOUGMENU) will not function correctly, and users will need to run each subprogram separately, as in TACT's older iterations. For some of the subprograms (like USEBASE) this does not pose a problem, but MAKEBASE, TACTFREQ, ANAGRAMS, COLLGEN, FCOMPARE, HSMS2TDB, MAKEDCT, MERGEBAS, PREPROC, SATDCT, TACTSORT, TACTSTAT, and TAGTEXT require one to use the forcedos option from the command line prompt. The correct syntax for this procedure is "forcedos c:\Tact\[subprogram name].exe" (the closing period is not part of the command).

TACT 2.1.4 Users Guide

Version 1.2 of the TACT package came with a printed Guide (Toronto, 1990) that described in detail the applications it contained, including TACT, MAKBAS, COLLGEN, MERGEBAS, and BUILDBAT. In Version 2.1.4, TACT and MAKBAS have been renamed as USEBASE and MAKEBASE, respectively.

UseBase [old TACT]: this is the basic search-and-display application at the heart of the TACT package. For best results from this program the text will need some sort of tagging (see below). Before a text will run in USEBASE it must be prepared as a textual database using the MAKEBASE application described below.

MakeBase [old MakBas]: creates a USEBASE textual database (*.tdb) from a marked-up ASCII text. The default control file (default.mks) recognises angle brackets -- "<" and ">" -- as tag delimiters.

MergeBase: combines from two to four files generated by MAKEBASE into one large ".tdb" file.

TACTSort: takes an input ASCII file and sorts each line according to a user-specified key. The user can sort using a particular tab-delimited field or the full line.
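The kind of sort TACTSort performs can be sketched in Python. This is only an illustration of sorting on a whole line or on one tab-delimited field; the function name sort_lines and the 0-based field number are hypothetical, and the sketch uses Python's default character order rather than the sort order TACT reads from an .MKS file.

```python
def sort_lines(lines, field=None):
    """Sort lines by the whole line, or by one 0-based tab-delimited field."""
    def key(line):
        if field is None:
            return line
        parts = line.split("\t")
        # A line with too few fields gets an empty key and so sorts first.
        return parts[field] if field < len(parts) else ""
    return sorted(lines, key=key)


rows = ["pear\t3", "apple\t7", "fig\t1"]
print(sort_lines(rows))           # whole-line sort
print(sort_lines(rows, field=1))  # sort on the second field, compared as text
```

Fields are compared as text here, so "10" would sort before "9"; a numeric key would need an explicit conversion.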

Version 2.1.4 features a number of revisions to the existing tools:

Collgen: takes a USEBASE .tdb file and produces a list of all repeated phrases, that is, where a sequence of two or more words appears more than once. In version 2.1.4, COLLGEN has been substantially altered. It now allows the user optionally to produce a list of all repeating phrases and subphrases, or only the maximally occurring phrases. That is, if a subphrase occurs the same number of times as a larger phrase that contains it, the subphrase will not be included in the list of repeating phrases. This eliminates much redundancy in the output.
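The difference between the two phrase lists can be illustrated with a short Python sketch. It is not Collgen's own algorithm: the function name repeated_phrases, the max_len cut-off, and the plain list of words as input are all assumptions made for the example.

```python
from collections import Counter


def repeated_phrases(words, max_len=4, maximal_only=False):
    """Count phrases of 2..max_len words that occur more than once."""
    counts = Counter()
    for n in range(2, max_len + 1):
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    repeated = {p: c for p, c in counts.items() if c > 1}
    if not maximal_only:
        return repeated

    def contained_in(short, long):
        n = len(short)
        return any(long[i:i + n] == short for i in range(len(long) - n + 1))

    # Drop a subphrase when a longer repeated phrase contains it
    # and occurs exactly as often.
    return {
        p: c for p, c in repeated.items()
        if not any(len(q) > len(p) and cq == c and contained_in(p, q)
                   for q, cq in repeated.items())
    }


text = "to be or not to be or not at all".split()
print(len(repeated_phrases(text)))                 # all phrases and subphrases
print(repeated_phrases(text, maximal_only=True))   # maximal phrases only
```

In this sample, "to be", "be or", "or not", "to be or" and "be or not" each occur twice, but so does the longer "to be or not", so only the longer phrase survives in the maximal list. Phrases longer than max_len words are not seen by this sketch.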

Collgen also can produce a list of pairs of words that co-occur within a user-specified word span, in any order. A numerical value is associated with each pair signifying the statistical likelihood of the words co-occurring in such a fashion.

The size of the co-occurrence output file can be reduced by use of one of two optional input files. The user can supply an .INC (include) or .XCL (exclude) file, consisting of a list of words to be included in or excluded from the output of co-occurring word pairs. For example, one could provide an .XCL file consisting of prepositions and other function words; the output would then exclude any pair in which one or both words are function words. Output can be space-delimited or tab-delimited.
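A minimal Python sketch of span-based co-occurrence with an exclusion list follows. The names cooccurring_pairs, span and exclude are hypothetical, and the sketch reports raw counts only; it does not reproduce the statistical value that Collgen attaches to each pair.

```python
from collections import Counter


def cooccurring_pairs(words, span=3, exclude=()):
    """Count unordered word pairs found within `span` words of each other."""
    exclude = set(exclude)
    pairs = Counter()
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + span + 1, len(words))):
            v = words[j]
            # Skip a word paired with itself, and any pair that
            # contains at least one excluded word.
            if w == v or w in exclude or v in exclude:
                continue
            pairs[tuple(sorted((w, v)))] += 1
    return pairs


words = "the cat sat on the mat".split()
print(cooccurring_pairs(words, span=2, exclude={"the", "on"}))
```

Sorting each pair before counting makes the order of the two words irrelevant, which matches the "in any order" behaviour described above.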

Buildbat: now uses the familiar panel interface. You may specify whether to use DEFAULT.MKS or any other .MKS file. The name of the batch file created by Buildbat is the same as the input .LST file but with a .BAT suffix.

TACT Version 2.1.4

In addition to the existing applications, Version 2.1.4 contains 11 new programs:

ANAGRAMS: a new program that produces a list of partial or complete anagrams for a given database.
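For readers unfamiliar with the idea, complete anagrams can be found by grouping words that share the same sorted letters. The Python sketch below shows that general technique only; the function name anagram_sets is hypothetical, partial anagrams are not covered, and nothing here is taken from the ANAGRAMS program itself.

```python
from collections import defaultdict


def anagram_sets(words):
    """Group words that are complete anagrams of one another."""
    groups = defaultdict(set)
    for w in words:
        w = w.lower()
        # Words with identical sorted letters are anagrams.
        groups["".join(sorted(w))].add(w)
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


vocabulary = ["listen", "silent", "enlist", "stone", "notes", "tact"]
print(anagram_sets(vocabulary))
```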

TACTFREQ: a new program that produces a list of all words that occur in a given database, with their frequency of occurrence in one of three different orders (alphabetical, reverse alphabetical, and descending frequency).
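The three orderings can be sketched in Python as follows. The function name frequency_lists is hypothetical, and "reverse alphabetical" is read here as sorting by word endings (each word spelled backwards), which is an assumption; if TACTFREQ instead means Z-to-A order, the second list would simply be the first list reversed.

```python
from collections import Counter


def frequency_lists(words):
    """Return (word, count) lists in three orders."""
    counts = Counter(w.lower() for w in words)
    alphabetical = sorted(counts.items())
    # Assumed reading of "reverse alphabetical": sort on reversed spelling.
    by_ending = sorted(counts.items(), key=lambda kv: kv[0][::-1])
    # Most frequent first; ties are broken alphabetically.
    by_frequency = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return alphabetical, by_ending, by_frequency


words = "the cat and the dog and the bird".split()
for listing in frequency_lists(words):
    print(listing)
```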

TACTSTAT: a new program that produces type-token statistics for a given database.
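As a rough illustration of what type-token statistics measure, the Python sketch below counts tokens (running words), types (distinct words), their ratio, and the number of tokens at each word length. The function name type_token_stats and the choice of figures are assumptions; the actual TACTSTAT report may contain different or additional values.

```python
from collections import Counter


def type_token_stats(words):
    """Basic type-token figures for a list of words."""
    counts = Counter(w.lower() for w in words)
    tokens = sum(counts.values())
    types = len(counts)
    by_length = Counter()
    for word, n in counts.items():
        by_length[len(word)] += n  # tokens at each word length
    return {
        "tokens": tokens,
        "types": types,
        "ratio": types / tokens if tokens else 0.0,
        "tokens_by_length": dict(by_length),
    }


print(type_token_stats("the cat and the dog".split()))
```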

PREPROC: a new program that produces a set of output files relating to an input source text. The first output file is a list of distinct words. The second file is a copy of the input file with all tags, non-retained diacritics, and continuation characters removed. The third output file is a listing of all lines with tags and continuation characters in them. At the end of this file is an alphabetical listing of all reference tags found in the text.

MAKEDCT: a new program that is used to build dictionaries. The input file is a list of distinct words produced by PREPROC. This list is compared to two optional existing dictionaries, to produce a dictionary for the given input file. This dictionary contains surface, lemmatized, and two other forms for each word as well as part-of-speech information.

TAGTEXT: a new program that supplements or replaces the word-forms in a text with the fields from that text's dictionary or ".DCT" file, which Makedct previously generated for you. It can add up to two tags for each word in the text.

SATDCT: a new program that will generate a satellite dictionary for a given tagged text. The dictionary will consist of an alphabetical list of distinct forms, in user-selectable order, along with the number of occurrences of each distinct form.

FCOMPARE: a new program that compares two ASCII files, separating similar and dissimilar lines or fields. The user may choose to compare whole lines, or a particular tab delimited field within the input lines.
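The comparison can be pictured with a small Python sketch that splits two lists of lines into shared and unshared entries, matching either on the whole line or on one tab-delimited field. The function name compare_lines, the 0-based field number, and the three-list result are assumptions for illustration; the sketch also ignores TACT's .MKS sort order.

```python
def compare_lines(lines_a, lines_b, field=None):
    """Split lines into (shared, only in A, only in B)."""
    def key(line):
        if field is None:
            return line
        parts = line.split("\t")
        return parts[field] if field < len(parts) else ""

    keys_a = {key(line) for line in lines_a}
    keys_b = {key(line) for line in lines_b}
    shared = [line for line in lines_a if key(line) in keys_b]
    only_a = [line for line in lines_a if key(line) not in keys_b]
    only_b = [line for line in lines_b if key(line) not in keys_a]
    return shared, only_a, only_b


text_a = ["love\t12", "hate\t3", "fear\t5"]
text_b = ["love\t9", "hope\t4"]
print(compare_lines(text_a, text_b, field=0))  # match on the word field only
print(compare_lines(text_a, text_b))           # match on whole lines
```

Comparing on the first field treats "love" as shared even though its counts differ; comparing whole lines treats the two "love" lines as different.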

HSMS2TDB: a new program that produces a textual database for Usebase from an input source text that has been marked up using the tagging scheme for the Hispanic Seminary of Medieval Studies (HSMS).

Tagging

Processing a text with TACT normally begins when the researcher tags or marks up an ASCII copy of the text. The researcher first uses a text editor to insert these tags, usually within angle-bracket delimiters. This mark-up helps one to refine word selections and to provide reference citations for retrieved passages, and so in most instances it helps the researcher do analysis afterwards. TACT supports COCOA, a mark-up system based on angle-bracket -- "<" and ">" -- delimiters. For most users this means that, with a little modification, the basic SGML tags will work in TACT. Refer to the TEI Guidelines and Etext Center in-house documents for information on text tagging.
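To show how reference tags can supply citations, here is a hedged Python sketch that reads lines containing tags of the form <name value> and attaches the most recent tag values to each line of text. The tag names T and A, the sample text, and the function name read_tagged are all invented for the example, and the parsing is far simpler than what MAKEBASE does with its .MKS settings.

```python
import re

# Assumed tag shape: "<name value>", e.g. "<T Hamlet>".
TAG = re.compile(r"<(\w+)\s+([^>]*)>")


def read_tagged(text):
    """Return (reference tags in force, plain text) for each text line."""
    refs = {}
    out = []
    for line in text.splitlines():
        for name, value in TAG.findall(line):
            refs[name] = value.strip()  # a new value replaces the old one
        plain = TAG.sub("", line).strip()
        if plain:
            out.append((dict(refs), plain))
    return out


sample = "<T Hamlet>\n<A 1>\nWho's there?\n<A 2>\nNay, answer me."
for refs, line in read_tagged(sample):
    print(refs, line)
```

A tag in the middle of a line is applied to the whole of that line in this sketch, which is a simplification.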

Within TACT the researcher may also employ four programs, PREPROC, MAKEDCT, TAGTEXT, and SATDCT, to add tags to each word of the ASCII text. These include the word's lemma (the dictionary form of the word), part-of-speech, or conceptual label.

The TACT system is multilingual. In order to display foreign languages, it supports the extended ASCII character set of the IBM PC, and with other font-editing tools, its capabilities can be extended to other modern European languages, such as French, German, and Greek. (Hebrew, Arabic, Cyrillic, and languages such as Chinese are beyond its present design.) It supports multilingual analysis as well by allowing for proper alphabetization, convenient keyboard entry, and printing on devices that require special "escape codes" to produce non-ASCII characters -- even if these sequences are different from those that would be used to enter the character from the keyboard, or display it on screen.

Using TACT

Once the text is marked up, MAKEBASE converts it into a database for efficient retrieval. MAKEBASE invites the researcher to define, interactively, the alphabet and its collation sequence, special characters, and the reference tags used for markup. Use a word-processor or text editor to divide large texts into smaller files for sequential processing by a batch file you create with BUILDBAT. This batch file uses both MAKEBASE and a second program, Mergebas, to create a large textual database out of smaller ones.

After MAKEBASE creates the textual database (or .TDB file) out of the ASCII text file, a researcher may employ six programs to retrieve information from, or to analyse, that text.

Most researchers begin with USEBASE, which allows one to select a word, a group of words, or a word-pattern, and then to display it in five ways: a keyword-in-context (KWIC) concordance, a variable-context concordance, the whole text, an occurrence-distribution graph, and a table of collocates. The collocate table shows all words that co-occur with the queried word, words or word-pattern and orders those collocates by strength of association. Displays in USEBASE are linked so that, for example, the researcher can go directly from a position in a distribution graph to the text it represents. Any display may also be modified in various ways.
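A keyword-in-context display is easy to picture with a small Python sketch: each hit is shown with a few words of context on either side. The function name kwic and the width parameter are hypothetical, and the sketch works on a plain word list rather than on a .TDB database.

```python
def kwic(words, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence."""
    rows = []
    for i, w in enumerate(words):
        if w.lower() == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            rows.append((left, w, right))
    return rows


sentence = "It is a truth universally acknowledged that a single man".split()
for left, key, right in kwic(sentence, "a", width=2):
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>20}  {key}  {right}")
```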

Working with the database, USEBASE can present a complete list of words from which a subset for retrieval may be selected, one word at a time. Through what is called "regular expression" capability, the researcher may also write a query according to a pattern of characters, including "wildcards" (for example, all words beginning with the letter "a" and ending with "ed" or "ing"). Queries may also contain refinements called "selectors" that specify (a) proximity or collocation (two or more words found together within a user-specified span of words), (b) similarity (in spelling), (c) frequency of occurrence, and (d) a condition related to whether or not words or patterns have one or more tag attributes in the markup. All queries may be kept in one or more ASCII files external to the program, from which queries may be selected; thus, for example, the researcher can construct a lexicon of words and expressions in such a file.
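The wildcard example above can be expressed with Python's re module, as a sketch of the idea only. USEBASE has its own query syntax, which is not reproduced here; the pattern and the function name select_words are assumptions written in Python's regular-expression notation.

```python
import re

# Words beginning with "a" and ending with "ed" or "ing" (Python syntax).
PATTERN = re.compile(r"a\w*(?:ed|ing)", re.IGNORECASE)


def select_words(vocabulary):
    """Keep the words that match the whole pattern."""
    return [w for w in vocabulary if PATTERN.fullmatch(w)]


print(select_words(["asked", "asking", "answer", "baked", "Arriving", "a"]))
```

Using fullmatch rather than search ensures the pattern must cover the entire word, so "baked" is not selected merely because it contains "aked".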

Once a set of words has been selected by whatever means, it can be saved within USEBASE as a "group". Groups can in turn be combined to form other groups. Thus, for example, all words and expressions the researcher regards as concerning the semantic field "earth" can be saved as the group "earthgrp" and then be combined with the groups "airgrp", "firegrp" and "watergrp" to produce the group "4elementsgrp". Group names can be included within queries as easily as words, so that, for example, a researcher could ask to see all passages in which "airgrp" words occur within two lines of "firegrp" words. Groups are really collections of "locations" in a text; and so groups are specific to one text. However, they may be saved in a group index (.GIX) file for reuse. Unlike groups, queries stored in an external file are independent of any one textual database.
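Because groups are collections of locations, combining them behaves much like set arithmetic. The Python sketch below models a location as a (line number, word position) pair and shows a union of groups and a "within two lines" query; the function names make_group and near, the sample text, and the location model are all assumptions, not USEBASE's internal format.

```python
def make_group(lines, vocabulary):
    """Collect (line number, word position) for every listed word."""
    vocab = {w.lower() for w in vocabulary}
    return {
        (ln, i)
        for ln, line in enumerate(lines, start=1)
        for i, w in enumerate(line.lower().split())
        if w in vocab
    }


def near(group_a, group_b, max_lines=2):
    """Locations in group_a within max_lines lines of any location in group_b."""
    return sorted(
        loc for loc in group_a
        if any(abs(loc[0] - other[0]) <= max_lines for other in group_b)
    )


lines = ["the wind rose", "over cold stone", "and no flame burned",
         "in the hall", "", "", "a breeze came"]
airgrp = make_group(lines, ["wind", "breeze"])
firegrp = make_group(lines, ["flame"])
elements = airgrp | firegrp      # combining groups is a set union
print(near(airgrp, firegrp))     # "wind" on line 1 is within two lines of "flame"
```

Since the locations refer to positions in one particular text, a group built this way is meaningless for any other text, which mirrors the point made above about groups being text-specific.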

When creating a group from a query, the user can examine all retrieved citations in the text and choose which to include or exclude. This ability to choose by context can eliminate homographs and produce lemmatized groups.

Four other TACT programs, like USEBASE, operate off the textual database. (1) Collgen lists all repeating fixed phrases and all node-collocate pairs (two words that occur more than once near to one another in the text). (2) TACTSTAT produces type-token statistics for word-length and word-frequency. (3) TACTFREQ produces alphabetical, reverse alphabetical, and descending-frequency word-lists. (4) Anagrams discovers anagrams of words in which the user has some interest.

Fcompare compares ASCII lines (optionally consisting of one or more tab-delimited fields) from two files and outputs three files that list which lines are shared and which not. Preproc can be used to generate the word-lists intended for input to Fcompare. TACTSORT sorts the lines of an ASCII file (optionally using a tab-delimited field as the key). All three of the above programs use the TACT sort order specified by an existing .MKS file, or by the DEFAULT.MKS file.

Most TACT-system programs will output lists, tables, graphs and other displays as ASCII files that can in turn be imported into database management systems, spreadsheet programs, and wordprocessors for post-processing of many kinds.

[Most of this helpsheet is taken from the TACTREAD.ME file included in the TACT 2.1.4 release]