Tutorial 7 - String Manipulation

String manipulation forms the basis of many algorithms and utilities such as text analysis, input validation, and file conversion. This tutorial explores some of the basics. Unless otherwise noted, the following classes are contained in the java.lang package.

NOTE: For the following parameters the prefix g indicates string, i indicates integer and c indicates character types.

The String Class

String class objects work with complete strings instead of treating them as character arrays as some languages do. Note: Convert variables of type char to string objects by using gStr = Character.toString(c);.

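Accessor methods: length(), charAt(i), getBytes(), getChars(iStart, iEnd, cTarget[], iTargStart), split(gRegex), toCharArray(), substring(iStart [, iEndIndex]) [returns up to but not including iEndIndex]. The static method String.valueOf(x) converts a value of another type to a string; when a radix is needed, use Integer.toString(i, iRadix) instead.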

Modifier methods: concat(g), replace(cWhich, cReplacement), toLowerCase(), toUpperCase(), trim(). Each returns a new string; the original is left unchanged.
Note: The static method String.format(gSyntax, args...) uses C-like printf syntax for fixed fields if required in reports.

Boolean test methods: contentEquals(g), endsWith(g), equals(g), equalsIgnoreCase(g), matches(g), regionMatches(i1,g2,i3,i4), regionMatches(bIgnoreCase,i1,g2,i3,i4), startsWith(g)

Integer test methods: compareTo(g) [returns 0 if the object equals the parameter, a negative value if the object comes before the parameter in sort order, and a positive value otherwise], indexOf(g) [returns position of first occurrence of substring g in the string, -1 if not found], lastIndexOf(g) [returns position of last occurrence of substring g in the string, -1 if not found], length().
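
A few of the accessor and test methods listed above can be tried out as follows. This is only a sketch: the class name StringMethodsDemo, the helper middle() and the sample text are made up for illustration.

```java
class StringMethodsDemo {
    // Hypothetical helper: returns the text between the first and last space,
    // or "" when the string holds fewer than two spaces.
    static String middle(String s) {
        int first = s.indexOf(" ");
        int last = s.lastIndexOf(" ");
        if (first < 0 || first == last) {
            return "";
        }
        return s.substring(first + 1, last); // end index is exclusive
    }

    public static void main(String[] args) {
        String name = "John Quincy Adams";
        System.out.println(name.length());                // 17
        System.out.println(name.charAt(5));               // Q
        System.out.println(name.substring(5, 11));        // Quincy
        System.out.println(middle(name));                 // Quincy
        System.out.println(name.indexOf("zz"));           // -1
        System.out.println("apple".compareTo("cherry"));  // -2 (negative, not necessarily -1)
        System.out.println("apple".compareTo("apple"));   // 0
    }
}
```

Because compareTo() may return any negative or positive value, test its result with < 0 or > 0 rather than comparing against -1 or +1.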

String class objects are immutable (i.e. read only). When a change is made to a string, a new object is created and the old one is discarded. This causes extra garbage collection if string modifier methods are used heavily, for example inside a loop. The StringBuffer or StringBuilder class should be used instead of String objects in these cases.
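
As a rough illustration of the point about repeated modification, the following sketch builds a comma-separated list with a single StringBuilder rather than creating a new String on every pass through the loop. The class and method names (RepeatDemo, countTo) are hypothetical.

```java
class RepeatDemo {
    // Builds "1,2,...,n" by appending to one StringBuilder.
    static String countTo(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i <= n; i++) {
            if (i > 1) {
                sb.append(',');
            }
            sb.append(i);
        }
        return sb.toString(); // one String is created at the end
    }

    public static void main(String[] args) {
        System.out.println(countTo(5)); // 1,2,3,4,5
    }
}
```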

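Warning: A String variable holds a reference (memory address) to its object, so the == operator compares references rather than contents. It is true only when both variables refer to the same object, which can happen by accident because identical string literals share one object. Use equals() and equalsIgnoreCase() to compare contents. A simple example is:

String aName = "Roger";
String bName = new String("Roger"); // a separate object with the same contents
if (aName == bName) { System.out.println("== worked"); }          // not printed
if (aName.equals(bName)) { System.out.println("equals worked"); } // printed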

A short program fragment can validate the characters in a string, possibly from data entry or from a file. Regular expression techniques can also be used for validation.
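
A minimal sketch of such a character check follows. It assumes that only letters, digits and spaces are acceptable; the class name CharValidator and the method firstInvalid() are invented for this example.

```java
class CharValidator {
    // Returns the index of the first character that is not a letter,
    // digit or space, or -1 if every character is acceptable.
    static int firstInvalid(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (!Character.isLetterOrDigit(c) && c != ' ') {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstInvalid("Room 101"));  // -1
        System.out.println(firstInvalid("Room #101")); // 5
    }
}
```

Returning the position rather than a simple true/false lets the caller point the user at the offending character.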

The StringBuffer Class

StringBuffer class objects allow manipulation of strings without creating a new object each time a manipulation occurs. Examples of setting up a string buffer variable are:

StringBuffer defString = new StringBuffer();            // initial capacity of 16 chars
StringBuffer nulString = new StringBuffer(6);           // explicitly sets initial capacity
StringBuffer aString = new StringBuffer("start value"); // sets initial value

Accessor methods: capacity(), charAt(i), getChars(iSrcBeg, iSrcEnd, cTarget[], iTargetBeg), length(), substring(iStart [, iEndIndex]), toString()

Modifier methods: append(g), delete(i1, i2), deleteCharAt(i), ensureCapacity(iMinimum), insert(iPosn, g), replace(i1, i2, gValue), reverse(), setCharAt(iPosn, c), setLength(iNewLength)
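
The sketch below applies several of the modifier methods to one StringBuffer to show that the same object is changed in place. BufferDemo and demo() are hypothetical names.

```java
class BufferDemo {
    // Applies a few in-place edits to one StringBuffer and returns the result.
    static String demo() {
        StringBuffer sb = new StringBuffer("start value");
        sb.append('!');            // "start value!"
        sb.insert(0, "my ");       // "my start value!"
        sb.setCharAt(0, 'M');      // "My start value!"
        sb.replace(3, 8, "final"); // "My final value!" (end index is exclusive)
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo());       // My final value!
        StringBuffer sb = new StringBuffer(demo());
        sb.delete(2, 8);                  // removes " final"
        System.out.println(sb);           // My value!
        System.out.println(sb.reverse()); // !eulav yM
    }
}
```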

The StringBuilder Class

StringBuilder class methods are similar to StringBuffer ones but they are unsynchronized (i.e. not safe to share between threads without external locking). They are usually faster for that reason. Examples of setting up a string builder variable are:

StringBuilder defString = new StringBuilder();            // initial capacity of 16 chars
StringBuilder nulString = new StringBuilder(6);           // explicitly sets initial capacity
StringBuilder aString = new StringBuilder("start value"); // sets initial value

Accessor methods: capacity(), length(), charAt(i), getChars(iSrcBeg, iSrcEnd, cTarget[], iTargetBeg), indexOf(g), lastIndexOf(g), substring(iStart [, iEndIndex]), toString()

Modifier methods: append(g), delete(i1, i2), insert(iPosn, g), setCharAt(iPosn, c), replace(i1, i2, gValue), reverse(), trimToSize()

String Tokenizers [java.util library]

Many text manipulation utilities require a tokenizer function which parses or breaks up the text into subunits called tokens based on specific delimiters or break characters. The most common delimiter is whitespace, which yields words as the tokens. The String method split(gRegex) allows regular expressions to be used to define the delimiters. Java also provides several tokenizer classes including StringTokenizer (for strings) and StreamTokenizer and Scanner (for streams and files).

StringTokenizer class objects may be created by one of three constructors depending on the parameters used. The first parameter string is the source text to be broken at the default set of whitespace delimiters (space, tab, newline, carriage return, formfeed). If a second parameter is passed, that string is taken as the set of delimiting characters. Use the escape character \ when representing the string quote character " or any non-typeable delimiters such as tab (\t). If a true flag is added as a third parameter, any delimiters found are also returned as string tokens. The StringTokenizer methods are: int countTokens(), boolean hasMoreTokens() and String nextToken().
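
The three-parameter constructor described above can be exercised with a small helper. This is a sketch only; TokenDemo and tokens() are hypothetical names, and the tokens are collected into a list simply to make them easy to print.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TokenDemo {
    // Collects all tokens of text, split at any character found in delims.
    // When returnDelims is true the delimiters are returned as tokens too.
    static List<String> tokens(String text, String delims, boolean returnDelims) {
        StringTokenizer st = new StringTokenizer(text, delims, returnDelims);
        List<String> out = new ArrayList<>();
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("one, two\tthree", " ,\t", false)); // [one, two, three]
        System.out.println(tokens("a=b", "=", true));                  // [a, =, b]
    }
}
```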

Regular Expressions [java.util.regex library]

Regular expressions are a way to describe a set of strings based on common characteristics shared by each string in the set (i.e. by pattern matching). They can be used as a tool to search, edit or manipulate text and data. One common use is validation of data entry strings. The regular expression classes (Pattern, Matcher and PatternSyntaxException) are found in the java.util.regex package, which must be imported.

Java regular expression pattern syntax is similar to the syntax used by Perl. Good references are found at oracle.com and RegExp.info. Some simple examples of the use of regular expressions are:

Pattern p=Pattern.compile("a*b");
Matcher m=p.matcher("aaaaab");
boolean b=m.matches();

As a convenience for a one-time use situation the matches() method simplifies the syntax (but does not precompile the pattern).

boolean b=Pattern.matches("a*b", "aaaaab");
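
When the same pattern is applied to many inputs, as in data entry validation, it can be compiled once and reused. The sketch below assumes a yyyy-mm-dd date field purely as an example; DateCheck, looksLikeDate() and yearOf() are hypothetical names, and the pattern checks only the shape of the text, not whether the date exists on the calendar.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateCheck {
    // Compiled once and reused for every input.
    private static final Pattern ISO_DATE =
            Pattern.compile("(\\d{4})-(\\d{2})-(\\d{2})");

    // True when the whole string has the form dddd-dd-dd.
    static boolean looksLikeDate(String s) {
        return ISO_DATE.matcher(s).matches();
    }

    // Returns the year captured by group 1, or -1 if the text does not match.
    static int yearOf(String s) {
        Matcher m = ISO_DATE.matcher(s);
        return m.matches() ? Integer.parseInt(m.group(1)) : -1;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeDate("2014-04-04")); // true
        System.out.println(looksLikeDate("04/04/2014")); // false
        System.out.println(yearOf("2014-04-04"));        // 2014
    }
}
```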

As an aid to understanding the syntax of regular expressions and as a development tool, I strongly recommend downloading the Oracle test harness.

String Applications

Analyzing, searching and reformatting text is often required in task automation. Simple analysis could be used for word counts, concordance, spell checking, reading difficulty analysis, input validity, etc. Searching is done to find relevant terms. Reformatting of text for new file formats is a common task.

Many tasks involve parsing text using either field positions or delimiters. A useful exercise is to create reusable methods for txtAt(start,end), wordAt(start,length), txtBefore(delim) and txtAfter(delim). These methods would return null if the end of the line is reached before the delimiter is found or if the position parameters fall outside the length of the line. Each method would use some of the String class methods mentioned above.
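
One possible shape for three of these helpers is sketched below. The extra line parameter, the class name TextParts and the exact null rules are assumptions made for the example, not a prescribed design.

```java
class TextParts {
    // Text before the first occurrence of delim, or null if delim is absent.
    static String txtBefore(String line, String delim) {
        int pos = line.indexOf(delim);
        return pos < 0 ? null : line.substring(0, pos);
    }

    // Text after the first occurrence of delim, or null if delim is absent.
    static String txtAfter(String line, String delim) {
        int pos = line.indexOf(delim);
        return pos < 0 ? null : line.substring(pos + delim.length());
    }

    // Text from start up to but not including end, or null if the
    // positions do not fit inside the line.
    static String txtAt(String line, int start, int end) {
        if (start < 0 || end > line.length() || start > end) {
            return null;
        }
        return line.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println(txtBefore("name=Roger", "="));   // name
        System.out.println(txtAfter("name=Roger", "="));    // Roger
        System.out.println(txtBefore("no delimiter", "=")); // null
        System.out.println(txtAt("name=Roger", 5, 99));     // null
    }
}
```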

Simple processing such as removing unwanted characters from strings or translating them is an easy first exercise in string manipulation. cleanchr1 and cleanchr2 show two approaches to simple string manipulation.
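
To give a flavour of this kind of exercise, here are two ways of removing everything except ASCII letters and spaces. These are not the cleanchr programs themselves, only an independent sketch with hypothetical names (CleanDemo, stripLoop, stripRegex).

```java
class CleanDemo {
    // Approach 1: copy the wanted characters one at a time.
    static String stripLoop(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean letter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
            if (letter || c == ' ') {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Approach 2: delete the unwanted characters with a regular expression.
    static String stripRegex(String s) {
        return s.replaceAll("[^A-Za-z ]", "");
    }

    public static void main(String[] args) {
        System.out.println(stripLoop("Hello, World!"));  // Hello World
        System.out.println(stripRegex("Hello, World!")); // Hello World
    }
}
```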

wordValue assigns a numeric value to a word based on letter values (a = 1, etc.).
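
The idea can be sketched as follows. This is not the wordValue source; the class name LetterSum is hypothetical, and the choices to ignore case and skip non-letters are assumptions.

```java
class LetterSum {
    // Sums letter values with a = 1 ... z = 26, ignoring case and
    // skipping any character outside a-z.
    static int value(String word) {
        int total = 0;
        for (int i = 0; i < word.length(); i++) {
            char c = Character.toLowerCase(word.charAt(i));
            if (c >= 'a' && c <= 'z') {
                total += c - 'a' + 1;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(value("abc"));  // 6
        System.out.println(value("Java")); // 34
    }
}
```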

permDemo is another simple app that permutes characters in a string by recursion.
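
Recursive permutation can be sketched like this: each call picks one of the remaining characters, adds it to a prefix, and recurses on what is left. This is an independent illustration rather than the permDemo source, and PermSketch and its methods are hypothetical names. Repeated letters in the input produce repeated entries in the result.

```java
import java.util.ArrayList;
import java.util.List;

class PermSketch {
    // Returns every ordering of the characters in s.
    static List<String> permutations(String s) {
        List<String> out = new ArrayList<>();
        permute("", s, out);
        return out;
    }

    // prefix holds the characters chosen so far; rest holds those still unused.
    private static void permute(String prefix, String rest, List<String> out) {
        if (rest.isEmpty()) {
            out.add(prefix);
            return;
        }
        for (int i = 0; i < rest.length(); i++) {
            permute(prefix + rest.charAt(i),
                    rest.substring(0, i) + rest.substring(i + 1),
                    out);
        }
    }

    public static void main(String[] args) {
        System.out.println(permutations("abc")); // [abc, acb, bac, bca, cab, cba]
    }
}
```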

Projects

Note: The following projects in this tutorial will be reused several times as other important topics are introduced. For now the workspace for each project will require hard coding of several lines of data into a string array. Each of the projects uses the MVC design concept to separate working code (model) from display interface code (view).

Project: Word Counting

Word frequency counting utilities use 'parsing' or tokenization to analyze text into counts of each distinct word unit. This has many practical applications in spell checking, document difficulty analysis, cryptology and file compression.

Set up a getData() method that reads data from an array of strings. This method will be replaced once file i/o has been covered. Use the string tokenizer example. Rework it to call getData() for the text. Add a second parameter to the tokenizer that explicitly specifies all required delimiters (include punctuation but not the hyphen or apostrophe). Convert all parsed words to lowercase so that counting is not case sensitive. Add a linear search for the word in a word array prior to adding it. If found, increment its count in a parallel count array; otherwise add it to the list with a count of 1.
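
The steps above might be sketched as follows. This is not wordCount1; the class name WordTally, the sample sentences, the delimiter set and the fixed table size are all assumptions chosen for illustration.

```java
import java.util.Locale;
import java.util.StringTokenizer;

class WordTally {
    private static final int MAX_WORDS = 1000;  // assumed table size
    private final String[] words = new String[MAX_WORDS];
    private final int[] counts = new int[MAX_WORDS]; // parallel to words
    private int size = 0;

    // Stand-in for file input: hard-coded sample lines.
    static String[] getData() {
        return new String[] {
            "The cat saw the dog.",
            "The dog didn't see the well-fed cat!"
        };
    }

    // Linear search; bump the count if found, otherwise append with count 1.
    void add(String word) {
        for (int i = 0; i < size; i++) {
            if (words[i].equals(word)) {
                counts[i]++;
                return;
            }
        }
        if (size == MAX_WORDS) {
            throw new IllegalStateException("word table full");
        }
        words[size] = word;
        counts[size] = 1;
        size++;
    }

    // Tokenizes each line; hyphen and apostrophe are not delimiters.
    void tally(String[] lines) {
        for (String line : lines) {
            StringTokenizer st = new StringTokenizer(line, " \t\n\r\f.,;:!?\"()");
            while (st.hasMoreTokens()) {
                add(st.nextToken().toLowerCase(Locale.ROOT));
            }
        }
    }

    int countOf(String word) {
        for (int i = 0; i < size; i++) {
            if (words[i].equals(word)) {
                return counts[i];
            }
        }
        return 0;
    }

    int uniqueWords() {
        return size;
    }

    public static void main(String[] args) {
        WordTally t = new WordTally();
        t.tally(getData());
        System.out.println(t.countOf("the"));  // 4
        System.out.println(t.uniqueWords());   // 7
    }
}
```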

The report should display each word, its word count and word length. Use a fixed field format that allows sorting of the report by another utility. A suggestion is ##### ## wordstring. Review the DecimalFormat class for formatting specifics. Include a footer with the report creation date formatted using the DateFormat class and a total of the number of unique words. Format the trailer so that it falls at one end of the report after any sort.
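
One way to produce such lines is sketched below. It assumes zero-padded fields so that a text sort orders the counts numerically, and it assumes the sorting utility compares in ASCII order so that a leading ~ pushes the trailer to the end. ReportLines, detail() and trailer() are hypothetical names.

```java
import java.text.DateFormat;
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Date;
import java.util.Locale;

class ReportLines {
    // Fixed-width, zero-padded fields: 5 digits for the count, 2 for the length.
    private static final DecimalFormat COUNT =
            new DecimalFormat("00000", DecimalFormatSymbols.getInstance(Locale.ROOT));
    private static final DecimalFormat LENGTH =
            new DecimalFormat("00", DecimalFormatSymbols.getInstance(Locale.ROOT));

    // One detail line: count, word length, then the word itself.
    static String detail(String word, int count) {
        return COUNT.format(count) + " " + LENGTH.format(word.length()) + " " + word;
    }

    // Trailer line; the leading '~' is an assumed marker that sorts after digits.
    static String trailer(int uniqueWords, Date created) {
        String date = DateFormat.getDateInstance(DateFormat.MEDIUM).format(created);
        return "~ " + uniqueWords + " unique words, report created " + date;
    }

    public static void main(String[] args) {
        System.out.println(detail("the", 4));       // 00004 03 the
        System.out.println(detail("well-fed", 1));  // 00001 08 well-fed
        System.out.println(trailer(7, new Date())); // date text depends on locale
    }
}
```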

wordCount1 gives you a good start on the project. It will be extended with a text file io class and a GUI to produce a word count reporter. After file IO has been covered, collections can be used to store the word lists so that array size is not an issue.

Note: wordCount and wcPlus are not like the Unix utility wc, which simply counts the total number of words in a text.

Project: HTML Analysis

HTML Analysis uses pattern recognition to identify elements, tags and attributes within text documents. Because tokenization is awkward for HTML, it is easier to build an ad-hoc pattern recognizer to spot both tag and attribute formats. This takes care of the <p class="x"> or <hr/> forms of tags. Regular expression classes can be used to make the pattern recognizer very compact.

Count the number of each type of tag. Check for matched tags (and nesting if you can). Output each attribute and its value. Take special care to watch for spaces around the = in attribute assignments and for tags wrapped over multiple lines. XCheck1 gives you a start on the project. It will be extended with file io. A later case study adds a GUI. Sleuth, a utility to validate internal links, also uses the ad-hoc parser.
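
A compact regular expression recognizer for the counting and attribute steps might look like the sketch below. It is not XCheck1; TagScan and its patterns are assumptions, and the patterns are deliberately simple: they handle only double-quoted attribute values, and a > inside an attribute value will confuse them.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagScan {
    // Group 1: "/" for a closing tag; group 2: tag name; group 3: the rest.
    // [^>]* also matches newlines, so tags wrapped over lines are found.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)([^>]*)>");
    // Attribute name, optional spaces around '=', then a double-quoted value.
    private static final Pattern ATTR =
            Pattern.compile("([A-Za-z][A-Za-z0-9-]*)\\s*=\\s*\"([^\"]*)\"");

    // Counts opening (and self-closing) tags by lowercased name.
    static Map<String, Integer> countOpeningTags(String html) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            if (m.group(1).isEmpty()) {
                counts.merge(m.group(2).toLowerCase(Locale.ROOT), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Extracts attribute names and values from the text of one tag.
    static Map<String, String> attributes(String tagText) {
        Map<String, String> attrs = new TreeMap<>();
        Matcher m = ATTR.matcher(tagText);
        while (m.find()) {
            attrs.put(m.group(1), m.group(2));
        }
        return attrs;
    }

    public static void main(String[] args) {
        String html = "<p class = \"x\">Hi<hr/>\n<p\n id=\"a\">there</p>";
        System.out.println(countOpeningTags(html));                   // {hr=1, p=2}
        System.out.println(attributes("<p class = \"x\" id=\"a\">")); // {class=x, id=a}
    }
}
```

Checking for matched tags and nesting would need additional bookkeeping, such as a stack of open tag names, which this sketch leaves out.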

Note: Those readers who have special interests in HTML applications may wish to explore the htmlparser project. Others may want to see how the Swing HTML parser in JEditorPane can be accessed.

Tutorial Source Code

Obtain source for cleanchr, permDemo, RegexTestHarness, SeeMethods, wordValue, wordCount1, XCheck1, etc. here.



JR's HomePage | Comments [jatutor7.htm:2014 04 04]