JR's HomePage |
Java Web Focused Projects
This page offers some useful web development focused projects that I have developed or modified. Most have source code that has been compiled, tested and documented. Feel free to download and adapt the code to your own needs. Feedback on usefulness, bugs and suggested enhancements is appreciated.
Some projects remain undeveloped and are marked as pending. If you are interested in obtaining the current code for any of the pending projects, looking at what I am doing or pushing me to complete a specific project, please use the button at the bottom of the page. You can also check my general projects page for more stuff.
Crawler Tools
Crawler is a generic bot that scans a local folder, gathering data from XHTML/CSS documents for specific user needs. It can be adapted to various data collection uses such as meta elements (for search engines), links, suspicious elements and attributes or specific text. Web-based bots or spiders can also be used as site-rippers. Note: The current version of Crawler does not crawl through script files, included style files, subfolders or external documents.
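As a rough illustration of the kind of scan described above, here is a minimal sketch that collects meta elements from the HTML files in a single folder. It is not Crawler's actual code: the class name `MetaScan`, its method names, the UTF-8 encoding and the assumption that attributes are double-quoted with name before content are all hypothetical choices made for this example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the real Crawler class.
final class MetaScan {
    // Assumes double-quoted attributes with name before content.
    private static final Pattern META = Pattern.compile(
            "<meta\\s+name=\"([^\"]*)\"\\s+content=\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    /** Returns meta name (lower-cased) -> content for one document. */
    static Map<String, String> metaOf(String html) {
        Map<String, String> found = new LinkedHashMap<>();
        Matcher m = META.matcher(html);
        while (m.find()) {
            found.put(m.group(1).toLowerCase(Locale.ROOT), m.group(2));
        }
        return found;
    }

    /** Scans the HTML files directly inside one folder; subfolders are not visited. */
    static Map<String, Map<String, String>> scanFolder(Path folder) throws IOException {
        Map<String, Map<String, String>> report = new LinkedHashMap<>();
        try (DirectoryStream<Path> pages = Files.newDirectoryStream(folder, "*.{htm,html}")) {
            for (Path page : pages) {
                if (!Files.isRegularFile(page)) {
                    continue;
                }
                // Assumption: pages are UTF-8 encoded.
                String html = new String(Files.readAllBytes(page), StandardCharsets.UTF_8);
                report.put(page.getFileName().toString(), metaOf(html));
            }
        }
        return report;
    }
}
```

Calling `MetaScan.scanFolder(Paths.get("site"))` would return one entry per page, each holding that page's meta names and contents.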
Mapper uses the crawler to gather all resource files into a report. It can be used to create a sitemap document or to find obsolete files hanging around in subfolders. Note: Mapper does not yet find URLs inside CSS url(...) values!
Sleuth uses the generic crawler to check local target (i.e. anchor) elements for dead links. It scans documents, making lists of both links and targets (including id= ones) on a page. Once all linked pages have been analyzed, matches are made and unresolved links are flagged. A second report lists duplicated targets, an error that is rarely spotted by link checkers or even by HTML validity checkers! Not checking local id= links and duplicate targets is a major failing of Xenu, which is otherwise a fantastic external link checker.
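The two reports can be sketched roughly as follows. This is a hypothetical illustration, not Sleuth's source: the class `LinkSleuth`, the in-memory map of page name to page text and the regular expressions (which assume double-quoted attributes) are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a local link and target check.
final class LinkSleuth {
    private static final Pattern ID = Pattern.compile("\\sid=\"([^\"]+)\"");
    private static final Pattern NAME = Pattern.compile("<a\\s[^>]*?\\bname=\"([^\"]+)\"");
    // Group 1 is the local file (empty for same-page links), group 2 the fragment.
    // Excluding ':' from group 1 skips external URLs such as http://...
    private static final Pattern LINK = Pattern.compile("\\shref=\"([^\"#:]*)#([^\"]+)\"");

    /** Lists every id= and <a name=> target on a page, duplicates included. */
    static List<String> targetsOf(String html) {
        List<String> targets = new ArrayList<>();
        for (Pattern p : new Pattern[] {ID, NAME}) {
            Matcher m = p.matcher(html);
            while (m.find()) {
                targets.add(m.group(1));
            }
        }
        return targets;
    }

    /** Reports "page#target" for each target defined more than once on a page. */
    static List<String> duplicateTargets(Map<String, String> pages) {
        List<String> report = new ArrayList<>();
        for (Map.Entry<String, String> page : pages.entrySet()) {
            Set<String> seen = new HashSet<>();
            for (String t : targetsOf(page.getValue())) {
                if (!seen.add(t)) {
                    report.add(page.getKey() + "#" + t);
                }
            }
        }
        return report;
    }

    /** Reports "page -> dest#target" for each local link with no matching target. */
    static List<String> deadLinks(Map<String, String> pages) {
        Map<String, Set<String>> targets = new LinkedHashMap<>();
        for (Map.Entry<String, String> page : pages.entrySet()) {
            targets.put(page.getKey(), new HashSet<>(targetsOf(page.getValue())));
        }
        List<String> report = new ArrayList<>();
        for (Map.Entry<String, String> page : pages.entrySet()) {
            Matcher m = LINK.matcher(page.getValue());
            while (m.find()) {
                String dest = m.group(1).isEmpty() ? page.getKey() : m.group(1);
                Set<String> known = targets.get(dest);
                if (known == null || !known.contains(m.group(2))) {
                    report.add(page.getKey() + " -> " + dest + "#" + m.group(2));
                }
            }
        }
        return report;
    }
}
```

Targets are collected for every page before any link is judged, which mirrors the "once all linked pages have been analyzed" ordering described above.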
Tojsdb uses the crawler to walk through a site, extract data from meta 'description' and 'keyword' elements and build a database that is in the Tipue search engine format. Other formats can be added on user request. View the Site Searcher project to see how you can add a client-side search engine to your site.
Spif uses the crawler to walk through a site, looking for deprecated elements and attributes. Spif also spots areas where style rules should be used instead of older attribute techniques. spif_el.txt and spif_at.txt are user adaptable files. You should have Spif on hand if you do webpage design.
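A check of this kind might look like the sketch below. The class `DeprecatedScan` and its small hard-coded lists are assumptions for illustration only; the real Spif reads its lists from spif_el.txt and spif_at.txt, whose format is not shown here. The element list is the set deprecated in HTML 4.01, and the attribute list is a deliberately short subset.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; the lists below stand in for spif_el.txt and spif_at.txt.
final class DeprecatedScan {
    static final Set<String> ELEMENTS = new HashSet<>(Arrays.asList(
            "applet", "basefont", "center", "dir", "font",
            "isindex", "menu", "s", "strike", "u"));
    // Illustrative subset of presentational attributes deprecated in HTML 4.01.
    static final Set<String> ATTRIBUTES = new HashSet<>(Arrays.asList(
            "bgcolor", "background", "vlink", "alink"));

    // Opening tags only: group 1 is the element name, group 2 the attribute text.
    private static final Pattern TAG = Pattern.compile("<([a-zA-Z][a-zA-Z0-9]*)([^>]*)>");
    private static final Pattern ATTR = Pattern.compile("\\s([a-zA-Z-]+)\\s*=");

    /** Returns one finding per deprecated element or attribute, in document order. */
    static List<String> scan(String html) {
        List<String> findings = new ArrayList<>();
        Matcher tag = TAG.matcher(html);
        while (tag.find()) {
            String element = tag.group(1).toLowerCase(Locale.ROOT);
            if (ELEMENTS.contains(element)) {
                findings.add("element " + element);
            }
            Matcher attr = ATTR.matcher(tag.group(2));
            while (attr.find()) {
                String name = attr.group(1).toLowerCase(Locale.ROOT);
                if (ATTRIBUTES.contains(name)) {
                    findings.add("attribute " + name + " on " + element);
                }
            }
        }
        return findings;
    }
}
```

Each finding marks a spot where a style rule would normally replace the older markup.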
cssSleuth uses the crawler to check all classed elements to make sure that the class has been previously defined in a stylesheet or in the local page's style element. pending
Document Includers
DocAdd scans HTML documents and adds blocks of text from files at predefined insertion points (i.e. no markers are required!). doctype.txt overwrites all text up to the html element. head.txt allows superseding any head element (such as title or meta). header.txt inserts text at the top of the displayed document and footer.txt inserts text at the end of the document. This utility overcomes the lack of such block inclusion mechanisms in some of the major HTML report generators. A simple search is made for specific beginning or ending tags or strings. The relevant file is then included if it exists. Filename globs (* and ?) can be used.
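The marker-free insertion points can be sketched on plain strings as follows. This is an assumed reconstruction, not DocAdd's code: the class `DocAddSketch` and its `apply` method are hypothetical, a `null` argument stands for a text file that does not exist, and the head.txt handling is left out.

```java
// Hypothetical sketch of marker-free block inclusion.
final class DocAddSketch {
    /**
     * doctype replaces everything before the html element, header goes right
     * after the opening body tag and footer right before the closing body tag.
     * A null block means "file not present" and is skipped.
     */
    static String apply(String html, String doctype, String header, String footer) {
        String out = html;
        if (doctype != null) {
            int htmlTag = indexOf(out, "<html");
            if (htmlTag >= 0) {
                out = doctype + out.substring(htmlTag);
            }
        }
        if (header != null) {
            int body = indexOf(out, "<body");
            int close = body < 0 ? -1 : out.indexOf('>', body);
            if (close >= 0) {
                out = out.substring(0, close + 1) + header + out.substring(close + 1);
            }
        }
        if (footer != null) {
            int end = indexOf(out, "</body>");
            if (end >= 0) {
                out = out.substring(0, end) + footer + out.substring(end);
            }
        }
        return out;
    }

    // Case-insensitive search for the first occurrence of needle.
    private static int indexOf(String s, String needle) {
        for (int i = 0; i + needle.length() <= s.length(); i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }
}
```

Because the search keys on the tags themselves, the generated reports need no preparation before the blocks are added.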
Insert scans documents and inserts blocks of text from a file based on user specified marker strings. The markers are not touched. zInsert.txt is the text block for inclusion and/or overwriting. zInsert.mrk specifies both the beginning and the ending marker lines. Source document marker lines must contain only these strings (and possibly whitespace) to be flagged! The first pass through the document checks for 'end missing', 'not touched' and 'queued file' depending on marker matching activity. It then places any queued file in a 'backup' subfolder. The second pass inserts the relevant text in any queued file. Switches are check (markers) and debug (messages). Filename globs (* and ?) can be used. Lessons learned from the active use of Insert are embedded in Insert2:
- Simplicity: A separate marker file was unneeded. Insert2 reserves the first two lines of the merged file for markers.
- Generality: Insert2 can insert blocks from any file (i.e. the source name is not hard coded). This allows several blocks to be inserted into the same folder (for example, both headings and a menu structure).
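The marker matching used by Insert can be sketched as a check pass followed by an insert pass. Everything named here (`MarkerInsert`, `Status`, the method names) is hypothetical, only the first marker pair in a document is handled, and the backup step is omitted.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of marker-based block insertion.
final class MarkerInsert {
    enum Status { QUEUED, NOT_TOUCHED, END_MISSING }

    /** First pass: decide whether the document has a usable marker pair. */
    static Status check(List<String> lines, String begin, String end) {
        int b = indexOfMarker(lines, begin, 0);
        if (b < 0) {
            return Status.NOT_TOUCHED;
        }
        return indexOfMarker(lines, end, b + 1) < 0 ? Status.END_MISSING : Status.QUEUED;
    }

    /** Second pass: replace the text between the markers; the markers stay put. */
    static List<String> insert(List<String> lines, String begin, String end, List<String> block) {
        if (check(lines, begin, end) != Status.QUEUED) {
            return lines;
        }
        int b = indexOfMarker(lines, begin, 0);
        int e = indexOfMarker(lines, end, b + 1);
        List<String> out = new ArrayList<>(lines.subList(0, b + 1));
        out.addAll(block);
        out.addAll(lines.subList(e, lines.size()));
        return out;
    }

    // A marker line may contain only the marker string plus surrounding whitespace.
    private static int indexOfMarker(List<String> lines, String marker, int from) {
        for (int i = from; i < lines.size(); i++) {
            if (lines.get(i).trim().equals(marker)) {
                return i;
            }
        }
        return -1;
    }
}
```

Running the check pass over every file before changing any of them is what makes it safe to stop on an 'end missing' result without leaving half-edited documents behind.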
You may find that the unix-like include utility is more appropriate for your site. It uses a single marker that is more flexible but requires a source/destination split of folders.
Survey Form
Survey.java allows selecting webpage survey form options using radio buttons and then displaying the current status of the survey using a bar graph. The survey results are stored in an object stream file.
Survey uses a simple bar graph class called Plot. Plot can be examined as an example of one possible way of implementing horizontal bar graphs with variable length filled rectangles. One extension to this project that would give a better look is to add bar shadowing by redrawing with x&y offsets and either a black or lighter shade of the bar color. Other obvious extensions are to generalize the data entry mechanism and to add a vertical drawing option.
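The shadowing extension could be sketched as below: each bar is drawn twice, first as a black rectangle offset in x and y, then in the bar color on top. This is not the Plot class itself; `BarPlotSketch`, its constants and the choice to render into an off-screen `BufferedImage` are assumptions made so the example stands alone.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

// Hypothetical sketch of horizontal bars with a drop shadow.
final class BarPlotSketch {
    static final int BAR_HEIGHT = 20;
    static final int GAP = 10;
    static final int SHADOW = 3; // shadow offset in pixels, both x and y

    /** Scales a count to a pixel width so the largest count fills maxWidth. */
    static int barWidth(int count, int maxCount, int maxWidth) {
        return maxCount == 0 ? 0 : (int) Math.round((double) count * maxWidth / maxCount);
    }

    /** Draws one horizontal bar per count on a white background. */
    static BufferedImage render(int[] counts, int maxWidth, Color barColor) {
        int max = 0;
        for (int c : counts) {
            max = Math.max(max, c);
        }
        int height = counts.length * (BAR_HEIGHT + GAP) + GAP;
        BufferedImage img = new BufferedImage(
                maxWidth + 2 * GAP + SHADOW, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = img.createGraphics();
        g.setColor(Color.WHITE);
        g.fillRect(0, 0, img.getWidth(), img.getHeight());
        for (int i = 0; i < counts.length; i++) {
            int y = GAP + i * (BAR_HEIGHT + GAP);
            int w = barWidth(counts[i], max, maxWidth);
            g.setColor(Color.BLACK); // shadow first, shifted right and down
            g.fillRect(GAP + SHADOW, y + SHADOW, w, BAR_HEIGHT);
            g.setColor(barColor); // the bar itself covers most of the shadow
            g.fillRect(GAP, y, w, BAR_HEIGHT);
        }
        g.dispose();
        return img;
    }
}
```

Swapping `Color.BLACK` for a lighter shade of the bar color gives the softer variant mentioned above.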
For those looking for graphs and charts with a professional finish, check out JFreeChart.
HTML Generators
HTML generators come in two distinct forms. Some generate documents from a database maintained with a highly specialized editor (such as a genealogy program). Others generate from specially formatted raw text files. Some features to expect in a quality generator are:
- A !doctype element that shows the standard the generator conforms to.
- Code that checks out ok with the w3.org validation service.
- No blank URLs in links as some browsers (read MSIE) react strangely. A good workaround is href="#".
- Substitution of the problem characters &, ", < and > with character entities.
- Inclusion of alt="[url_reference]" in all img elements.
- Provision for inclusion of blocks of text in the head element as well as headers and footers to the body contents.
One special situation is generating an HTML document while retaining all column alignment spacing including original widths without any new wraparounds in the HTML document. Normally this could be accomplished with the pre element tags. But some browsers do not interpret the pre element correctly. And if the data happens to have < or > symbols they will cause interpreter problems. T2H uses copyline.java, the break element to terminate lines and entities for problem characters. The code isn't pretty but it is fully HTML 4.01 compliant and allows sending report forms to be read by users with browser only capabilities.
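The general idea can be sketched as follows. This is not copyline.java: the class `TextToHtmlLines` is hypothetical, and two details go beyond what is stated above and are assumptions of this sketch, namely turning each space into a non-breaking space entity so that runs of spaces survive and lines do not wrap, and wrapping the output in a tt element to get a fixed-width font.

```java
import java.util.List;

// Hypothetical sketch of alignment-preserving text to HTML conversion.
final class TextToHtmlLines {
    /** Replaces problem characters, and (by assumption) spaces, with entities. */
    static String escape(String line) {
        StringBuilder sb = new StringBuilder();
        for (char c : line.toCharArray()) {
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '"': sb.append("&quot;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case ' ': sb.append("&nbsp;"); break; // keeps column spacing
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Ends every source line with a break element instead of relying on pre. */
    static String convert(List<String> lines) {
        StringBuilder out = new StringBuilder("<p><tt>\n");
        for (String line : lines) {
            out.append(escape(line)).append("<br>\n");
        }
        return out.append("</tt></p>").toString();
    }
}
```

The ampersand needs no special ordering here because each input character is translated exactly once.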
Another special situation when generating an HTML file from a text file is the need to convert a column aligned file that is a set of web addresses into a valid HTML list file with hyperlinks. This Java utility is pending but the BASIC utility 2HTML is still available. A specification file is used to define appropriate field zones as well as the input file to be converted. Optional preamble and postamble files can be specified to add HTML code to the beginning and end of the data file output. You can also use this utility to strip columns from an ASCII file by requesting HTML=no and using the remarks field only. The 2HTML zipped file contains: the executable, specification, preamble, postamble and sample input files. After unzipping, run 2HTML /HELP to view the HELP screen. You may wish to check Model Railroad Layouts for a sample of the conversion. Download 2HTML.ZIP [FREEWARE - 63kb].
Pending Projects
Text files can be analyzed for duplication, word counts, banned or special words, grammar, structure style and various linguistic studies. Some of my previous work has been with cipher analysis. An interesting area for researchers is analyzing books written over a long span by an author looking for hints of Alzheimer's disease. Lack of word variety, phrase repetition and the use of indefinite words such as thing, anything or something can be indicators. Tool constructors can seek part-time employment with university Humanities departments here! pending
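Two of the indicators mentioned above, word variety and indefinite word use, can be measured with very little code. The sketch below is only a starting point under stated assumptions: the class `WordVariety` is hypothetical, variety is approximated by the ratio of distinct words to total words, and only unaccented English letters are treated as word characters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of two simple text measures.
final class WordVariety {
    static final Set<String> INDEFINITE =
            new HashSet<>(Arrays.asList("thing", "anything", "something"));

    /** Splits text into lower-cased words; anything outside a-z is a separator. */
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!w.isEmpty()) {
                out.add(w);
            }
        }
        return out;
    }

    /** Distinct words divided by total words; lower values mean less variety. */
    static double typeTokenRatio(String text) {
        List<String> all = words(text);
        return all.isEmpty() ? 0.0 : (double) new HashSet<>(all).size() / all.size();
    }

    /** Counts occurrences of the indefinite words listed above. */
    static int indefiniteCount(String text) {
        int n = 0;
        for (String w : words(text)) {
            if (INDEFINITE.contains(w)) {
                n++;
            }
        }
        return n;
    }
}
```

Comparing these numbers across books of similar length is more meaningful than comparing raw values, since the ratio naturally falls as a text gets longer.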
Duplicated files waste file space. They can also cause update problems! An easy project is to use a bot or crawler to scan a disk, looking for files with same/similar names or with similar file sizes. These can be further scanned with checksum results. pending
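One possible shape for such a scan is sketched below: files are first grouped by exact size, and only groups with more than one member are then compared by checksum. The class `DuplicateFinder` is hypothetical, SHA-256 is simply one reasonable checksum choice, and name similarity is not considered. Files are read fully into memory, so this suits modest file sizes only.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch of a size-then-checksum duplicate scan.
final class DuplicateFinder {
    /** Returns groups of files under root whose contents are identical. */
    static List<List<Path>> find(Path root) throws IOException, NoSuchAlgorithmException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        // Cheap first pass: only files of equal size can be duplicates.
        Map<Long, List<Path>> bySize = new HashMap<>();
        for (Path p : files) {
            bySize.computeIfAbsent(Files.size(p), k -> new ArrayList<>()).add(p);
        }
        // Expensive second pass: checksum only the candidates.
        List<List<Path>> groups = new ArrayList<>();
        for (List<Path> sameSize : bySize.values()) {
            if (sameSize.size() < 2) {
                continue;
            }
            Map<String, List<Path>> byHash = new HashMap<>();
            for (Path p : sameSize) {
                byHash.computeIfAbsent(sha256(p), k -> new ArrayList<>()).add(p);
            }
            for (List<Path> group : byHash.values()) {
                if (group.size() > 1) {
                    groups.add(group);
                }
            }
        }
        return groups;
    }

    /** Hex-encoded SHA-256 digest of a file's bytes. */
    static String sha256(Path p) throws IOException, NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(Files.readAllBytes(p));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

Grouping by size first means most files on a typical disk are never read at all.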
A suggested project is to remove or flag specific words in a source file using various dictionaries. Output would be an altered file and a file with the removed/flagged words. Use parsing, file io and collections for ideas. pending
- foul language dictionary
  - Words replaced with xxxx
  - Example: She was so damn annoying. --> She was so xxxx annoying.
- common names dictionary
  - ** put in front of names
  - Example: John was a good boss. --> **John was a good boss.
- job specific dictionary
  - A dictionary that changes with each job.
  - Words replaced with xxxx. Same as #2.
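The dictionary behaviour listed above can be sketched with a single pass over the words of the text. The class `WordFlagger` and its signature are hypothetical; the banned dictionary is matched without regard to case, the names dictionary with exact case, and the flagged words are collected in a list that would become the second output file.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of dictionary-driven word flagging.
final class WordFlagger {
    private static final Pattern WORD = Pattern.compile("[A-Za-z']+");

    /**
     * Replaces banned words with xxxx and prefixes names with **.
     * Every word that was replaced or flagged is appended to removed.
     */
    static String flag(String text, Set<String> banned, Set<String> names, List<String> removed) {
        Matcher m = WORD.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String word = m.group();
            String replacement = word;
            if (banned.contains(word.toLowerCase(Locale.ROOT))) {
                replacement = "xxxx";
                removed.add(word);
            } else if (names.contains(word)) {
                replacement = "**" + word;
                removed.add(word);
            }
            // quoteReplacement keeps any literal $ or \ in the replacement intact.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

A job specific dictionary would simply be another set passed in as `banned` for that run.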
A crossword puzzle maker builds a word puzzle grid from a spreadsheet or database of entries (words) and clues (definitions). The design should be in standard American format consisting of a square grid (not free format), all letters checked (i.e. used in both directions), 180° symmetry and white squares orthogonally contiguous. Many solutions already exist for free format puzzles but the format found in papers leads to much more interesting algorithms. Start by creating a GUI for data entry. Next do some basic checks like letter counts (no entry can exceed puzzle width, total letters must be less than width squared) and letter checked (every letter must be in at least two entries). Note: More hints to come. pending
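Three of the format rules named above (180° symmetry, all letters checked, white squares orthogonally contiguous) can be tested on a candidate grid before any words are placed. The sketch below is hypothetical throughout: the class `GridChecks`, the encoding of a square grid as strings with '#' for black squares and any other character for white ones, and the method names are all assumptions.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of grid format checks; assumes a square grid.
final class GridChecks {
    /** True if (r, c) is inside the grid and not a black square. */
    static boolean white(String[] g, int r, int c) {
        return r >= 0 && r < g.length && c >= 0 && c < g[r].length()
                && g[r].charAt(c) != '#';
    }

    /** 180 degree symmetry: each square matches its rotated counterpart. */
    static boolean isSymmetric(String[] g) {
        int n = g.length;
        for (int r = 0; r < n; r++) {
            for (int c = 0; c < n; c++) {
                if (white(g, r, c) != white(g, n - 1 - r, n - 1 - c)) {
                    return false;
                }
            }
        }
        return true;
    }

    /** Every white square has a white neighbour both across and down. */
    static boolean allChecked(String[] g) {
        for (int r = 0; r < g.length; r++) {
            for (int c = 0; c < g.length; c++) {
                if (!white(g, r, c)) {
                    continue;
                }
                boolean across = white(g, r, c - 1) || white(g, r, c + 1);
                boolean down = white(g, r - 1, c) || white(g, r + 1, c);
                if (!across || !down) {
                    return false;
                }
            }
        }
        return true;
    }

    /** Flood fill from the first white square must reach every white square. */
    static boolean isConnected(String[] g) {
        int n = g.length, total = 0, startR = -1, startC = -1;
        for (int r = 0; r < n; r++) {
            for (int c = 0; c < n; c++) {
                if (white(g, r, c)) {
                    total++;
                    if (startR < 0) {
                        startR = r;
                        startC = c;
                    }
                }
            }
        }
        if (total == 0) {
            return true;
        }
        boolean[][] seen = new boolean[n][n];
        Deque<int[]> stack = new ArrayDeque<>();
        stack.push(new int[] {startR, startC});
        seen[startR][startC] = true;
        int reached = 0;
        int[][] steps = {{1, 0}, {-1, 0}, {0, 1}, {0, -1}};
        while (!stack.isEmpty()) {
            int[] sq = stack.pop();
            reached++;
            for (int[] s : steps) {
                int r = sq[0] + s[0], c = sq[1] + s[1];
                if (white(g, r, c) && !seen[r][c]) {
                    seen[r][c] = true;
                    stack.push(new int[] {r, c});
                }
            }
        }
        return reached == total;
    }
}
```

A grid that passes all three checks is only a valid shape; whether it can be filled from the entry list is the harder and more interesting part of the project.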
Web Projects Source Code
Obtain source for the crawlers, includers, survey form and html generators here.
