Database Conversion Tools

From time to time I need to convert tables from one database management system (DBMS) to another. Mainly I convert to a format usable with PostgreSQL, but these tools should be adaptable to other DBMS, too.

Ideally, the source DBMS has a "dump" or "export" feature that can write the contents of each table as a text file, one line per row, with a special character (typically a TAB) separating the columns. If not, it's an easy thing to generate such a file using SQL. Like most DBMS, PostgreSQL can readily import a delimited file.

Unfortunately, some DBMS can't dump correctly. In other cases, the personnel controlling the data are unable, or unwilling, to deliver a clean, delimited file. I've had to develop some Bourne shell scripts to handle situations I encounter.

Column-Aligned Records
Microsoft Access Exports

Column-Aligned Records

Popular DBMS like Oracle are certainly capable of generating delimited text file dumps, but many legacy systems have been set up to deliver records with column-aligned fields; perhaps this is a holdover from punch-card days. If you're already getting the data you need, there's no need to pester the DBA to deliver the same data in a new format, just use the script parse-columns.sh to convert the column-aligned fields to delimited records. All you have to supply is the column widths and the delimiting character (default TAB). The script works as a filter, taking the column-aligned data from standard input and writing the delimited records to standard output:

parse-columns.sh

Converts records with column-aligned fields (from standard input) to delimited records in standard output. Trims leading and trailing spaces from each field.

Usage:

cat file | sh parse-columns.sh [ options ] col [ col ... ]

Where: col values are the columns that end each field

Options:

-d delim: Specify the delimiter to separate fields in output; legal values are TAB(default), slash (/), vertical bar (|), comma, and semicolon. Be sure to quote or escape the delimiter character.
-h: Display a help message listing options.

Microsoft Access Exports

The Access Legacy

In the past, database management was expensive to acquire and maintain. In-house "bread-and-butter" applications and services that keep a typical business going but don't directly generate revenue typically couldn't justify the expense of a product like Oracle and a server to run it. Instead, they did without, filling file cabinets with paper and annoying employees with piles of redundant paper forms. Easy-to-use "personal database" tools like Access offered an irresistible temptation: why not implement office-management applications in Access?

A profusion of applications arose, first to assist individual accountants or administrators, then shared among many users, each of whom ran a complete copy of the entire database on his/her own workstation. A new generation of in-house experts came into being, finding ways to bend this tool to meet real needs, despite the inescapable fact that Access, like all "office" applications, was designed for a single user, physically sitting at the workstation running the tool. Multi-user support, concurrent access, remote access and, eventually, web-based access, all had to be retrofitted. This helped create a legacy of applications and services that eventually outgrew Access, which required expert attention to work under the strain.

Today, production-quality DBMS, and the platforms to run them, are available free on any typical GNU/Linux CD. More and more in-house "bread-and-butter" applications and services are now web-based, using the Apache webserver, PHP for the code and user interface, and PostgreSQL or MySQL as the DBMS. But somebody has to migrate the data from the legacy Access apps; typically, this is up to whoever develops the new LAPP or LAMP services, like you and me.

The Access Export Problem

Astonishingly, the Access table export feature is broken; it seems that the highly-capable developers at Microsoft inexplicably overlooked the need to "escape" end-of-lines and other special characters inside text fields so the importing DBMS (even Access itself!) doesn't confuse them with field or record delimiters. Even worse, the export file often simply omits large sections, beginning in the middle of one record and ending in the middle of another, many records later. There is no way even a painstakingly careful import tool can recover data which was never exported in the first place.

The Access Export Solution

After much experimentation, with wailing and gnashing of teeth, I've found a solution. It appears that Access can successfully export tables to the HTML format without losing or corrupting data, in a format that is readily parseable. The Microsoft developers apparently realized that the average user could readily tell if the HTML was broken, and took care to escape all special characters.

To convert, you must open the database "backend" in Access (currently F11, or shift when opening the file), right-click on each desired table, select Export, and then select HTML under "Save as type". Naturally, you'll need to move these HTML files to a platform capable of running shell scripts and tools like sed and grep; presumably, that's the machine that will run your new system.

Once you have your exported HTML files, use the Bourne shell script html2tabbed.sh to convert them to tab-delimited format. The script works as a filter, taking the HTML data from standard input and writing the delimited records to standard output:

html2tabbed.sh

Converts an HTML table dump to tab-delimited format suitable for import into a database in PostgreSQL or other DBMS. Assumes input has the form:

Table row (tuple)	=>	HTML table row <TR>...</TR>
Table cell (column)	=>	HTML table cell <TD>...</TD>

This is the format that Microsoft Access uses when exporting a table as HTML. HTML tags may contain attributes and need not be uppercase. Filter ignores any markup not contained in table rows, and replaces certain HTML special characters with DBMS-safe equivalents:

This:	Becomes this:	Meaning
\	\\	backslash
<BR>	\r\n	CR, LF (DOS end-of-line)
"	"	double-quote
&	&	ampersand
>	>	greater-than
<	<	less-than
	\t	TAB
	\n	LF (Unix end-of-line)
	\v	vertical tab
	\r	CR (Mac end-of-line)

Usage:

html2tabbed.sh -h cat file.html | html2tabbed.sh [ -h ]

Options:

-h: Display a help message listing options.

Limits of current version:

Assumes a separate line in input for each element, e.g:
<TR>
<TD>...</TD>
...
</TR>
Compresses any whitespace to a single space.

For large tables, the conversion can take some time, so you probably won't want to pipe it directly into the destination DBMS; instead, save the output for each table as a file and then import the file in a separate step.

Better yet, if you're using PostgreSQL, use another shell script, html2pgsql.sh, to convert each HTML file into an SQL script that imports the data into the specified table in a specified PostgreSQL database.

html2pgsql.sh

Converts Microsoft Access "Export as HTML" table dump to an SQL script that imports the data into a specified table in the specified PostgreSQL database. Uses html2tabbed.sh, which has some limits on capabilities.

Usage:

html2pgsql.sh -h html2pgsql.sh [ options ] inputfile.html table

Unless the -p option is specified, the script saves output as table.dump where table is the table name. A commented header with instructions for loading the table precedes the SQL commands. To use:

Be sure table exists in target database with appropriate columns and types.
Execute table.dump as the SQL file it is:
psql -d database -U databaseuser -f table.dump

Options:

-h: Display a help message listing options.
-p: Pass-through; write the SQL script to standard output with no instruction header.

21 July 2004

Ted Ruegsegger