Monday, October 31, 2011

Bin2Txt - Remove Non-printable Characters from a Text File with Java

I was recently working with some DB2 "unload" (i.e., export) files that I wanted to parse and then load into Oracle. The unload files contain a lot of binary (non-printable) characters, which makes them very difficult to parse. I wrote the following Java class to replace each non-printable character with a tilde (a character that does not occur in my data). The resulting DB2 unload files can be parsed as fixed-width data files.
The main problem this approach does not attempt to solve is that the DB2 unload files save numeric fields as the raw binary value, not as digits (e.g., the number 84 is unloaded as the single byte 84, which is the ASCII character "T", not the text "84"). This code does not read the DB2 "punch" (i.e., parse instruction) files, so it makes no attempt to split the files into fields itself; that is a separate exercise in my case. BTW, if there is a good way to import these files into Oracle automatically, please let me know, as I have not been able to find a better solution.
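To make the numeric-field problem concrete, here is a minimal sketch of how the same bytes read as text versus as a number. The class name `BinaryFieldDemo`, the two-byte field width, and the big-endian byte order are assumptions for illustration only; the real layout of each field would have to come from the punch file.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public final class BinaryFieldDemo {

    /** Decodes a 2-byte big-endian signed integer (an assumed field layout). */
    static int decodeShort(final byte[] field) {
        return ByteBuffer.wrap(field).order(ByteOrder.BIG_ENDIAN).getShort();
    }

    public static void main(final String[] args) {
        // The number 84 stored as a raw 2-byte value: 0x00 0x54.
        final byte[] raw = {0x00, 0x54};

        // Viewed as text, the second byte is the letter "T", not the digits "84".
        System.out.println(new String(raw, 1, 1, StandardCharsets.US_ASCII)); // prints T

        // Decoded as a number, the same bytes give back 84.
        System.out.println(decodeShort(raw)); // prints 84
    }
}
```

Note that a field like this has to be decoded from the original unload file: once the 0x00 byte has been replaced with a tilde, the numeric value can no longer be recovered.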
This code is fairly generic and can be used for purposes other than converting DB2 unload files, so if you need to replace non-printable characters in text files, you can start with this code base.

package com.threeleaf.bin2txt;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/**
 * Reads a file and replaces non-printable characters with a tilde.
 * Specifically, I want to use this to make DB2 unload files parsable with other applications so
 * that the data can be imported into Oracle.
 *
 * @author John A. Marsh
 * @since 2011-10-27
 */
public final class Bin2Txt {

    /**
     * Run this class from the command line (from the classpath root) with:
     * java com.threeleaf.bin2txt.Bin2Txt <pathAndFilename>
     *
     * @param args
     *        the filename to convert
     * @throws IOException
     *         Signals that an I/O exception (e.g., file not found) has occurred.
     */
    public static void main (final String[] args) throws IOException {
        final byte ASCII_SPACE = 32;
        final byte ASCII_CR = 13;
        final byte ASCII_LF = 10;
        final byte ASCII_TILDE = 126;

        try {
            final File file = new File(args[0]);
            final InputStream inputStream = new FileInputStream(file);
            final long fileLength = file.length();

            /*
             * Array needs to be created with an int type, so need to check to ensure that file is
             * not larger than Integer.MAX_VALUE.
             */
            if (fileLength > Integer.MAX_VALUE) {
                throw new IOException("File is too big");
            }

            /* Create the byte array to hold the data */
            final byte[] bytes = new byte[(int) fileLength];

            /* Read in the bytes */
            int offset = 0;
            int numRead = 0;
            while (offset < bytes.length && (numRead = inputStream.read(bytes, offset, bytes.length - offset)) >= 0) {
                offset += numRead;
            }

            /* Ensure all the bytes have been read in */
            if (offset < bytes.length) {
                throw new IOException("Could not completely read file " + file.getName());
            }
            inputStream.close();

            for (int i = 0; i < bytes.length; i++) {
                if (bytes[i] == ASCII_CR && i + 1 < bytes.length && bytes[i + 1] == ASCII_LF) {
                    /*
                     * Preserve line breaks (carriage return + line feed) by skipping over them.
                     * The bounds check keeps a CR in the last position from reading past the
                     * end of the array, and "continue" lets consecutive CRLF pairs be detected.
                     */
                    i++;
                    continue;
                }
                if (bytes[i] < ASCII_SPACE || bytes[i] > ASCII_TILDE) {
                    /*
                     * Replace all non-printable characters. Java bytes are signed, so values
                     * 128-255 are negative and are caught by the first comparison.
                     */
                    bytes[i] = ASCII_TILDE;
                }
            }
            /* Output file name will be the same as the input, with ".out.txt" added to the end. */
            final OutputStream outputStream = new FileOutputStream(args[0] + ".out.txt");
            outputStream.write(bytes);
            outputStream.close();
        } catch (final ArrayIndexOutOfBoundsException e) {
            /*
             * If no file was passed on the command line, this exception is generated. A message
             * indicating how the class should be called is displayed.
             */
            System.out.println("Usage: java Bin2Txt filename\n");
        }
    }
}
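For files too large to hold in a single byte array, the same rule can be applied while streaming. The following is a minimal sketch, not part of the original class: the name `Bin2TxtStream` and the `convert` method are hypothetical, and it uses one byte of lookahead to recognize CRLF pairs. For real files you would probably wrap the streams in `BufferedInputStream` and `BufferedOutputStream`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public final class Bin2TxtStream {

    private static final int SPACE = 32;
    private static final int CR = 13;
    private static final int LF = 10;
    private static final int TILDE = 126;

    /** Copies in to out, replacing non-printable bytes but keeping CRLF pairs. */
    static void convert(final InputStream in, final OutputStream out) throws IOException {
        int current = in.read();
        while (current != -1) {
            final int next = in.read(); // one byte of lookahead; -1 at end of stream
            if (current == CR && next == LF) {
                out.write(CR);
                out.write(LF);
                current = in.read(); // both bytes of the pair were consumed
            } else {
                // read() returns 0-255, so bytes above 126 are caught by the second test
                out.write(current < SPACE || current > TILDE ? TILDE : current);
                current = next;
            }
        }
    }

    public static void main(final String[] args) throws IOException {
        final byte[] sample = {'A', 0x00, 'B', 13, 10, (byte) 0xFF, 'C', 13};
        final ByteArrayOutputStream out = new ByteArrayOutputStream();
        convert(new ByteArrayInputStream(sample), out);
        final String text = new String(out.toByteArray(), StandardCharsets.US_ASCII);
        System.out.println(text.replace("\r\n", "<CRLF>")); // prints A~B<CRLF>~C~
    }
}
```

The CR at the very end of the sample is not followed by a LF, so it is treated as a non-printable character and replaced.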

Here is a batch file that will convert all the files in a given directory:

:: Classpath root that contains com\threeleaf\bin2txt\Bin2Txt.class ::
cd /d C:\projects\workspace\bin2txt\bin\
:: Put in directory where unload files are ::
for %%f in ("C:\projects\Database\Unloads\*.txt") do call java com.threeleaf.bin2txt.Bin2Txt "%%f"
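If you would rather not depend on a batch file, the directory loop can also be written in Java. This is only a sketch under some assumptions: it needs Java 7 or later for `java.nio.file`, the class name `Bin2TxtDir` and the `convertAll` method are hypothetical, and it carries its own copy of the replacement rule so that it is self-contained. It skips files that already end in ".out.txt", since those also match the "*.txt" pattern and would otherwise be converted again on a second run.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public final class Bin2TxtDir {

    /** Keeps CRLF pairs and replaces every other non-printable byte with a tilde. */
    static byte[] sanitize(final byte[] bytes) {
        for (int i = 0; i < bytes.length; i++) {
            if (bytes[i] == 13 && i + 1 < bytes.length && bytes[i + 1] == 10) {
                i++; // skip the LF as well
            } else if (bytes[i] < 32 || bytes[i] > 126) {
                bytes[i] = 126;
            }
        }
        return bytes;
    }

    /** Converts every *.txt file in dir and returns the output files that were written. */
    static List<Path> convertAll(final Path dir) throws IOException {
        final List<Path> written = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "*.txt")) {
            for (final Path in : files) {
                if (in.getFileName().toString().endsWith(".out.txt")) {
                    continue; // output of an earlier run, not an unload file
                }
                final Path out = in.resolveSibling(in.getFileName() + ".out.txt");
                Files.write(out, sanitize(Files.readAllBytes(in)));
                written.add(out);
            }
        }
        return written;
    }

    public static void main(final String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("Usage: java Bin2TxtDir directory");
            return;
        }
        for (final Path out : convertAll(Paths.get(args[0]))) {
            System.out.println(out);
        }
    }
}
```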
