The Yin and Yang of Typing

Without a little background in programming languages or computer science in general, it's entirely possible that typing systems are not something that have crossed your mind. I thought I'd take a blog entry to share some of my thoughts on how it's affecting the creation and evolution of languages.

First of all, Benjamin C. Pierce probably has a point: terminology used to refer to typing concepts is about as useful as buzzwords like AJAX or Web 2.0 these days. Be that as it may, I'm going to reach back into the recesses of what I recall from the programming languages course I took in college to recall some of this terminology.

If you aren't familiar with static versus dynamic typing or strong versus weak typing, it may be worth it to read up on those before proceeding with the rest of this blog entry. Here are a few examples of each:

  • Static/weak - C
  • Static/strong - Java
  • Dynamic/weak - PHP
  • Dynamic/strong - Python

The line between strong versus weak typing seems to be blurred as languages like these evolve. The reason for this is that each side of typing has its advantages. Strong typing allows for compile-time checking, which can serve to eliminate human error, as well as performance optimizations from being aware of types at compile-time. They can also serve to make source code more intuitive to follow in some respects. Weak typing, on the other hand, can allow for higher levels of abstraction and, by proxy, the need for less code in order to allow identical operations to be executed on multiple types. It can also allow for things like variable variables, variable functions, and other interesting features not possible in strongly-typed languages.

Yet languages on either side of the proverbial fence are drawing in strengths from the other side. Java, before limited to the flexibility that could be provided by polymorphism while still maintaining strong typing, introduced generics in 1.5, whereby typing was still enforced but a higher level of logic abstraction was enabled for developers. By the same token, PHP has had explicit typecasting for a while and more recently in 5.1 introduced type hinting for array and object types (which may extend to scalar types in later versions). C# in 3.5 adds type inferencing, which while it's only syntactic sugar at least alleviates the need for verbosity when performing the most common method of initialization (i.e. setting a variable of a given class to an object instance of that class, as opposed to one involving a subclass of one or more of the classes involved).

It's also becoming commonplace for dynamically typed language interpreters to get ported to Java and .NET in order to leverage the features of those languages and the native libraries of the host language in the existing execution environment. Take these examples for instance.

In short, some level of control over typing is obviously a desired feature in any useful language. As well, I don't think a language can be truly useful without having a bit of both worlds to some degree. The reason for the existence of programming languages is to enable developers to control machines whose primary purpose is to manipulate data (and, as has been pointed out many times before, are stupid and do what we tell them to do). If control over said manipulation is hampered by the typing system, it hampers the effectiveness of the language. In this, I have to agree with Ludwig Wittgenstein, who said, "The limits of my language mean the limits of my world."

More Oracle and Java Woes

Today I continued the trek toward completing the project described in my last entry. Though I don't think I ran into as many issues today as I did in the past week or so of working on the project, today certainly had it's fair share.

First up was a rather interesting exception being thrown by a JDBC operation, namely "java.sql.SQLException: SQL string is not Query." This is apparently intended to be JDBC's way of explaining that PreparedStatement.executeQuery() doesn't work for DML operations. To execute one of those, you have to use either execute() or executeUpdate(). Thankfully, a forum thread was able to point me in the right direction on that one.

Next on the list, if Oracle JDeveloper 10.1.3.3.0 tells you "The WAR file is already up to date," don't believe it! I don't know what logic it's using to decide whether or not the class files constituting a WAR file are out-of-date, but there are definitely some cases where it's flawed. I spent a better part of the morning trying to figure out why everything from undeploying and redeploying the EAR file to bouncing the OAS installation was still giving me illogical output. Come to find out, I didn't know the WAR file not being updated was relevant to the problem at the time, but it certainly proved to be in the end! Tried searching for a bug report on this, but came up empty, so maybe it's just me.

Last but not least, I take issue with the language used in the mod_plsql User's Guide to describe its process of file upload handling. Though it never explicitly states this, it seems to imply that the internal handling of performing an INSERT operation to place data for an uploaded file into the document table takes place in a separate transaction from that of the action procedure that gets executed afterward.

You have to go to the PL/SQL User's Guide to read why this is not the case. To sum it up, a transaction can span multiple procedures. A procedure being executed as a data cartridge operates within a transaction that is implicitly committed when that procedure terminates so long as no uncaught exceptions are raised. However, until that point, the effects of any DML operations executed are only visible to the procedure. This includes the INSERT procedure performed by mod_plsql on the document table. What this effectively means is that the only way something other than the procedure can see that the inserted record exists unless the procedure does an explicit COMMIT.

If you read my last post, you know that I was calling a servlet from the data cartridge. You can probably imagine the amount of aggravation this caused me when I ran my servlet locally without issue, had to backtrack to figure out where the servlet was failing when it was deployed, and then found out that a single COMMIT statement at the beginning of my data cartridge procedure made things work as expected. So, yay for lacking Oracle documentation.

I did get the servlet working, though. It can now pull data from the database, convert it from Excel binary to CSV format, and put the converted data back into the database. So, the Clean Content API, while not specifically designed for the purpose for which I'm using it, is at least a somewhat capable solution. That basically sums up my day, folks. I'll be back on Monday to do it again.

Extracting Data from Excel with Oracle Clean Content

I got assigned an interesting project at work recently. It involved receiving a file upload via PL/SQL. This in itself is relatively trivial and easy to accomplish when running data cartridges on Oracle Application Server via mod_plsql. What was less unremarkable about the nature of the task was that the uploaded file was intended to be a Microsoft Excel binary file containing a single worksheet. Unfortunately, PL/SQL isn't so divergent in its available native packages that it has readily available functionality to easily handle this situation.

Luckily, my boss had recently visited the annual Oracle OpenWorld conference and while there learned of a new technology of theirs that could help: the Outside In Clean Content API. I'm uncertain as to whether this product came under Oracle's branding as the result of a merger, buyout, partnership, or what have you. After poking around the net, I saw that it has thus far received very little coverage, presumably because it was a relatively new release.

Clean Content's primary purpose is to remove "identify and remove sensitive, confidential or proprietary metadata and hidden information from Microsoft Office documents." Of course, to be able to accomplish this, it needs to be capable of extracting said data from these document formats. As a side feature of sorts, they expose this functionality in their API, which is available in the form of C++, C#, and Java libraries.

Originally, when I began work on the project, the requirements stated that the uploaded file would be in CSV format. The format requirement was changed later, after I had developed a prototype capable of handling a CSV file. To adapt my existing work to this new requirement, I developed a Java servlet to supplement it, which the data cartridge would call using UTL_HTTP.REQUEST.

This servlet received the name of an uploaded Excel file, used JDBC to pull the binary data from the local database, used the Clean Content API to convert it to CSV format, and used JDBC again to put the converted data back into the same table. It didn't end up amounting to much in the way of LOC, but it did require some learning on my part.

First off, the Clean Content API is structured in a SAX-like fashion. The best resources to learn it are actually both included in the free download: the Developer's Guide and the JavaDoc API documentation. Examples in the former show how to restrict the API to analysis only (i.e. not modifying the document data), provide in-memory data to the API (via a ByteBuffer), and how to specify a handler class to intercept events. You may have to peruse several examples to find all this out, but it's all there if you take the time to read through it (and selectively skip all the parts having to do with document manipulation).

Your handler class has to extend the BaseElementHandler or GenericElementHandler class in the API. I recommend the latter during development, as its start() method can help in the debugging process by showing you what data is being extracted.

The startTextCell() method will indicate when the parser is within a spreadsheet cell containing textual data. However, the TextCellElement it receives contains only coordinate information, not the value of the cell. (Quick note: the coordinate system is 0-based, meaning that the coordinates of the first cell of the spreadsheet are 0, 0.) To actually capture the text, you have to use the text() method. This is a little confusing, but the reason is that it's possible to encounter textual metadata outside of the spreadsheet cells. A simple class flag property can be used so you know when you are or aren't within spreadsheet cells when this event occurs.

The startDataCell() method indicates when numeric data is encountered. Something worth mentioning here is that the Excel binary format houses dates as integers. To convert such a number back to its equivalent date, take the date 1/1/1900 and add that number of days to it using GregorianCalendar.add(). An example of this is 39,085, which corresponds to 1/3/2007. You can further format this further by passing the return value of GregorianCalendar.getTime() to SimpleDateFormat.format().

One oddity I ran into during development that was unrelated to the Clean Content API was with the JDBC library. I executed a SELECT query, got back a ResultSet object, and then attempted to call ResultSet.getBytes() to place the value of a BLOB column into a byte array. This was so I could pass that to ByteBuffer.wrap() to be used with the Clean Content API later. However, the returned byte array always came back severely truncated judging by its length and the fact that the Clean Content API could not determine the document type based on it. I wasn't able to get around to examining the content byte by byte to determine the cause of this, but I did find a solution: ResultSet.getBlob() returns a Blob object and Blob.getBytes() returns the needed (complete) byte array. Apparently Oracle condones this method of obtaining the value, so rather than beat myself up trying to figure out the weirdness that is this situation, I followed the well-beaten path.

Beyond troubleshooting these oddities, along with relearning how to write servlets and learning how to test them in Oracle JDeveloper and deploy them using Oracle Enterprise Manager (and running into this issue in the process), the process of implementing these project requirements was pretty straightforward. Hope my learning experiences end up helping someone else out there. I'm sure there are other existing solutions that could have been applied here, but if nothing else, it showed that there's more than one way to skin this cat.

Page:  1