I’ve been doing a lot of thinking about language design recently. I just switched from a couple of years of Visual Basic hackery back to the more familiar territory of Java, and while most of my time has been spent reacquainting myself with the landmarks of my college town and checking out the features of the new mall, I’ve also had a chance to check out some of the neighboring countryside. The up-and-coming development called C# is growing nicely, and while I was wandering the old town suburbs of Bash scripting and Perl I ran into the charming subdivisions of Python and Ruby, each with plenty of features of its own.
But while I was born programming FORTRAN and shortly thereafter moved to BASIC, my hometown language will always be LISP. And while I like many of the features I find in modern languages, I find myself still hankering after LISP’s elegant s-expressions and the ability to compose arbitrary data structures with them.
I suppose that’s the same nostalgia a C programmer gets for the ability to create arcane constructs like a dereferenced array of pointers to functions. And I wouldn’t want to give up my cherished Java packages, objects and methods (or C#’s namespaces, objects and methods, or VB’s references, objects, and methods) in favor of (load “myfile”) just to get s-expressions. But I suspect that C programmers are far happier with the tradeoffs they made moving to C++ than I am with the ones I made moving from Lisp to Java.
The C programmer loses some speed and freedom in C++, but keeps all of his old operators while gaining classes, inheritance, and the Standard Template Library. I, on the other hand, gain classes, inheritance, platform independence, and a vast library — but Java’s collection classes are a poor substitute for s-expressions and Lisp’s list operators.
This isn’t the only thing that you lose. Java and Visual Basic both have good regular expression libraries, but they’re a pain in the butt to work with compared to the elegant integration you find in Perl, Python or Ruby. And there are many other language technologies that have arisen in recent years — the slice notation for sequences from Icon which is now found in Python and Ruby, the interned immutable strings of Java and Python, list comprehensions in Python, hashes from Perl and Python, packages and namespaces from Java and C# — that haven’t yet migrated to as many other languages as they should.
I know different languages have different purposes. A shell script is not a scripting language, and a RAD tool is not for programming provably correct programs. But, damn it, programmers should be able to USE these language technologies, no matter what language they come from! Why can’t I say something like “foreach i in [0..9] do println myArray[1:i].toString();” in just about any language to print a triangle of array values, rather than the torturous process I have to go through to do this in most normal languages?
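For comparison, here is a rough sketch of that one-liner in Python, which already has slices. The names my_array and triangle and the sample data are made up for illustration, and the sketch assumes that myArray[1:i] means "the first i elements", which Python spells [:i].

```python
my_array = list("abcdefghij")  # hypothetical sample data


def triangle(values):
    """Return one line per prefix of values: the first item, the first two, and so on."""
    return [
        " ".join(str(v) for v in values[:i])  # values[:i] is the first i elements
        for i in range(1, len(values) + 1)
    ]


for line in triangle(my_array):
    print(line)
```

The whole loop body is a single slice expression, which is the kind of brevity the wished-for syntax is after.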
So, I’ve decided to do something about it. I’m going to design my dream language on paper, and then all you language zealots can tell me why your particular language trumps it. I’m going to collect my favorite language features in my blog, along with comments about which features work well with each other. I don’t want to create a kitchen sink of a language like PL/I that no one would use; I want to collect a list of safe language features, syntactic constructs, and useful operators that anyone ought to be able to include in their language, and then start discussing how we can use these more effectively in future language design.
To start with, here are a few language features I’ve come across that I think are cool — or, more pointedly, that I think should be a natural part of any language other than low-level system workhorses and toy language-theory workbenches:
Regular Expressions.
Awk, Perl and Python programmers take these for granted. Programmers in other languages should be able to as well. Visual Basic and Java expose regular expression frameworks which you can access in a clunky way using object-oriented notation, but there’s something to be said for syntactic support at the level of the =~ operator. Other languages, like C and Lisp, simply leave you twisting in the wind trying to roll your own. No more, I say to you future language designers: go thee emulate “$scalar =~ /bladeblah/”, or improve upon it. Enough said.
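To make the contrast concrete, here is a minimal sketch of the library-style approach using Python's standard re module, where the match goes through a function call rather than an operator. The helper name first_match and the sample string are assumptions for illustration only.

```python
import re


def first_match(pattern, text):
    """Return the first substring of text that matches pattern, or None."""
    match = re.search(pattern, text)
    return match.group(0) if match else None


# Roughly the test that Perl writes as: $scalar =~ /bladeblah/
scalar = "some bladeblah here"  # hypothetical input
if first_match(r"bladeblah", scalar):
    print("matched")
```

It works, but the pattern, the call, and the result object are three separate things to juggle, where the =~ form folds them into one expression.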
Slice Notation.
For the longest time, I never thought of using arrays any other way than the usual: “declare myArray[size]; pass myArray = arrayOund; get myArray[element];”. Then I saw Python’s slice notation myArray[3:5] and saw the light. Why shouldn’t I be able to refer to the subelements of an array by something as simple as [3:5]? Or everything to the end of the array as [5:]? And do the same for strings as in “This that the other”[3:5] when the language supports viewing strings this way? I guess my point is that you as a language designer may not want that special syntax because it wrecks the purity of your object oriented syntax model, tweaks your function calling notation, doesn’t fit with your ideas of programs as data, or simply because you, Larry, want the colon. In the end, people have to use the damn language to do things. While a simple, clean syntax makes rare things possible, it can make easy things hard. Get over it and add slices to your language.
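As a quick sketch, here are the slices mentioned above as they behave in Python. The list contents are hypothetical sample data. Note that Python's slices are half-open, so [3:5] covers indexes 3 and 4 but not 5.

```python
my_array = [10, 11, 12, 13, 14, 15, 16, 17]  # hypothetical sample data

middle = my_array[3:5]             # elements at indexes 3 and 4
tail = my_array[5:]                # everything from index 5 to the end
text = "This that the other"[3:5]  # strings slice the same way as lists

print(middle, tail, repr(text))
```

The same two-character notation works on lists and strings alike, with no method names to remember.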
Coexistence of Object-Oriented and “Bare Metal” Types
I know from experience you can use Java in both a pure object-oriented, LISP-like high-level way and a low-level, C-emulation mode, with a consequent tradeoff of programming flexibility for speed and power. I think part of Java’s success is its vast library of objects, which in turn can use bare ints, booleans and floats to do the meat of the programming. I think C# will become even more successful for a similar reason, because it provides even more opportunity to manipulate the metal while letting you fly off into high-level object land if you need to.
How Does It All Fit Together?
It doesn’t yet. If I were to throw all the items in my list into some bastardized example I’d get something like:
println "The first ten characters of the method name are: "
foreach character in ( strMethodCall =~ /[A-Z](.*)\./ ).[1:10] do
println " " + character
Hm.
I’m not sure I’d want to program in that yet. How are blocks indicated – by indentation? Ick. Where do statements end? Can we omit the semicolons? Should we add braces? Can we come up with a better syntax for regular expressions than the gawdawful “/[A-Z](.*)\./”, or do we just stick with it because it’s standard? Do we call the loop construct “for” as in Python, or “foreach” because we want to reserve “for” for a C-style loop?
I have another 30 or so things on my list, and I’m not going to go into all of them in this essay, saving them for future entries instead. I’m going to keep at this sounding board for a while, proposing useful constructs I’ve mined from reading language definitions, in the hope of finding a basic set of syntactic constructs that are clear, useful, productive, and most of all, satisfy the principle of least astonishment: a C or Pascal or Lisp programmer should be able to move to this new language and see its programming language constructs are somehow … familiar, even if they’ve never used them before.