Sketchy LISP |
By Nils M Holm,
2006,2007,2008 Buy a copy at Lulu.com |
An Introduction to Functional Programming in Scheme
| Previous Section | - Contents - Index - | Next Section |
This section will demonstrate some things you can do with strings
and characters. In an earlier section, predicates for comparing
chars have been introduced, but chars have some other attributes as well.
For example, there are upper and lower case
characters. Scheme provides a set of predicates for testing the
properties of chars such as
char-alphabetic?,
char-numeric?,
char-upper-case?,
char-lower-case?
and
char-whitespace?. The effects of these
procedures are quite obvious, so they will not be explained in detail.
The following procedure prints all properties of a given char
using these predicates:
(define (char-properties x)
(apply append
(map (lambda (prop)
(cond (((car prop) x) (cdr prop))
(else '())))
(list (cons char-alphabetic? '(alphabetic))
(cons char-numeric? '(numeric))
(cons char-upper-case? '(upper-case))
(cons char-lower-case? '(lower-case))
(cons char-whitespace? '(whitespace))))))
Given this definition,
(char-properties #\X) => (alphabetic upper-case) (char-properties #\0) => (numeric) (char-properties #\newline) => (whitespace)
The char-properties procedure is quite straight-forward but
rather artifical. It merely serves as a brute-force example for illustrating
a large number of char predicates at one time. The only thing that is
interesting about it is the fact that it uses another instance of the
combination of apply and map. In this example,
append is applied to the result of map in order
to remove the empty lists generated by predicates that where not satisfied.
Without applying append, the results of
char-properties would look a bit messy:
((alphabetic) () () (lower-case) ())
By applying append, the sublists of the result get
appended. Because appending an empty list to a list x
yields x, the empty lists are eliminated from the result.
The creation of flat lists using apply and append
is a useful tool, as will be demonstrated later in this section.
Another thing you can do with chars is to convert their case.
The char-upcase
procedure converts a lower case character to upper case, and the
char-downcase
procedure converts an upper case char to lower case:
(map char-upcase '(#\a #\- #\Z)) => (#\A #\- #\Z) (map char-downcase '(#\a #\- #\Z)) => (#\a #\- #\z)
As you can see, both of these procedures pass through characters that are not to be converted. That is about all you can do with chars in Scheme.
Because strings are more complex than chars, the things you can do with them are much more interesting, too. For instance, you can create a list of all sub-strings of a string or you can create permutations of strings, which is useful for finding anagrams. Here are some sample solutions to these problems.
To extract a sub-string from a string, the
substring
procedure is used. It works as follows:
(substring "hello" 1 4) => "ell"
The first argument of substring is the source string.
Its second argument specifies the position of the first character to
extract and its third argument the position of the first character
not to extract. The position of the first character of the
source string is zero (all string procedures of Scheme use this convention).
The third argument of substring must not be less than
the second argument. The two arguments may be equal, though. In this
case, an empty string is extracted:
(substring "hello" 3 3) => ""
Because of this convention, it is even legal to extract an empty string from an empty string:
(substring "" 0 0) => ""
Specifying an offset that is outside of the source string or a negative range (second argument greater than third) is undefined:
(substring "abc" 3 4) => bottom (substring "abc" 1 4) => bottom (substring "abc" 2 1) => bottom
Upon success, substring creates a new string
that contains the characters that were extracted from the source string.
Using substring, writing a procedure that creates
a list of all sub-strings of a string is quite straight-forward.
The first step is to create the procedure fixed-length-substrings
which creates a list of all sub-strings with a fixed length:
(define (fixed-length-substrings s n)
(let ((len (string-length s)))
(letrec
((f-l-subs
(lambda (pos)
(if (> (+ pos n) len)
'()
(cons (substring s pos (+ pos n))
(f-l-subs (+ pos 1)))))))
(f-l-subs 0))))
Using fixed-length-substrings, a procedure to extract all
possible sub-strings is easy to do. All you have to do is pass the
values from 1 to the length of the given string to
fixed-length-substrings and collect the results. The
string-length
procedure for computing the length of a string already has been
introduced in the code above, so here is the code of the
sub-strings procedure that returns all sub-strings of a
given string:
(define (sub-strings s)
(let ((len (string-length s)))
(letrec
((subs
(lambda (n)
(if (> n len)
'()
(append (fixed-length-substrings s n)
(subs (+ n 1)))))))
(subs 1))))
Here are some sample applications of sub-strings:
(sub-strings "") => ()
(sub-strings "x") => ("x")
(sub-strings "Scheme")
=> ("S" "c" "h" "e" "m" "e" "Sc" "ch" "he" "em" "me"
"Sch" "che" "hem" "eme" "Sche" "chem" "heme" "Schem"
"cheme" "Scheme")
There is no magic involved in any of these procedures. The only
interesting part of fixed-length-substrings and
sub-strings is the fact that they use nested applications
of let and
letrec
in order to save an argument in their inner procedures. For instance,
subs could have been implemented as a two-argument
procedure that carries len with it. However, len
is constant and the creation of a context only has to be done once
while the additional argument would have to be passed along each
time subs recurses. The creation of an additional context
is usually cheaper and often more readable than an additional
argument.
The second procedure to be constructed in this section is a
procedure that generates
permutations of a
string. A permutation of a string is a string of the same
length and containing the same characters but in a different
order. For example "bca" is a permutation of
"abc". The permute procedure to be
written here, however, will produce all permutations
of a string. For the input "abc" it will return
("abc" "acb" "bca" "bac" "cab" "cba")
The first step towards a solution is, of course, to investigate how permutations are created systematically. Here is a first approach:
For each character C of a string S
-- extract C from S, giving a string S'
that contains all characters but C;
-- attach C to all permutations of S'.
What this solution is lacking is a trival case. In fact there are two, because there are no permutations for empty strings, and there is only one permutation (the string itself) for single-character strings, so here is an improved recipe:
The permutations of a string S are created this way:
-- if S is empty, return ();
-- if S contains only one character, return (S);
-- otherwise:
for each character C of S:
-- extract C from S, giving the rest S';
-- attach C to all permutations of S'.
The extraction of a single character from a string is done by the
string-ref
procedure which takes a string and a position as arguments and extracts
the char at the given position:
(string-ref "abcd" 2) => #\c
As usual, the first character is at position zero. A procedure that creates a substring containing all characters but the one at a given position is simple, too:
(define (all-but-this-char s n)
(string-append
(substring s 0 n)
(substring s (+ n 1) (string-length s))))
There is an even simpler solution, though. When a string of the length n is rotated n times, each of its characters occurs at the first position exactly one time:
"abcd" "bcda" "cdab" "dabc"
So it would make sense to create all rotations of the input string and then map a procedure over the resulting list that creates C and S', computes all permutations of S' and then re-attaches C to each permutation. But let us do this step by step. The first procedure that is needed is one that creates a rotation of a string. Here it is:
(define (rotate s)
(string-append (rest-of s)
(first-of s)))
where rest-of and first-of are defined this way:
(define (first-of s) (string (string-ref s 0))) (define (rest-of s) (substring s 1 (string-length s)))
These two procedures have been factored out because they will be useful at a later time, when C and S' are created. Note that first-of extracts a string containing the first character and not just a char. This is helpful when appending the output of first-of and rest-of at a later time.
Rotate rotates a string by appending a string containing its first character to the rest of the string:
(rotate "abcd") => "bcda"
Based on rotate, a procedure for creating all
rotations of a string is not hard to do. The number of rotations
to perform is equal to the length of the string, so the procedure
has to iterate (string-length s) times to
collect all rotations of s:
(define (rotations s)
(letrec
((rot (lambda (s n)
(if (zero? n)
'()
(cons s (rot (rotate s) (- n 1)))))))
(rot s (string-length s))))
Rotations delivers all rotations of the given string:
(rotations "abcd") => ("abcd" "bcda" "cdab" "dabc")
All that is left to do now is to write the permute
procedure itself. The two trivial cases are easy to implement. The
hard part is the procedure to be mapped over the rotations of the
input string. It has to extract the first character of each
rotation (giving C and S'), permute the
remaining string S', and then attach C to
each permutation. Because it maps each rotation to a list of
permutations, it uses map to re-attach C.
This leads to a nested application of map, which might
look a bit confusing at a first glance, but the procedure will be
explained in great detail immediately. Here is the code:
(define (permute str)
(cond ((= (string-length str) 0) '())
((= (string-length str) 1) (list str))
(else (apply append
(map (lambda (rotn)
(map (lambda (perm)
(string-append
(first-of rotn)
perm))
(permute (rest-of rotn))))
(rotations str))))))
And here is what happens inside of permute: When a string,
say "abc", is passed to it, the general case kicks in. It
first creates all rotations of the input string:
(rotations "abc") => ("abc" "bca" "cab")
Then the following procedure is mapped over the list of rotations:
(lambda (rotn)
(map (lambda (perm)
(string-append (first-of rotn) perm))
(permute (rest-of rotn))))
The first thing this procedure does it to create all permutations of the rest of the current rotation. When the recursively called permute returns, it has mapped the rest of one rotation into a list of its permutations:
(permute "bc") => ("bc" "cb")
The inner map then pastes the string containing the first
character of the current rotation to each member of that list, as the
following expression demonstrates:
(let ((rotn "abc"))
(map (lambda (perm)
(string-append (first-of rotn) perm))
'("bc" "cb")))
=> ("abc" "acb")
When the outer map finishes, it has created combinations of all possible first/rest pairs:
(("abc" "acb") ("bca" "bac") ("cab" "cba"))
All that is left to do at this point is to append the sublists, so
that a flat list of permutations is returned. This is done by using
the combination of apply and append mentioned
earlier in this section.
The permute procedure works for strings of any size, but it slows down significantly as the string grows. There is a pretty obvious optimization, though. It is based on the observation that the permutations of each two-character string are equal to the rotations of that string:
(permute "ab") => ("ab" "ba")
(rotations "ab") => ("ab" "ba")
The permutations of two-character sub-strings are created quite frequently during evaluations of permute. There are n! permutations of a string of the length n, which means that permute recurses
(* n (- n 1) ... 2 1)
times to create all permutations. To do so, it combines each of the possible n first characters with
(* (- n 1) (- n 2) ... 2 1)
permutations. To do so, it combines
(* n (- n 1))
possible prefixes with
(* (- n 2) (- n 3) ... 2 1)
possible permutations. Iterating this formula further finally yields
(* n (- n 1) ... 3)
prefixes and two permutations. In other words: n!/2 prefixes are combined with permutations of two-character strings, so permute spends almost half of its time generating permutations of two-character strings. Adding another trivial case to handle these strings should speed up the procedure significantly. Here is the optimized code (the additional clause prints in boldface characters):
(define (opt-permute str)
(cond ((= (string-length str) 0) '())
((= (string-length str) 1) (list str))
((= (string-length str) 2) (rotations str))
(else (apply append
(map (lambda (rotn)
(map (lambda (perm)
(string-append
(first-of rotn)
perm))
(opt-permute (rest-of rotn))))
(rotations str))))))
Most Scheme environments provide a facility that allows the user to measure some properties of an algorithm. One property that is frequently measured is the number of conses during program execution, i.e. the number of memory cells allocated. Using such a facility, the plain version of permute can be compared to the optimized version. Here are the results: [Footnote: Results where obtained using SketchyLISP's :statistics option. SketchyLISP is a now obsolete interpreter that was described in the first edition of this book.]
; Expression Nodes allocated Reduction Steps (permute "abcde") 3,228,729 574,788 (opt-permute "abcde") 2,374,935 425,690
The actual numbers may differ between environments, but these numbers show that the optimized version saves about 35% of the reduction steps and also about 35% of the used memory. This is not too bad for a single-line optimization.
Procedures like permute seem to allocate vast amounts of memory. When creating the permutations of a seven-character string, the procedure allocates about 78 million nodes in the SketchyLISP interpreter. Given a node size of 9 bytes (on a 32-bit architecture), this makes about 702 million bytes, and the procedure does not even seem to contain any code to release unused storage. So how does memory management work in Scheme?
From a programmer's point of view, this is simple: just keep
allocating things and Scheme makes sure that unused stuff is
recycled and returned to the pool of free memory. Sounds too
good to be true? There must be a catch? No, there is none.
Automatic memory management has made quite some progress in
the past years, and there is really no excuse for managing
memory manually
these days. Let us see how automatic
memory management works.
Many Scheme procedures allocate memory. In the permute
procedure, substring was used heavily. Each time
substing is called, it allocates a new string. The same is
valid for string-append, and even string-length
might create a boxed integer.
Recursive procedures like permute create really huge
amounts of intermediate results. (Opt-permute "abcdefg")
creates about 700M bytes of intermediate data, but delivers just a list
of 5040 seven-character strings, allocating about 85K bytes. However,
the peak memory load created by opt-permute is only about
190K bytes when run in SketchyLISP.
[Footnote: You can verify this by using the :gc meta
command of SketchyLISP.]
To be able to allocate 700M bytes of memory in a pool that may be as small as one megabyte, Scheme has to recycle unused data on a regular basis. It normally does so when the free space of a memory pool runs low. When this happens, it marks all data that can be accessed at the current time and releases the rest to the pool of free memory (which is also called the freelist by LISP programmers).
Note that Scheme actually proves that an object is no longer
usable before it recycles its memory. It never throws away data
that you might want to use at a later time. How does this work? Any
object can only be accessed as long as it is referred to by a variable
or contained in a list (or vector) that is referred to by a variable.
When an object is not accessible through any variable any longer, it
becomes garbage
. For example, when passing the
string "Hello, World!" to string-length,
this string is bound inside of string-length as long as
the procedure reduces. As soon as it returns its result, though,
the string becomes uninteresting:
(string-length "Hello, World!") => 13
There is no way to access the string at this point, because it is
no longer referred to by any variable. If you want to use a string
containing the same letters again, you have to create a new one. The
same happens in recursive procedures. This principle will be explained
by means of a sample implementation of string-length:
(define (string-length s)
(letrec
((str-len
(lambda (x r)
(if (null? x) r (str-len (cdr x) (+ 1 r))))))
(str-len (string->list s) 0)))
When a string is passed to this procedure, it is bound to s, converted to a list, and then passed on to str-len where the list is bound to x. At this point there is a string and a list kept in memory (and the integer r, but it does not matter in this explanation). When the general case occurs in str-len, the cdr part of the list is passed to str-len. This causes the cdr part of the original list to be bound to x. Because str-len is tail-recursive, the original binding of x is discarded. At this point, no variable refers to the full list any longer, so the car part of the list becomes garbage. At each iteration, one member of the list becomes useless, and when the trivial case is reached, the entire list is garbage. Once str-len is called, the memory usage of the program decreases. The program releases memory, and it does so without any explicit instructions to do so.
The process of recycling unused memory is widely known as garbage collection (or in short: GC). GC got a bad name because many people think that it causes a great overhead and, even worse, may stop the program at random locations in order to reclaim storage. The latter is an artifact from quite early times. Modern garbage collectors do not stop the program to do their task. They work in the background while the program runs. The overhead caused by garbage collection (concurrent or not) is no larger than the overhead caused by any decent memory manager. The only scenario where manual memory management wins is a program that only allocates memory but never releases it. As soon as memory has to be released, the two principles run head to head, but GC frees you completely from the task of keeping track of used memory. Using GC, memory leaks cannot occur! [Footnote: A program that fails to release unused memory is said to have a memory leak.]
Another thing that modern GCs avoid is memory fragmentation. Memory fragmentation occurs when lots of small regions are randomly allocated and freed over a period of time. Doing so leaves a memory layout like this:
XXX...XXX..XXX...XX...XXX...XX..XXX...XX . = 1 unit of free memory X = 1 unit of allocated memory
Although the above region contains much more than four units of free memory, it is impossible to allocate a four-unit object in it, because the largest contiguous free region has a size of three units. Therefore, a program may report insufficient memory although sufficient memory exists - just not as a contiguous region.
What modern garbage collectors do in such cases is called memory compaction. The allocated units are moved to one end of the memory pool and the free units to the other, leading to the following layout:
XXXXXXXXXXXXXXXXXXXXX...................
After compacting memory, the same pool provides enough space for allocating a 19-unit object.
Automatic memory management does not only deliver the same
performance as explicit memory management in non-trivial
scenarios, it also makes better use of the existing memory by
avoiding fragmentation and it frees the programmer from the
tedious and superflous task of keeping track of used memory.
Using garbage collection, dangling references
(references
to accidentally freed regions) and memory leaks belong to the past.
| Previous Section | - Contents - Index - | Next Section |