Han Joosten & Dan Lau
2024-03-11
https://hanjoosten.github.io/HaskellWorkshop
Form groups of 3 to a whiteboard/flipboard, and write down some answers. We will then go from board to board, and another group will try to riff on what you have written on your board
You’ve already created your Haskell environment, right?
This gives us a great IDE
ghc
), interpreter (ghci
),
Package managers (stack
and cabal
)Given a web site, we want to scrape it and find important web pages.
This involves a lot of figuring stuff out!
Learn Haskell
Download one web page
Extract links from a page, so we can find more pages to download
Once we’re done, compute which ones are important
Make it all fast?
Open Main.hs
in the src directory. It has the following
contents:
The suffix .hs
is the standard for Haskell source
files.
File names start with a capital letter, and everyone uses
CamelCase
.
The package manager is set up to look for Main.hs
in the
src directory.
This command will install it:
stack install
The generated output will reveal the name of the executable. Run it to see the output.
Is everyone able to build and run their executable?
It’s nice to have fast, native code at our fingertips.
But when I’m working, I expect a few things:
For these circumstances, a full compiler can be a bit slow.
Instead, I sometimes use the interactive interpreter,
ghci
.
Easily done in vscode with extension ghci-helper
:
Start GHCi using stack
It will spin up ghci, load the current file, followed by a prompt:
ghci>
The ghci
interpreter evaluates expressions
interactively.
Try it out:
(That ++
is the “append” operator.)
We defined a function named main
, so let’s invoke
it:
Did that work for you?
What about this?
All interpreter directives start with a “:
”
character.
Let’s see what else ghci
is up to:
:help
tells us what directives are available.:reload
reloads the file we last
:load
ed.:quit
exits back to the shell.Abbreviation works, so :r
is :reload
Error messages can be intimidating. Once you get used to them, they contain a lot of info.
Explaination of common errors can be found at:
https://errors.haskell.org/
Double quotes are just syntactic sugar for the longer form:
What does this print?
We use white space to separate a function from its argument:
If a function takes multiple arguments, we separate them with white space:
If an argument is a compound expression, wrap it in parentheses:
Use ghci
as a calculator.
The **
operator performs exponentiation.
The notation ['a'..'z']
generates a list from start to
end, inclusive.
The sum
function adds the elements of a list.
The show
function renders a value as a string. Try
it!
The length
function tells us how many elements are in a
list.
It is pretty simple to define a new function.
Open up your text editor, create a new file with a .hs
extension, and get writing!
=
character, with the
body of the function following.Load your source file into ghci
and give
myOdd
a try.
Now we can define very simple functions, but we’re missing some important building blocks for more fun.
So let’s get to it!
Q: What does the familiar if
look like in Haskell?
A: Familiar!
We have the following elements:
A Boolean expression
then
an expression that will be the result if the
Boolean is True
else
an expression that will be the result if the
Boolean is False
The two possible results of an if
expression must have
the same type.
If then
evaluates to a String
, well
else
must too!
For instance, this makes no sense:
We are forbidden from writing ill-typed expressions like this.
In imperative languages, we can usually leave out the
else
clause after an if
.
Not so in Haskell.
Why does this make sense for imperative languages, but not Haskell?
Write a function that appends ", world"
to its argument
if the argument is "hello"
, or just returns its argument
unmodified otherwise.
++
.We already know what a list looks like in Haskell:
And of course there’s the syntactic sugar for strings:
But is this everything there is to know?
Supposing we want to construct a list from first principles.
[]
.:
operator.What about extending that list?
You’re probably guessing now that [2,1]
is syntactic
sugar for 2:(1:[])
. And you’re right!
What is the result of this expression?
We refer to []
and :
as
constructors, because we use them to construct lists.
When you create a list, the Haskell runtime has to remember which constructors you used, and where.
So the value [5,8]
is represented as:
:
constructor, with 5
as its first
parameter, and as its second …:
constructor, this time with 8
as
its first parameter, and now as its second …[]
constructorDepending on your background, I bet you’re thinking something like this:
cons
cells in Lisp!”Right on.
Of course Haskell has to remember what a list is constructed of.
It also lets us inspect a list, to see which constructors were used.
How do we do this?
A case
expression allows us to inspect a
structure to see how it was constructed.
case
and of
is the expression
we are inspecting.[]
, then clearly the name
we’re inspecting is
empty, hence not capitalized.If the constructor used was the “add to the front” :
operator, then things get more interesting.
:
constructor
is bound to the name first
.:
constructor
(i.e. everything in the list after the first element) is bound to the
name rest
.->
is evaluated with
these values.The case
expression performs what we call pattern
matching.
->
) is used as the result of the entire
case
expression.Let’s step through the machinery of what happens if we evaluate this expression.
Finally! We can write slightly more complex functions.
Now that you can inspect the front of a list, you should be able to process an entire list recursively.
First, please write a function named myLength
that
computes the number of elements in a list.
Next, write a function named countCaps
that calculates
the number of capital letters in a string.
Wow, that countCaps function was a pain, right?
Here’s my definition that uses only the machinery we’ve learned so far:
I thought Haskell was all about concision!?
We can define a function as a series of equations, each containing a pattern match.
This is nice syntactic sugar for case
.
After each |
is a guard.
(Yes, patterns in a case
can have guards too.)
Like the original version, but with use of case
stripped
out:
Both shorter and easier to follow:
Write a new version of countCaps
as follows:
length
to count the number of elements.This should give the same result as your first function. Right?
Suppose we want to count the number of lowercase letters in a string.
This seems almost the same as our function for counting uppercase letters.
What can we do with this observation?
Higher order function: a function that accepts another function as a parameter.
How can we use this to define countLowerCase
?
By now, we’ve seen several definitions like this:
This is a recurring pattern:
Haskell doesn’t limit us to giving functions alphanumeric names.
Here, we define a function named simply “.
”, which we
can use as an operator:
How to use this?
If that seemed hard to follow, let’s make it clearer.
We’ll plug the arguments into the RHS of our function definition:
We had length
as the first argument to “.
”,
and filter isLower
as the second:
Inside an expression, we can introduce new variables using
let
.
let
.in
.Haskell is sensitive to white space!
If you’re defining local variables, they must start in the same column.
This is good:
But this will lead to a compiler error:
Using function composition wherever you can, write a function that accepts a string and returns a new string containing only the words that begin with vowels.
words
and
unwords
functions before you start.Example:
Here’s how I wrote disemvowel
:
Does this remind you of a Unix shell pipeline, only right-to-left?
Given a web site, we want to scrape it and find important web pages.
We’re now Haskell experts, right?
We’d really like to rely on a library to download a web page for us.
At times like this, there’s a very handy central repository of open source Haskell software:
Go there now!
Click on the Browse link at the top of the page to browse packages.
Alas, the list is overwhelmingly long, but we can find libraries for all kinds of tasks if we’re patient.
search
can be used to restrict the list to find specific
libraries
HTTP
That still gives us 300+ packages to comb through, but at least it’s better than the 16,400 on the Packages web page.
We’ll use the HTTP client library named
http-conduit
.
We can read about it online:
That landing page for a package is intimidating, but look at the section labeled “Modules”.
What do you see?
Before we can use http-conduit
, we must tell the package
manager that we need it.
add http-conduit
to the dependencies
section in your package.yaml
file:
dependencies:
- base >= 4.7 && < 5
- http-conduit
The next time you build your program, it figures out all the other
libraries that http-conduit
depends on, and downloads,
compiles, and installs the whole lot.
While we are at it, also add the following dependencies, for you will need them later as well:
- bytestring
- utf8-string
- tagsoup
- network-uri
- containers
Expect the build to take a few minutes and print a lot of output.
Let’s build the packages we specified in our
package.yaml
file:
stack install
This will install all specified packages and all of their dependencies as well.
While we’re waiting, let’s try to figure out how we should use http-conduit
.
Remember the link to API documentation in the Modules
section of the package’s web page? Click through to the API docs.
(For now, ignore the reference to stackage
)
An API page begins with a title that looks something like this:
Network.HTTP.Simple
This is the name of a module.
A module is a collection of related code.
A package is a collection of related modules.
(This will sound familiar if you know Python.)
After the initial blurb, a module’s docs consists of type signatures and descriptions.
Here is a really simple type signature:
foo :: String
How the heck do we read this?
The name of the thing being defined comes before the
::
characters.
Its type follows after the ::
.
This means “the value named foo
has the type
String
”.
Up until now, we have not bothered talking about types or type signatures.
Every expression and value in Haskell has a single type.
Those types can almost always be inferred automatically by the compiler or interpreter.
Bool
Int
Char
Double
Here’s another type signature:
words :: String -> [String]
Here we see a new symbol, ->
, which means “this is a
function”.
The type after the last ->
is the return type of the
function.
All of its predecessors are argument types.
So this is a function that takes one String
argument,
and returns… what?
The notation [a]
means “a list of values, all of some
type a
”.
So [String]
means “a list of values, all of type
String
”.
a
is a type variable, it can be substituted with any
concrete type. Notice the lower case for type variables, upper case for
concrete types.
What’s a String
?
[Char]
,
i.e. “a list of Char
”.We can introduce new synonyms of our own.
A type synonym can be handy for documenting an intended use for an existing type.
words :: String -> [String]
We can now read that this function accepts a string as argument, and returns a list of strings.
From reading its name and type signature, can you guess what
words
might do?
Tell me about this signature:
mystery :: [String] -> String
What are some reasonable possible behaviours for this function?
Here is the very first signature from http-conduit
:
httpBS :: MonadIO m => Request -> m (Response ByteString)
This is more complex! How the heck do we read it?
The bits between ::
and ‘=>’ are constraints
on where we can use httpBS
- but let’s ignore constraints
for now.
We’ll also ignore that mysterious lowercase m
for a
bit.
What can we tell about this function?
In Haskell, a type can contain one or more type variables.
Response
is defined together with a type variable. Type
variables start with a lowercase character.
data Response body
In our case, body
is replaced by a concrete type:
Response Bytestring
So it is a Response
of Bytestring
A ByteString
is a blob of binary data.
Unlike String
, it is not represented as a list, but as a
packed array.
However, it contains binary bytes, not text!
ByteString
for working with data that you
have to manipulate as text.Now we have a ByteString
, which we need to turn into
text for manipulating.
Let’s cheat, and assume that all web pages are encoded in UTF-8.
Does everyone have http-conduit
installed now?
Add the import of the module to your file, and fire up
ghci
, so we can play with it:
import Network.HTTP.Simple
import qualified Data.ByteString.Char8 as B8
let’s load a web page!
httpBS "http://example.com/"
What is that response?
Response Bytestring
Take a good look at what just appeared as output.
Prettyprinted, it looks like something like this:
Response
{responseStatus = Status {statusCode = 200, statusMessage = "OK"},
responseVersion = HTTP/1.1,
responseHeaders = [("Content-Encoding","gzip"),("Accept-Ranges","bytes"),
etc.
Did you get a response? Yeah!
So far, all of the code we have written has been “pure”.
And yet … somehow we downloaded a web page!
So can we write code like this?
NO.
The type system segregates code that must be pure from code that may have side effects (“impure” code).
Well, let’s look at a simpler example than httpBS
.
Type this in ghci
:
:type readFile
This will tell us what the type of readFile
is.
The :type
directive should print something like
this:
Notice that IO
on the result type?
It means “this function may have side effects”.
We often refer to impure functions, with IO
in the
result type, as actions.
The type system keeps track of which functions have IO
in their types, and keeps us honest.
We can still mix pure and impure code in a natural way:
Critical to what we just saw was the do
keyword at the
beginning of the function definition.
This introduces a series of IO
actions, one per
line.
To capture the result of an IO
action, we use
<-
instead of =
.
The result (contents
) is pure - it does not
have the IO
type.
This is how we supply pure code with data returned from impure code.
This is not the return
type you’re used to!
It takes a pure value (without IO
in its type),
and wraps it with the IO
type.
Pure code can’t call impure code, but it can thread data back into
the impure world using return
.
When you write a Haskell program, its entry point must be named
main
.
The type of main
must be:
()
is named “unit”, and means more or less the same
thing as void
in C or Java.
What this means is that all Haskell programs are impure!
Try:
:t httpBS "http://example.com"
Note the type variable m
. In our case we will use
IO
for it.
In the documentation you can find function definitions like:
getResponseBody :: Response a -> a
In our case we have Response Bytestring
, so this would
become
getResponseBody :: Response Bytestring -> Bytestring
This can be used to get specific stuff from the response!
module Main (main)
where
import Network.HTTP.Simple
import qualified Data.ByteString.Char8 as B8
main = putStrLn "hello, world!"
download = do
response <- httpBS "http://example.com" :: IO(Response B8.ByteString)
let body = getResponseBody response
return body
Note that we have to tell explicitly that we want to use
IO
as concrete value for m
, for other monadic
types would fit too. (Bonus: Take a look at the error message if you
leave the type annotation out.)
Remember we were planning to cheat earlier?
We had this:
We need something whose result is an IO String
instead.
How should that look?
To do the conversion, let’s grab a package named
utf8-string
.
Remember that we added it to our package.yaml file previously? Now we
can use it. It contains a package named
Data.ByteString.UTF8
.
It defines a function named toString
:
Write an action that downloads a URL and converts it from a
ByteString
to a String
using
toString
.
Write a type signature for the action.
Use your download
function to save a local copy of the
page you just wrote.
For simplicity, let’s save the local files as names containing numbers:
To save a local copy of a file, you’ll need the
writeFile
action.
Two truisms:
So! Let’s use another library.
The tagsoup
package can parse arbitrarily messy
HTML.
It will feed us a list of events, like a SAX parser.
We allready added it to our package.yaml
file, so it is
ready for us to use!
If we pass an empty list, the head
function throws an
exception.
Suppose we need a version of head
that will not
throw an exception.
What should the ????
be?
Let’s invent something.
We’re using a constructor named Some
to capture the
idea “we have a result”.
The constructor None
indicates “we don’t have a
result here”.
To bring these constructors into existence, we need to declare a new type.
The |
character separates the constructors. We can read
it as:
The Perhaps
type has two constructors:
Some
followed by a single argument
or None
with no arguments
Actually, Haskell already has a Perhaps
type.
The a
is a type parameter, meaning that when we
write this type, we have to supply another type as a parameter:
Maybe Int
Maybe String
If we want to construct a Maybe Int
using the
Just
constructor, we must pass it an Int
.
This will not work, because the types don’t match:
We can pattern match over the constructors for Maybe
just as we did for lists.
The tagsoup
package defines the following type:
data Tag = TagOpen str [Attribute str]
| TagClose str
| TagText str
| TagComment str
| TagWarning str
| TagPosition !Row !Column
What do you think the constructors mean?
Suppose we want to write a predicate that will tell is if a
Tag
is an opening tag.
What should the type of this function be?
What should its body look like?
Our first body looked like this:
isOpenTag (TagOpen x y) = True
isOpenTag (TagClose x) = False
isOpenTag (TagText x) = False
isOpenTag (TagComment x) = False
isOpenTag (TagWarning x) = False
isOpenTag (TagPosition x y) = False
Concise, but ugly.
We really only care about one constructor.
We never use the variables x
or y
that
we declare.
We can write “I don’t care what this pattern or variable is” using
the “_
” character.
The wild card pattern always matches.
Since we don’t care about x
or y
, we
can state that explicitly using _
.
Since we don’t care about any constructor except
TagOpen
, we can match all the others using
_
.
Why don’t we write the function like this?
Suppose we have a page in memory already.
Browse the tagsoup
docs, in the Text.HTML.TagSoup
module.
Find a function that will parse a web page into a series of tags.
Parsed tags can contain a mixture of tag names.
<A HREF="...">
<a hrEF="...">
tagsoup
function that will turn tag names and
attributes to lower case.Let’s use our function to clean up the result of
parseTags
.
We only care about open tags that are links, so
<a>
tags.
How would we write the type of a function that will indicate
whether a Tag
is an open tag with the correct
name?
How would we use this function to extract only the open tags from a list of parsed tags?
This cascade is getting a bit ridiculous.
processPage url = do
page <- download url
return
(filter (isTagOpenName "a")
(canonicalizeTags
(parseTags page)))
Two observations:
Our action is now mostly pure code.
It sure looks like a pipeline.
Take this function and split it into pure and impure parts.
Write the pure part using function composition.
Let’s skip nofollow
links.
We want to get the "rel"
attribute of a tag.
How would we extract the "href"
attribute from every
element of the list?
Links can be absolute, relative, or invalid garbage, and we only want valid-looking absolute links.
To properly create an absolute link, we need to know the absolute URL of the page we’re looking at.
The Network.URI
package contains some functions we might
find handy.
This is really hard to read!
import Network.URI
canon :: String -> String -> Maybe String
canon referer path =
case parseURI referer of
Nothing -> Nothing
Just r ->
case parseURIReference path of
Nothing -> Nothing
Just p -> Just (uriToString id (nonStrictRelativeTo p r) "")
Surely there’s a better way.
Notice that that function was a series of case
inspections of Maybe
values?
Suppose we had a function that accepted a normal value, and returned
a Maybe
value.
And suppose we had a concise syntax for writing an anonymous function.
The \
is pronounced “lambda”.
The case
analysis is quite verbose. Suppose we had a
function that performed it, and called another function if our value was
Just
.
How could we use this?
canon1 referer path =
parseURI referer `bind`
\r -> parseURIReference path `bind`
\p -> Just (uriToString id (nonStrictRelativeTo p r) "")
If we enclose a function name in backticks, we can use the function as an infix operator.
The >>=
operator is a more general version of our
bind
function.
Here’s some tidier syntax that should look familiar.
process url =
map (canonicalize url) .
filter (not . null) .
map (fromAttrib "href") .
filter (\t -> fromAttrib "rel" t /= "nofollow") .
filter (isTagOpenName "a") .
canonicalizeTags .
parseTags
One awkward thing: what is the type of this function?
Go to this web site:
Type this into the search box:
What does the first result say?
In Vscode, a suggestion is done (blue wiggely line) to use
mapMaybe
.
Hoover over the blue line, select the blue dot and
apply hint "use mapMaybe"
Either
typeThe Either
type represents values with two
possibilities:
data Either a b
= Left a
| Right b
Simply put, a value of type Either a b can contain either a value of type a, or a value of type b.
Left True :: Either Bool b
Right 'a' :: Either a Char
With this type, we can use the Left
constructor to
indicate failure with some error value attached, and the
Right
constructor with one type to represent success with
the expected result.
Either
instance Functor (Either a) where
fmap _ (Left x) = Left x
fmap f (Right y) = Right (f y)
Mnemonic: Left
is wrong and Right
is right,
right?
This enables us to write code without bothering about previous errors. More on error handling can be found here.
SomeException
which is the ancestor of all
exceptions.data MyException = ThisException | ThatException
deriving Show
instance Exception MyException
These Exceptions are commonly used as type for the Left
case.
If we can download the links from one page, we can easily write a spider to follow those links.
To keep things simple, let’s set a limit on the number of pages we’ll download.
What information do we want to generate?
What do we need to track along the way?
Here’s the state we need to maintain:
The number of pages we have downloaded
A collection of pages we have seen links to, but haven’t downloaded
A collection of pages and their outbound links
For any given page, we need to remember both it and all the pages it links to.
One possibility for associating the two is a tuple:
Tuples are useful any time we want mixed-type data without the hassle of creating a new type.
Speaking of a new type, here’s how we’d define one:
We don’t want to visit any URL twice.
How do we avoid this?
This function has a problem - what is that problem?
We really want a structure with a fast lookup operation.
What would you use in your language?
In Haskell, we have mutable hash tables, but we don’t use them.
Instead, we use immutable key-value maps.
We must perform fancy module importing tricks because the
Data.Map
module defines a lot of names that would otherwise
overlap with built-in names.
This means “only import the name Map
from
Data.Map
”:
And this means “import everything from Data.Map
, but all
those names must be prefixed with Map.
”:
Everyone knows how to add a key and value to a hash table, right?
And that seems like a fundamental operation.
What do we do with maps?
How can this possibly work? Is it efficient?
Here’s a surprisingly handy built-in operator:
Why is this useful? Because it lets us eliminate parentheses.
Before:
After:
This is annoying to write:
Almost as bad:
Much handier, and identical:
In fact, this is valid:
spider :: Int -> String -> IO (Map.Map String [String])
spider count url0 = go 0 Map.empty (Set.singleton url0)
where
go k seen queue0
| k >= count = return seen
| otherwise =
case Set.minView queue0 of
Nothing -> return seen
Just (url, queue) -> do
request <- parseRequest url
page <- download request
let ls = links url page
newSeen = Map.insert url ls seen
notSeen = Set.fromList .
filter (`Map.notMember` newSeen) $ ls
newQueue = queue `Set.union` notSeen
go (k+1) newSeen newQueue
We can now:
Download a web page
Extract its links
Spider out from there, without repeat visits
What remains?
We could spider multiple pages concurrently
Or we could compute which pages are “important”
At this point, if we have miraculously not run out of time, we’re going on a choose-your-own-adventure session in Emacs.
Thanks for sticking with the slideshow so far!