Functions and Dictionaries

Learning goals:

After this week, the student will be able to:

  • Write functions and use them to structure code

  • Use dictionaries to turn text into structured objects

  • Describe the notion of a word in linguistics

Words and tokens

Before the lecture you should read these excerpts covering the notion of a word in linguistics.

Functions

A function is a reusable piece of code. Say you wanted to find the number of times a given word appears in a text. The following code could solve that problem:

ifile = open("textfiles/prince.txt")
text = ifile.read()
word = "pope"

# making the text easy to work with
for sign in ["?", "!", ".", "-", "\"", ","]:
    text = text.replace(sign, "") # removing unwanted characters
text = text.lower() # we do not care about the case
text = text.split() # making a list of words

# counting the occurrences
word_count = 0
for w in text:
    if w == word:
        word_count += 1
        
print(word_count)
33

Now say you wanted to find the word count of a different word, or in a different file. You would have to write this code again for each word and file. Your program would take a long time to write, and it would quickly become unreadable. And if you found an error in your implementation, you would have to update the code everywhere you used it.

The solution to this problem is to write a function which you can reuse each time you want to find the number of times a given word appears in a text. We will do this later, but let’s first begin with a simpler function.

Here we define the function functionName which takes an input, assigns its value to the variable parameter, multiplies it by 20, adds 5, and returns the resulting value.

def functionName(parameter):
    returnValue = parameter * 20 + 5
    return returnValue

With the function defined, we can call it anywhere in our code just like the built-in functions print and len we have used before, with parentheses after the function name.

a = functionName(2)
print(a)

b = functionName(4)
print(b)
45
85

Let’s have a closer look at how we defined the function:

First we used the keyword def followed by the function name, a pair of parentheses with the input (parameters) to our function, and then a colon (:). This first line is called the header of the function.

Then we indent the code, like for a for loop. The indented code is run top to bottom whenever the function is called. The indented code is called the body of the function.

The parameter of the function is a variable whose value is given when the function is called (2 and then 4 in the example above).

When the end of the indented code is reached, the function is exited and your code keeps running from wherever the function was called. The return keyword immediately exits the function and makes the function return a given value which can be used wherever the function was called (45 and then 85 in the example above).
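To see that return really exits the function at once, here is a minimal sketch. The function name first_long_word and the word list are made up for illustration and are not part of the course code.

```python
def first_long_word(words, min_length):
    # Hypothetical helper: gives back the first word with at least min_length characters
    for w in words:
        if len(w) >= min_length:
            return w  # exits the function immediately; the remaining words are never checked
    return "no match"  # only reached if the loop finished without returning

result = first_long_word(["a", "tiny", "prince", "kingdom"], 5)
print(result)  # prince
```

Even though "kingdom" is also long enough, the function never looks at it, because the first return has already ended the call.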

Parameters and Arguments

Parameters are the input variables in the parentheses in the header of the function. Arguments are the values you pass to the function when you call it. The distinction is not very important here, as both terms refer to the information passed into the function.

A function can have multiple or even no parameters. And parameters can be pretty much anything: numbers, strings, lists and even functions.
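The short sketch below illustrates these cases: a function with no parameters, one with a single parameter, and one that takes another function as a parameter. All three function names are invented for this example.

```python
def greeting():  # no parameters
    return "hello"

def shout(text):  # one parameter
    return text.upper() + "!"

def apply_twice(func, value):  # the first parameter is itself a function
    return func(func(value))

print(greeting())                # hello
print(apply_twice(shout, "hi"))  # HI!!
```

Note that shout is passed without parentheses. Writing shout("hi") would call the function, while shout on its own hands the function itself over to apply_twice.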

In our word occurrence example from earlier, we might want to create a function which takes both a filename and a word as arguments. This function can look like this:

def word_occurence(filename, word):
    ifile = open(filename)
    text = ifile.read()

    # making the text easy to work with
    for sign in ["?", "!", ".", "-", "\"", ","]:
        text = text.replace(sign, "") # removing unwanted characters
    text = text.lower() # we do not care about the case
    words = text.split() # making a list of words

    word_count = 0
    for w in words:
        if w == word:
            word_count += 1
            
    return word_count

Now it is trivial to change up the filename and word!

print(word_occurence("textfiles/prince.txt", "pope"))
print(word_occurence("textfiles/prince.txt", "please"))
print(word_occurence("textfiles/mitthjerte.txt", "hjerte"))
33
3
5

Parameters can have a default value. If you call a function with arguments missing, the default values will be used. If there are no default values available, you will get an error. Parameters with default values must come last in the function header!

def exampleFunc(a, withDefaultValue = 5):
    return a + withDefaultValue

print(exampleFunc(10))
print(exampleFunc(10, 30))
15
40
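The error you get when an argument without a default value is missing is a TypeError. The sketch below provokes it on purpose. The try/except construction is only used here to catch the error so that the example keeps running, and example_func is a renamed copy of the function above.

```python
def example_func(a, with_default_value=5):
    return a + with_default_value

try:
    example_func()  # no value is given for a, and a has no default value
except TypeError as error:
    message = str(error)
    print("TypeError:", message)
```

The message tells you which argument is missing, which is usually enough to find the faulty call.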

Local and global variables

In code, variables are either local or global variables. These differ in a few ways. A local variable is defined inside a function, and cannot be accessed from outside it. A global variable is defined outside a function, and can be accessed from anywhere.

In the example program below, the global variable secret is assigned the value "Nothing", then the global variables public and publicName are also defined.

Then the function sayHello is called. The local variable privateName is given the value "Luke", and the local variable secret is given the value "I am your father".

The print statement uses the global variable public, since anything can access it.

Then, outside the function, secret is unchanged by the function call, since the variable secret defined inside the function is a separate, local variable that cannot be accessed outside of it. And privateName is not defined outside the function, which gives us an error.

def sayHello(privateName):
    secret = "I am your father"
    print(f"Hello {privateName}, {public}")

secret = "Nothing"
public = "it is Wednesday"
publicName = "Luke"
sayHello(publicName)

print(secret)
print(privateName)
Hello Luke, it is Wednesday
Nothing
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-3-dcb3f32213c2> in <module>
      9 
     10 print(secret)
---> 11 print(privateName)

NameError: name 'privateName' is not defined

You should try to never use global variables inside of functions! Only use the parameters of the function, and variables you define inside of the function.
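As a sketch of what this advice can look like in practice, the greeting function can be rewritten so that everything it needs comes in through its parameters. The name say_hello and the parameter day are choices made for this example only.

```python
def say_hello(name, day):
    # Everything the function needs arrives through its parameters,
    # so it does not depend on any variable defined outside of it
    return f"Hello {name}, it is {day}"

message = say_hello("Luke", "Wednesday")
print(message)  # Hello Luke, it is Wednesday
```

A function written this way gives the same result wherever it is called from, which makes it easier to test and to reuse in another program.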

Multiple return values

When multiple values are returned from a function, they are given in a tuple. A tuple is a sequence of values, like a list, but it cannot be changed after it is created. It is defined with values separated by commas, and can be unpacked by assigning it to variable names separated by commas.

a = 3, 4, 6 # a is a tuple
print(a)
b, c, d = a # unpacking the values of a to b, c and d
print(f"{b=} {c=} {d=}") # printing the names and values of b, c and d
(3, 4, 6)
b=3 c=4 d=6

An example function which returns multiple values is shown below.

def giveNumbers():
    return 1, 2, 3, 4

a, b, c, d = giveNumbers()
print(c)
3

A function can also return nothing! A function that ends without a return statement returns the special value None, which represents “no value”.

def giveNothing():
    print("I ran!")
    
a = giveNothing()
print(a)
I ran!
None

We can expand our function from earlier to now also return the percentage of words in the text which are the given word.

def word_occurence(filename, word):
    ifile = open(filename)
    text = ifile.read()

    # making the text easy to work with
    for sign in ["?", "!", ".", "-", "\"", ","]:
        text = text.replace(sign, "") # removing unwanted characters
    text = text.lower() # we do not care about the case
    words = text.split() # making a list of words

    word_count = 0
    for w in words:
        if w == word:
            word_count += 1
            
    return word_count, word_count / len(words) * 100 # also returns the percentage of occurrence!
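To try out a function with two return values without needing a text file, the sketch below uses a simplified variant that takes the text itself as its first argument. The name count_and_percentage and the sample sentence are assumptions made for this example.

```python
def count_and_percentage(text, word):
    # Hypothetical variant of word_occurence that takes the text itself, not a filename
    words = text.lower().split()
    word_count = 0
    for w in words:
        if w == word:
            word_count += 1
    # The percentage is computed from the number of words, not the number of characters
    return word_count, word_count / len(words) * 100

count, percentage = count_and_percentage("The pope met the prince and the pope left", "pope")
print(f"{count} occurrences, {percentage:.1f} % of the words")  # 2 occurrences, 22.2 % of the words
```

The two returned values are unpacked directly into count and percentage. Note that an empty text would lead to a division by zero, which this sketch does not handle.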

In-class exercises:

  1. Write a function that takes a string as input, removes all the vowels and returns the resulting string.

  2. Write a function that takes a text file as input argument and returns a list containing the words of the text. Hint: You can use part of the code in the definition of word_occurence. Test the function on a sample text file.

  3. How well does your function approximate the linguistic notion of a word? Do you observe any mismatches?


Character encoding

A computer only deals with numbers. Therefore, letters and other characters of a text are saved as numbers by the computer. How these characters are saved differs between languages, and a given character may be represented by different numbers in various languages. This sometimes leads to confusion when dealing with texts in multiple languages. Furthermore, some languages include different characters than others. For example, the letters “æ”, “ø” and “å” in Norwegian do not occur in the English language, and will need a set of numbers assigned to them that is not needed to save an English text.
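The numbers behind the characters can be inspected with Python's built-in functions ord and chr. The small sketch below prints the number Python uses for three characters, and then turns each number back into its character. The choice of characters is arbitrary.

```python
for character in ["a", "A", "å"]:
    number = ord(character)                # from character to number
    print(character, number, chr(number))  # chr goes the other way, from number to character

# a 97 a
# A 65 A
# å 229 å
```

Notice that upper and lower case letters have different numbers, which is one reason why "Pope" and "pope" are different strings to the computer.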

Unicode

The assigning of numbers to letters is called encoding, and texts in different languages typically have different encodings. Examples of character encodings are ASCII and Latin-1. ASCII is the character encoding set typically used for English texts. Latin-1 is a character encoding set containing the letters “æ”, “ø” and “å”, and can therefore be used to translate Norwegian texts into numbers on the computer.

If English were the only language in the world, ASCII would suffice as the only character encoding set. However, this is not the case. The need for a common character encoding set for the majority of the world’s languages has resulted in a character encoding system called Unicode. Unicode includes characters from almost all of today’s written languages, and has room for over a million code points (numbers that can be assigned to characters). Thus almost all texts, independent of which language they are written in, can be translated into Unicode.

The process of translating a text of a specific encoding into Unicode is called decoding. In order to make a correct translation, the computer needs to know which encoding the original text is saved in. If the text is encoded in Latin-1 (for example because it is written in Norwegian), this must be given to the computer manually. Then the computer will know exactly how to translate the Latin-1 character encoding into Unicode.

UTF-8 is a character encoding set that can represent the full range of characters in Unicode. It is thus suited to a variety of languages, Norwegian included, and is advisable to use when multiple languages are represented in the same text.
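Encoding and decoding can be tried out directly on a Python string with the string method encode and the bytes method decode. The sketch below encodes the three Norwegian letters in both UTF-8 and Latin-1, and then shows what happens when the bytes are decoded with the wrong encoding. The variable names are chosen for this example.

```python
text = "æøå"
utf8_bytes = text.encode("utf8")       # from string to bytes, using UTF-8
latin1_bytes = text.encode("latin-1")  # from string to bytes, using Latin-1

# Three characters take six bytes in UTF-8, but only three bytes in Latin-1
print(len(text), len(utf8_bytes), len(latin1_bytes))  # 3 6 3

print(utf8_bytes.decode("utf8"))     # correct encoding: the original text comes back
print(utf8_bytes.decode("latin-1"))  # wrong encoding: no error, but the text is garbled
```

The last line is the typical symptom of a wrong encoding: the program runs without complaint, but the special characters come out as pairs of unrelated symbols.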

The process of translating a given text file into Unicode is shown in the next section.

Reading a text from the Web

Sometimes we wish to read a text from a web-page without having to download the whole text. In the following code we import the text “Teaching of the Twelve Apostles” by Roswell D. Hitchcock and Francis Brown from the free online text collection Project Gutenberg (http://www.gutenberg.org/catalog/). This text is written in Greek.

from urllib import request
url = "http://www.gutenberg.org/files/42053/42053-0.txt" # the URL of the web page containing the text
webpage = request.urlopen(url) # open the web page at the given URL
raw = webpage.read().decode("utf8") # read the content of the web page and decode it from UTF-8

print(f"Type of file: {type(raw)} \n")
print(f"Number of characters in file: {len(raw)} \n")
print(f"Printing a set of characters from the text:\n \"{raw[10000:10400]}\"")
Type of file: <class 'str'> 

Number of characters in file: 65603 

Printing a set of characters from the text:
 "εῖαι, ἁρπαγαί, ψευδο-
μαρτυρίαι, ὑποκρίσεις, διπλοκαρδία, δόλος, ὑπερ-
ηφανία, κακία, αὐθάδεια, πλεονεξία, αἰσχρολο-
γἰα, ζηλοτυπία, θρασύτης, ὕψος, ἀλαζονεία·
διῶκται ἀγαθῶν, μισοῦντες ἀλήθειαν, ἀγαπῶν-
120
τες ψεῦδος, οὐ γινώσκοντες μισθὸν δικαιο-
σύνης, οὐ κολλῶμενοι ἀγαθῷ οὐδὲ κρίσει δι-
καίᾳ, ἀγρυπνοῦντες οὐκ εἰς τὸ ἀγαθόν, ἀλλ᾿
εἰς τὸ πονηρόν· ὧν μακρὰν πραΰτης καὶ ὑπο-
μονή, μάται"

When defining the variable raw, we told the computer that the text it is dealing with is encoded in UTF-8. Without this call, raw would hold undecoded bytes rather than a string, and the Greek characters would not be displayed correctly. You may try this yourself by copying the code above and removing .decode("utf8").


In-class exercises:

In the following exercises use the function word_occurence defined above.

  1. Read the text file “mitthjerte.txt”, containing the poem “Mitt Hjerte” by Jens Bjørneboe, using the built-in open function in Python. Use the argument encoding="utf8" in the open function, and find the occurrence of the words “hjerte”, “foreldreløst”, “gård”, “far” and “mor”. Now use the argument encoding="Latin2" instead and find the occurrence of the same set of words. Do you see any difference in the occurrence of the words, and if so, why?

  2. Read the book “The Picture of Dorian Gray” by Oscar Wilde from the web page http://www.gutenberg.org/cache/epub/174/pg174.txt. Find the occurrence of the words “the”, “Dorian”, “man” and “beauty”.


Dictionaries

A list in Python can be thought of as a mapping between integers, making up the list indices, and each element of the list. Similarly, a dictionary can be viewed as a generalisation of a list in the sense that the elements can be indexed by any immutable data type in Python. In other words, the indices of a dictionary are not restricted to being numbers, but can be any immutable Python object, such as a string, a number or a tuple. These “indices” are referred to as keys. In this course, strings will typically serve as dictionary keys.
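The following sketch puts a list and a dictionary side by side, and also tries out which kinds of objects work as keys. The word lists and coordinates are made up for the example, and try/except is only used to keep the program running after the expected error.

```python
words = ["hjerte", "far", "mor"]                                 # list: indexed by integers
english = {"hjerte": "heart", "far": "father", "mor": "mother"}  # dictionary: indexed by keys

print(words[0])           # hjerte
print(english["hjerte"])  # heart

coordinates = {(59.9, 10.7): "Oslo"}  # a tuple is immutable, so it can be used as a key
print(coordinates[(59.9, 10.7)])      # Oslo

try:
    broken = {[59.9, 10.7]: "Oslo"}   # a list is mutable, so it cannot be used as a key
except TypeError as error:
    print("TypeError:", error)
```

The exact wording of the error message depends on the Python version, but it is always a TypeError.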

Say we want to make a mapping between the following countries and capitals:

“Norway” → “Oslo”
“Denmark” → “Copenhagen”
“Sweden” → “Stockholm”

In the format of a dictionary, this mapping is neatly represented as

d = {"Norway":"Oslo", "Denmark":"Copenhagen", "Sweden":"Stockholm"}
print(d)
{'Norway': 'Oslo', 'Denmark': 'Copenhagen', 'Sweden': 'Stockholm'}

Dictionaries are represented with curly brackets, {}, and keys and values (elements) are coupled in pairs by a colon, :. The key is positioned to the left of the colon, and the value to the right. Finding elements in the dictionary can now be done by using the keys as indices. For example, if we want to know which city is mapped together with the country “Denmark”, we can write

d["Denmark"]
'Copenhagen'

If we want to extend this dictionary with the capital city of Finland, we can write

d["Finland"] = "Helsinki"
print(d)
{'Norway': 'Oslo', 'Denmark': 'Copenhagen', 'Sweden': 'Stockholm', 'Finland': 'Helsinki'}
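Looking up a key that is not in the dictionary raises a KeyError, so it is often useful to check whether a key exists before using it. A minimal sketch, using the keyword in and a country that is deliberately missing from the dictionary:

```python
d = {"Norway": "Oslo", "Denmark": "Copenhagen", "Sweden": "Stockholm"}

print("Denmark" in d)  # True
print("Iceland" in d)  # False

if "Iceland" in d:
    print(d["Iceland"])
else:
    print("Iceland is not in the dictionary")  # this branch runs

d["Iceland"] = "Reykjavik"  # adding the missing key
print(len(d))               # 4
```

The in test looks at the keys only, so "Oslo" in d would be False even though "Oslo" is one of the values.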

If we want to list the keys in the dictionary, we can use the function keys in the following way

d.keys()
dict_keys(['Norway', 'Denmark', 'Sweden', 'Finland'])

The elements of the dictionary can similarly be found by making use of the function values.

d.values()
dict_values(['Oslo', 'Copenhagen', 'Stockholm', 'Helsinki'])

The keys and values are not returned as lists, but as the Python data types dict_keys and dict_values. Thus, to put the keys and the values in two separate lists, one must use the list function.

key_list = list(d.keys())
print("List of keys:\n", key_list, "\n\nStructure of key_list:\n", type(key_list))
List of keys:
 ['Norway', 'Denmark', 'Sweden', 'Finland'] 

Structure of key_list:
 <class 'list'>

Another possible way to create dictionaries in Python is to use the built-in function dict. Making the same dictionary as above with this approach is done in the following manner:

d_alt = dict(Norway="Oslo", Denmark="Copenhagen", Sweden="Stockholm", Finland="Helsinki")
print(d_alt)

Notice that the syntax of the keys and values differs in the dict-approach from the approach using curly brackets to make dictionaries. In this case, the keys are not written in string-format, but will automatically be translated to strings. Additionally, the =-sign serves as the link between the key and its corresponding value.
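A quick check that the two approaches give the same dictionary is sketched below. It also shows one limitation of the keyword form: a key such as "New Zealand" contains a space and cannot be written as a keyword, but dict also accepts a list of (key, value) pairs. The variable name d_pairs is chosen for this example.

```python
d = {"Norway": "Oslo", "Denmark": "Copenhagen"}
d_alt = dict(Norway="Oslo", Denmark="Copenhagen")
print(d == d_alt)  # True

# Keys that are not valid Python names can be given as a list of pairs instead
d_pairs = dict([("New Zealand", "Wellington"), ("Norway", "Oslo")])
print(d_pairs["New Zealand"])  # Wellington
```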

There are numerous operations that can be done on dictionaries, many of which coincide with the well-known list-operations in Python. For example, one can loop over the elements in a dictionary, as shown below

for key in d:
    print(f"The capital city of {key} is {d[key]}")
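An alternative way of writing such a loop, shown here as a sketch, is to use the dictionary method items, which gives the key and the value of each pair at the same time. The loop variable names country and capital are chosen for this example.

```python
d = {"Norway": "Oslo", "Denmark": "Copenhagen", "Sweden": "Stockholm"}

lines = []
for country, capital in d.items():  # each pair is unpacked into two variables
    lines.append(f"The capital city of {country} is {capital}")

for line in lines:
    print(line)

# The capital city of Norway is Oslo
# The capital city of Denmark is Copenhagen
# The capital city of Sweden is Stockholm
```

The unpacking in the loop header works in the same way as the tuple unpacking used for functions with multiple return values.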

If we want to sort the keys in alphabetical order, we can make use of the function sorted. This function makes a list of the dictionary’s keys in alphabetical order.

alph_keys = sorted(d)
print(alph_keys)

In-class exercises:

  1. Make a program that takes a writer as user-input, and gives out a famous text written by the writer as an answer. Organize the pairs of writers and texts in a dictionary, as in the example above. The set of writers and texts is given in the list below. If the writer given as input is not in this list, the program should return a message telling which writers one can choose between.

    • Fitzgerald - The Great Gatsby

    • Hugo - Les Misérables

    • Ibsen - Et dukkehjem

    • Dostoyevsky - The Brothers Karamazov

  2. Dictionaries are a flexible data structure that can contain many different types of data. Consider the following data and determine what the type of the keys and the values should be in each case. Construct a sample toy dictionary that contains data of the correct type (two entries are enough):

    • counts of the frequency of different words in a text

    • the ten biggest cities of each country in the world

    • the population of the US states