Data Processing and Statistics

Learning goals:

After this week, the student will be able to:

  • Read and write files using Python

  • Calculate the expected value, variance and standard deviation of a set of numbers

  • Write programs using user input

  • Solve decision problems using expected values

Statistical measures

Before the lecture you should read these excerpts covering An Introduction to Expected Value and Decision.

The lecture will focus on how we estimate expected value from measurements, so you should first have a grasp on how we find the expected value exactly when we know the underlying distribution, and how we can reason using expected values.
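
As a small illustration of finding an expected value exactly when the distribution is known, the sketch below computes the expected value of one roll of a fair six-sided die. The die and the variable names are assumptions chosen for illustration; they are not taken from the reading.

```python
from fractions import Fraction

# Assumed example: a fair six-sided die, where each outcome has probability 1/6.
outcomes = [1, 2, 3, 4, 5, 6]
probability = Fraction(1, 6)

# The expected value is the sum of each outcome times its probability.
expected_value = sum(outcome * probability for outcome in outcomes)

print(f"Expected value of one roll: {float(expected_value)}")
```

Note that 3.5 is not a value the die can ever show, which is a first hint that the expected value is an average and not the most likely outcome.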

Reading and Writing Text Files

The standard way to access text files in Python is using the built-in function open. We will use two of its arguments: the first argument is the name of the file, and the second argument specifies whether the file is to be read or written to. If the second argument is "r", it means we want to open the file for reading, whereas giving "w" means we want to open the file for writing. If some text is to be appended to the contents of an already existing file, we can use "a" as the second argument. Giving "w+" as the second argument to open will tell the machine that the file is to be both written to and read. The default option for the second argument is "r", so if we don’t provide a second argument to open, Python will assume that we want to be in read-mode.
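
The modes described above can be tried out on a small throwaway file. The sketch below is only an illustration: the file name modes_demo.txt and the use of a temporary folder are assumptions made so that no existing file on your computer is touched.

```python
import os
import tempfile

# Work in a temporary folder so no existing files are touched.
folder = tempfile.mkdtemp()
filename = os.path.join(folder, "modes_demo.txt")  # hypothetical file name

ofile = open(filename, "w")   # "w" creates the file (or empties an existing one)
ofile.write("first line\n")
ofile.close()

ofile = open(filename, "a")   # "a" keeps the old contents and adds to the end
ofile.write("second line\n")
ofile.close()

ifile = open(filename)        # no second argument, so the mode is "r"
content = ifile.read()
ifile.close()

print(content)
```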

Writing Text to a File

Writing to a text file in Python can be done by using the built-in function open, and giving the string "w" as the second argument to the open-function. In the following example, a file with the name textfile.txt is created. The file name extension .txt is customary for text files, but we could in principle give the file any extension we want.

ofile = open("textfile.txt", "w")

The command above creates a file on the computer, called textfile.txt, and creates a File object that we call ofile. File objects are what we use in Python to handle files. We want to proceed by filling this file with some information. Below we fill the file with the sentence “This is line number” and then the number of the line. This is done by creating a string containing the sentence, and writing this string to the file textfile.txt through the write-function on the File-object, as shown below.

for i in range(10):
    ofile.write(f"This is line number {i + 1}\n")

When we are finished writing to the file, we close it using the close-function on the File object.

ofile.close()

This makes sure that everything we wanted to write to the file is actually written there, and it can prevent slowdown caused by keeping very large files open after they are no longer needed. This is similar to ejecting a USB stick in the operating system before physically detaching it from the computer.
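
Python also offers the with statement, which closes the file for us when the indented block ends. It is not used elsewhere in this text, so the sketch below is only an alternative way to write the same ten lines; the temporary folder is an assumption made to avoid overwriting an existing textfile.txt.

```python
import os
import tempfile

folder = tempfile.mkdtemp()
filename = os.path.join(folder, "textfile.txt")

# The file is closed automatically when the indented block ends.
with open(filename, "w") as ofile:
    for i in range(10):
        ofile.write(f"This is line number {i + 1}\n")

print(ofile.closed)
```

The attribute closed on the File object tells us whether the file has been closed, so no explicit call to close is needed here.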

Reading Text from a File

If we want to analyze texts, we usually want to read files, rather than writing them. To read files, we use the built-in Python function open, with "r" as the second argument, to create a File-object that reads a particular file. We can use the read-function on the File-object to get the entire contents of the file returned as a single string. This is shown in the example below, which opens the file that we just created in the last section.

ifile = open("textfile.txt", "r")
content = ifile.read()
print(content)

ifile.close()
This is line number 1
This is line number 2
This is line number 3
This is line number 4
This is line number 5
This is line number 6
This is line number 7
This is line number 8
This is line number 9
This is line number 10

If you only want to read a single line, you can use the readline function.

ifile = open("textfile.txt", "r")
content = ifile.readline()
print(content)
This is line number 1

The File-object reads from the beginning and keeps track of how far it has read, so each new call continues where the previous one stopped.

print(ifile.readline())
print(ifile.read())
print("The file has reached the end, so nothing will be printed below")
print(ifile.read())

ifile.close()
This is line number 2

This is line number 3
This is line number 4
This is line number 5
This is line number 6
This is line number 7
This is line number 8
This is line number 9
This is line number 10

The file has reached the end, so nothing will be printed below
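
The reading position only moves forward as we read. If the same text is needed again, one option is to open the file again; another is the seek function on the File object, which moves the reading position. The sketch below shows the second option on a small file whose name and contents are made up for the illustration.

```python
import os
import tempfile

folder = tempfile.mkdtemp()
filename = os.path.join(folder, "short.txt")  # hypothetical file name

with open(filename, "w") as ofile:
    ofile.write("line 1\nline 2\nline 3\n")

ifile = open(filename, "r")
first = ifile.readline()    # reads the first line, including its newline
rest = ifile.read()         # reads everything after the first line
at_end = ifile.read()       # the end is reached, so this is an empty string

ifile.seek(0)               # move the reading position back to the start
again = ifile.readline()    # the first line can now be read once more
ifile.close()

print(repr(first), repr(rest), repr(at_end), repr(again))
```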

When we ask Python to open the file textfile.txt, Python looks for this file in the folder we are running Python from. We typically run from the folder where the .py file is saved. Therefore, we usually want to have the text file in the same folder as the Python program reading it. If your text file is close to your Python script, but not in the same folder, you can use the relative path to the file to open it. For instance: Poems/textfile.txt.

If the text file is somewhere completely different, you can tell the computer explicitly where to look for it. If the file "textfile.txt" lies in the Documents-folder on your computer, you can (on macOS; on Linux the path typically starts with /home instead of /Users) use the absolute path "/Users/username/Documents/textfile.txt" as the first argument of the open-function. Here, username is the name of the user on the computer you are using.
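
Paths can also be built with the pathlib module from the standard library instead of writing them out as one string. This is a minimal sketch and not something the rest of the text relies on; the folder names are the ones from the examples above, and username is still a placeholder.

```python
from pathlib import Path

# Relative path: a folder named Poems next to the Python program.
relative_path = Path("Poems") / "textfile.txt"

# Absolute path; "username" is a placeholder for your own user name.
absolute_path = Path("/Users/username/Documents") / "textfile.txt"

print(relative_path.as_posix())
print(relative_path.is_absolute())
print(absolute_path.name)
```

A Path object can be given directly as the first argument to open, in the same way as a string.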


In-class exercises

a) Read the file zarathustra.txt from the course webpage. Use the split function to divide the text into a list of words, and then count the number of occurrences of the name “Zarathustra”. Note that “Zarathustra” and “zarathustra” are regarded as different words by the computer.

b) Read the file prince.txt from the course webpage. Use various string operations to clean the text in the following way

  1. Make all letters lowercase

  2. Remove punctuation marks such as . , : ; so that they don’t occur at the end of words.

  3. Split into a list of words

  4. Count the number of occurrences of the following words: “prince”, “dante” and “pope”


Working with data

We have now learned how to read text from a file into Python. Next we need to actually do something useful with it.

Given a text, we might be interested in how long the words are, and whether the length differs between books written for adults and books written for children. Are words typically longer in Norwegian than in English? Has it changed over time? How diverse are the words? How long are the sentences?

In this section, we will go through how to calculate the average of a list of numbers, and later the standard deviation.

First, we get hold of a list of words.

ifile = open("assets/prideandprejudice.txt", "r", encoding="utf-8") # the text file is found on the course website under "Extra Resources"
text = ifile.read()
text = text[text.find("#START") + 6:] # removing all text before the actual contents of the book
# making the text easy to work with
for sign in ["?", "!", ".", "-", ",", "“", "”", "_", ":", "(", ")"]:
    text = text.replace(sign, " ") # removing unwanted characters
text = text.lower() # we do not care about the case
words = text.split() # making a list of words

print(words[:25])
['chapter', '1', 'it', 'is', 'a', 'truth', 'universally', 'acknowledged', 'that', 'a', 'single', 'man', 'in', 'possession', 'of', 'a', 'good', 'fortune', 'must', 'be', 'in', 'want', 'of', 'a', 'wife']
ifile.close()

Then we create a list containing the lengths of all the words.

from numpy import mean

lengths = []
for word in words:
    lengths.append(len(word))
print(mean(lengths))
4.421409125200905

Printing the first 25 numbers to make sure it worked:

print(lengths[:25])
[7, 1, 2, 2, 1, 5, 11, 12, 4, 1, 6, 3, 2, 10, 2, 1, 4, 7, 4, 2, 2, 4, 2, 1, 4]
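
The same kind of calculation can be done without NumPy. The sketch below uses a list comprehension and the built-in functions sum and len on the 25 words printed earlier; it is meant as an alternative way of writing the steps above, not as a replacement for them.

```python
# The first 25 words of the text, copied from the printed output earlier.
words = ['chapter', '1', 'it', 'is', 'a', 'truth', 'universally',
         'acknowledged', 'that', 'a', 'single', 'man', 'in', 'possession',
         'of', 'a', 'good', 'fortune', 'must', 'be', 'in', 'want', 'of',
         'a', 'wife']

# One length per word, built in a single line.
lengths = [len(word) for word in words]

# The average is the total divided by the number of values.
average = sum(lengths) / len(lengths)

print(f"Average length of these 25 words: {average:.2f}")
```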

The length of the first 1000 words

Here are the lengths of the first 1000 words if you need a list of numbers to do some statistics on.

lengths = [7, 1, 2, 2, 1, 5, 11, 12, 4, 1, 6, 3, 2, 10, 2, 1, 4, 7, 4, 2, 2, 4, 2, 1, 4, 7, 6, 5, 3, 8, 2, 5, 2, 4, 1, 3, 3, 2, 2, 3, 5, 8, 1, 13, 4, 5, 2, 2, 4, 5, 2, 3, 5, 2, 3, 11, 8, 4, 2, 2, 10, 3, 8, 8, 2, 4, 3, 2, 5, 2, 5, 9, 2, 4, 2, 6, 4, 3, 4, 2, 3, 3, 3, 4, 3, 5, 4, 11, 4, 2, 3, 2, 4, 2, 6, 7, 4, 2, 3, 3, 3, 2, 2, 8, 4, 3, 3, 4, 3, 4, 4, 4, 3, 3, 4, 2, 3, 5, 2, 2, 6, 4, 2, 6, 2, 3, 3, 4, 2, 4, 3, 3, 5, 2, 5, 3, 4, 11, 3, 4, 2, 4, 2, 3, 1, 4, 2, 9, 2, 7, 2, 4, 3, 10, 6, 3, 2, 4, 3, 4, 4, 3, 4, 4, 4, 11, 2, 5, 2, 1, 5, 3, 2, 5, 7, 4, 3, 5, 2, 8, 4, 2, 4, 4, 2, 6, 2, 1, 6, 3, 4, 2, 3, 3, 5, 3, 3, 2, 4, 9, 4, 2, 4, 2, 6, 4, 2, 6, 12, 4, 2, 2, 2, 4, 10, 6, 10, 3, 4, 2, 3, 8, 3, 2, 2, 2, 3, 5, 2, 3, 3, 2, 4, 4, 4, 2, 3, 4, 7, 2, 2, 7, 2, 6, 2, 6, 2, 4, 2, 2, 4, 1, 6, 3, 2, 5, 8, 4, 2, 4, 8, 1, 4, 4, 1, 4, 5, 3, 3, 5, 3, 2, 3, 3, 2, 6, 4, 2, 4, 2, 6, 7, 3, 4, 3, 3, 3, 2, 2, 8, 3, 4, 4, 4, 1, 2, 8, 2, 3, 8, 3, 2, 4, 2, 4, 3, 6, 2, 8, 4, 6, 8, 3, 3, 3, 4, 2, 3, 2, 2, 4, 6, 4, 2, 3, 4, 2, 4, 4, 3, 2, 4, 3, 9, 3, 4, 5, 3, 2, 4, 2, 2, 5, 1, 3, 2, 8, 3, 4, 3, 3, 3, 5, 3, 2, 2, 3, 3, 4, 4, 2, 10, 5, 7, 4, 2, 5, 6, 3, 2, 3, 3, 2, 8, 2, 3, 2, 4, 2, 7, 3, 4, 3, 3, 4, 2, 3, 5, 2, 4, 3, 7, 2, 1, 9, 4, 3, 2, 5, 2, 6, 3, 1, 2, 3, 7, 2, 2, 8, 13, 3, 4, 1, 5, 3, 4, 5, 2, 9, 3, 5, 2, 4, 4, 8, 2, 3, 3, 6, 2, 4, 5, 1, 5, 3, 3, 5, 4, 6, 2, 5, 2, 3, 2, 4, 3, 4, 6, 2, 3, 3, 2, 7, 4, 2, 5, 4, 3, 13, 2, 2, 4, 4, 1, 6, 3, 1, 6, 3, 3, 8, 4, 9, 4, 5, 4, 2, 13, 2, 5, 2, 3, 3, 2, 4, 3, 7, 3, 4, 5, 3, 10, 2, 2, 6, 2, 4, 7, 3, 2, 7, 3, 4, 4, 5, 2, 9, 6, 3, 4, 2, 3, 2, 4, 2, 10, 3, 2, 2, 5, 3, 2, 3, 2, 3, 3, 3, 4, 10, 6, 1, 4, 3, 2, 7, 4, 2, 4, 4, 2, 3, 4, 3, 1, 4, 4, 1, 3, 5, 2, 3, 2, 6, 3, 2, 2, 6, 7, 2, 3, 8, 9, 2, 7, 2, 3, 6, 6, 1, 4, 5, 2, 1, 4, 4, 3, 2, 6, 5, 1, 6, 3, 4, 2, 2, 4, 5, 5, 2, 3, 1, 3, 6, 4, 3, 7, 3, 1, 2, 4, 3, 2, 3, 4, 2, 8, 2, 4, 3, 4, 2, 4, 8, 2, 5, 3, 3, 3, 6, 6, 3, 3, 10, 4, 4, 4, 2, 4, 4, 2, 9, 4, 7, 3, 4, 3, 3, 5, 3, 8, 4, 5, 6, 3, 5, 3, 9, 4, 2, 9, 4, 3, 7, 2, 6, 3, 
3, 3, 5, 4, 3, 8, 2, 4, 1, 3, 3, 4, 7, 2, 6, 2, 3, 4, 2, 10, 3, 2, 4, 6, 3, 7, 2, 2, 4, 1, 4, 1, 4, 7, 3, 4, 6, 4, 3, 2, 3, 7, 1, 4, 5, 3, 7, 4, 4, 13, 5, 4, 6, 5, 2, 5, 2, 3, 2, 3, 4, 4, 1, 6, 3, 1, 4, 3, 4, 3, 4, 2, 3, 4, 2, 3, 4, 5, 3, 2, 4, 8, 1, 4, 4, 4, 3, 13, 2, 4, 2, 2, 3, 2, 2, 2, 6, 4, 6, 4, 5, 3, 4, 3, 5, 4, 6, 4, 2, 2, 4, 4, 4, 5, 3, 6, 1, 4, 5, 4, 3, 2, 6, 3, 2, 3, 1, 7, 2, 5, 5, 9, 6, 7, 3, 7, 4, 3, 10, 2, 5, 3, 6, 5, 3, 4, 12, 2, 4, 3, 4, 10, 3, 9, 3, 4, 3, 4, 9, 2, 7, 3, 3, 1, 5, 2, 4, 13, 6, 11, 3, 9, 6, 4, 3, 3, 12, 3, 7, 7, 7, 3, 8, 2, 3, 4, 3, 2, 3, 3, 9, 8, 3, 6, 3, 8, 3, 4, 7, 1, 2, 6, 3, 5, 3, 8, 2, 5, 3, 6, 2, 2, 7, 2, 3, 6, 8, 2, 5, 3, 6, 2, 3, 4, 6, 8, 3, 4, 4, 2, 6, 3, 3, 3, 4, 3, 7, 5, 3, 5, 3, 4, 3, 3, 2, 9, 2, 2, 2, 3, 4, 9, 2, 3, 9, 6, 9, 3, 6, 8, 8, 2, 8, 1, 3, 2, 8, 9, 3, 4, 1, 4, 2, 7, 4, 4, 2, 5, 2, 3, 3, 2, 1, 3, 2, 4, 4, 2, 7, 5, 4, 3, 6, 11, 5, 2, 3, 3, 2, 5, 3, 3, 6, 5, 4, 9, 4, 2, 5, 4, 3, 2, 3, 10, 3, 4, 3, 4, 8, 2, 9, 3, 1, 2, 3, 7, 3, 4, 4, 2, 3, 4, 5, 3, 3, 3, 6, 2, 3, 3, 3, 2, 1, 7]


Calculating Expected Value

Before we analyze the lengths of words, we will look at an example which better illustrates how the expected value and the standard deviation are calculated from a sample of a larger dataset.

One of the things we can infer from looking at data is the expected value of an event, in this case the expected age of death of a newborn.

To find the expected value of an event from data, we take the average of the values observed.

ages = [2, 11, 23, 64, 89, 91, 10, 20, 95, 8, 36, 100, 84, 62, 6, 16, 7, 89, 19, 62, 19, 27, 94, 80, 4, 21, 68, 97, 64, 72]

n = len(ages) #The number of datapoints

total = 0 #We tally up the total to calculate the average at the end

for age in ages:
    total = total + age
    
EV = total / n

print(f"The expected age is {EV:.1f} years") #:.1f formats the number with a specified number of decimals, here 1
The expected age is 48.0 years

The data shows that the life expectancy of this population is 48 years. However, this does not mean that a person will most likely live to be 48 years old, only that the average age of many people will be around 48 years. If you look at the data, you can see that not a single person died in their 40s or 50s, which shows that the expected value and the most likely value are two very different things.

This behaviour is somewhat typical of life expectancy data. Due to high child mortality rates in developing countries, the life expectancy might be very low, while adults becoming old is still quite common. You see this in Sierra Leone where the life expectancy at birth is 52.5 years, while at 5 years old it is 59.7, a much larger increase than you see in Norway, where it goes from 80.6 to 80.8. (Source: worldlifeexpectancy.com)

Keep in mind that we only found the expected value of our sample. The expected value of the actual population can be slightly different. The larger the sample, the more certain we can be that the values we calculate are actually close to the actual values for the population.
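
The effect of sample size can be tried out with a small simulation. The sketch below is an assumption-laden illustration and does not use the age data: it simulates rolls of a fair six-sided die, where the exact expected value is known to be 3.5, and compares sample averages for three sample sizes. The function name sample_mean and the seed are made up for the example.

```python
import random

random.seed(1)  # fixed seed so that the run is repeatable

def sample_mean(sample_size):
    """Average of sample_size simulated rolls of a fair six-sided die."""
    rolls = [random.randint(1, 6) for _ in range(sample_size)]
    return sum(rolls) / sample_size

# The exact expected value of one roll is 3.5.
for sample_size in [10, 1000, 100000]:
    print(f"n = {sample_size:6d}: sample average = {sample_mean(sample_size):.3f}")
```

The average of the largest sample should typically land closest to 3.5, although any single small sample can happen to be close by chance.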


In-Class Expected Value Exercises

Alice and Bob both offer you what they call a great opportunity to get rich quick. They say that by giving them 10 dollars, you have a good chance of making large gains. Below are example gains from people taking them up on their offers in the past:

alice = [-10, -10, -10, -10, -10, -10, -10, 1000, -10]
bob = [-10, 200, -10, 200, -10, 200, 200, 50, 50]

a) Whose offer has the highest expected gains?

b) How would you describe what you can expect when taking them up on their offer? Does anything but the expected value matter?

c) Calculate the average word length of Pride and Prejudice. Calculate the average word length of The Three Little Pigs text. The text files can be found on the course website under “Extra Resources”.


Calculating Variance and Standard Deviation

To describe this large spread in ages of death we can calculate the variance of the data. To do this we take the average of the squared differences from the expected value (not as complicated as it sounds!). We could have just looked at the average absolute difference from the expected value, but it is often of more interest to make large deviations from the expected value result in very large changes in the variance.

When the data we have is only a sample of the total population, the variance tends to be too small compared to the actual variance of the total population. To account for this, we divide by \(n - 1\) instead of \(n\) when taking the average. The standard deviation is the square root of the variance, and is the value most often used to describe the amount of dispersion in the data.
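
Written as formulas, the description above corresponds to the following. The symbols are assumptions introduced here for illustration: \(x_1, \dots, x_n\) are the \(n\) observed values, \(\bar{x}\) is their average, \(s^2\) is the sample variance and \(s\) is the sample standard deviation.

```latex
\[
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i ,
\qquad
s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2 ,
\qquad
s = \sqrt{s^2}
\]
```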

Fig. 1 Example of calculation of sample variance and standard deviation

total = 0

for age in ages:
    squaredDifference = (age - EV)**2
    total = total + squaredDifference
    
variance = total / (n - 1)

print(f"The variance is {variance:.1f} years squared")
The variance is 1247.6 years squared

This number is very large and not very intuitive (it has units of years squared!). The standard deviation is the square root of the variance, and is often much more useful.

std = variance**0.5

print(f"The standard deviation is {std:.1f} years")
The standard deviation is 35.3 years

This number describes how spread out our data is. If all of the ages were in the range 60-80, the standard deviation would be much smaller.
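
To see the difference in numbers, the sketch below compares the ages from the example with a made-up list of ages that all lie between 60 and 80. It uses the function stdev from the statistics module in the standard library, which also divides by \(n - 1\); the list narrow_ages is hypothetical data invented for the comparison.

```python
from statistics import stdev

# The ages from the example above.
ages = [2, 11, 23, 64, 89, 91, 10, 20, 95, 8, 36, 100, 84, 62, 6, 16, 7,
        89, 19, 62, 19, 27, 94, 80, 4, 21, 68, 97, 64, 72]

# Hypothetical data: every whole age from 60 to 80.
narrow_ages = list(range(60, 81))

print(f"Standard deviation of the example ages: {stdev(ages):.1f} years")
print(f"Standard deviation of the narrow ages:  {stdev(narrow_ages):.1f} years")
```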


In-Class Variance and Standard Deviation Exercises

You have contracted a deadly disease and have 12 months left to live. There is an operation which has a chance to greatly increase your remaining time here on earth.

This is how long people in your exact position lived after taking the operation:

monthsLeft = [0, 30, 2, 0, 1, 0, 412, 0, 2, 1, 0, 5, 0, 12, 44, 0, 0, 0, 5, 0, 1, 0, 0, 0, 203, 40, 6, 0, 0, 3, 2, 0, 2, 0, 0, 5]

a) Calculate the expected value of remaining months if you take the operation.

  • Should you take the operation?

b) Calculate the sample variance of the remaining months if you take the operation.

c) Calculate the standard deviation of the remaining months if you take the operation.

  • Is the operation risky?

d) How large a percentage died within 0 months of taking the operation?

e) Calculate the variance of the word lengths of a text.


User Input - Communicating with your program

Python has a built-in function called input() which prints a message to the terminal and then returns whatever the user writes in the terminal as a string.

name = input("What is your name?: ")
print("Hello " + name)
What is your name?:  Karl Henrik
Hello Karl Henrik

If you want to input a number and then use it for calculations, you need to turn the string into an integer or float first.

Type conversion

You can turn one type of variable into another type like this:

myString = "15"
myInt = int(myString) #The string has to contain a whole number! "15.2" would not work
myString2 = str(myInt)
myFloat = float(myInt)
myFloat2 = float(myString) #The string has to contain only a number! "Hey 24.4" would not work, "24.4" would

Turning a float into an integer can often be useful, because some functions will not accept a float as an input, even if it is a whole number.
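
One function that only accepts integers is range. The sketch below shows this case, together with two details of int that are easy to miss; the variable names are made up for the illustration.

```python
wholeFloat = 3.0

# range() only accepts integers, so range(wholeFloat) would raise a TypeError.
count = int(wholeFloat)
for i in range(count):
    print(i)

# int() cuts off the decimals of a float instead of rounding.
truncated = int(15.9)
print(truncated)

# A string containing a decimal number has to go through float() first.
fromString = int(float("15.2"))
print(fromString)
```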

To input a number and save it as an integer, you could do something like this:

numberAsString = input("How old are you?")

number = int(numberAsString)
number = number + 5

print(f"In five years, you will be {number} years old")
How old are you? 23
In five years, you will be 28 years old

User input can quickly make a program take much longer to write and test. If you want to include it, you should wait until everything else works.
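
One way to keep testing manageable is to put the conversion in a function of its own and try it on fixed strings before connecting it to input(). The sketch below is one possible approach under that assumption; the function name parse_age is hypothetical, and try/except is used to catch the error that int raises for text that is not a whole number.

```python
def parse_age(text):
    """Return the text as an integer, or None if it is not a whole number."""
    try:
        return int(text)
    except ValueError:
        return None

# Fixed example strings are used here instead of input(), so the
# conversion can be tested without typing anything in the terminal.
for answer in ["23", " 23 ", "twenty", "23.5"]:
    print(repr(answer), "->", parse_age(answer))
```

Once the function behaves as expected, it can be called as parse_age(input("How old are you?")).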


In-Class User Input Exercises

a) Write a program that asks you for your name, and prints it out with a greeting.

b) Write a program that takes a number as a user input, converts it to an integer, and prints out twice the number.