• Last week we discovered functions
  • Functions group code so that it can be called by name
  • Functions are an essential part of every program.
  • Functions are defined with the def keyword
# This is the definition of a function
def my_function():
  print('Hello function world!')
# This is how you call my_function()
my_function()
  • Functions take arguments
    • Arguments pass information from the caller to the function
  • Functions return values
    • Return values pass information back from the function to the caller

Here's an example of a function with arguments:

def my_function(arg1, arg2, arg3):
  print("My arguments are:", arg1, arg2, arg3)
  return "value1", "value2"

return_val1, return_val2 = my_function('parameter_val1', 'parameter_val2', 'parameter_val3')

You should have a solid understanding of functions before you continue. If you're a little shaky on parameters and return values, be sure to get help. All of our assignments going forward will use functions.

Character Encoding

  • So far we've used strings extensively.
  • Strings can be literals.
    • "Hello World"
  • Strings can be stored in variables:
    • greeting = "Hello World"
  • When you print something using the print() function Python formats a string for the terminal output.
  • HTML documents are strings.
  • What we haven't talked about is text in other human languages (which, of course, is also stored in strings)
  • In this lecture you'll see what strings really are and how other human languages are represented as strings.

Using wget to Download Files

  • The wget utility can fetch a file from a URL from the command line.
  • wget exists on Windows, Mac and Linux
  • To download a file using wget simply call it from the command line with a URL.

Here's how to fetch the text file from today's lesson in your workspace:

$ wget 
  • wget places the file in the current directory.
  • Most early programming languages were written by English speakers.
  • Think about Python's keywords and functions:
    • print
    • import
    • def (short for define)
    • format
  • They're all English words!
  • But Guido van Rossum, the author of Python, is Dutch!
  • Python is a modern programming language
    • It has built-in support for text in languages other than English
    • Most languages created after early ones like C and C++ are built this way.
  • Computers only understand binary.
    • Binary is a counting system that uses only 1s and 0s.
    • Given that, computers need a way to understand letters.
  • Letters are encoded using bytes
    • A byte is 8 binary digits (or bits)
    • A byte can encode the numbers 0 through 255
    • 00000000 = 0
    • 00000001 = 1
    • 00000010 = 2
    • 00000011 = 3
    • 11111111 = 255
  • In order to get numbers from letters you need a coding system.
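  • The most prevalent coding system for computers is ASCII, which assigns the numbers 0 through 127 to English letters, digits, punctuation and control codes.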
  • The ord() and chr() functions can translate between letters and numbers.
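
The byte table above can be reproduced with Python's built-in format() and int() functions. This is a minimal sketch; the helper names to_binary and from_binary are made up for illustration and are not part of the lesson's files.

```python
def to_binary(number):
    """Return an 8-digit binary string for a value from 0 through 255."""
    return format(number, '08b')


def from_binary(bits):
    """Return the integer that a string of binary digits represents."""
    return int(bits, 2)


# Print the same rows as the byte table in the notes.
for number in (0, 1, 2, 3, 255):
    print(to_binary(number), '=', number)

# Reading a byte back gives the number it encodes.
print(from_binary('01100001'))  # 97
```

The value 97 is the same number that ASCII assigns to the letter a.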

Here's an example of using the ord() and chr() functions in Python:

>>> ord('a')
97
>>> chr(97)
'a'
  • The ord() function takes a letter and turns the letter into its numerical code
  • The chr() function does the opposite: it turns a numerical code back into its letter
  • There are far more than 256 characters across all human languages.
  • So people in other countries started making their own coding systems (like Big5 for Traditional Chinese)
  • The UTF coding systems are a worldwide standard for encoding characters from every language
    • UTF-8 uses 8-bit (one-byte) units and spends one to four of them on each character, giving it the ability to represent 1,112,064 characters.
    • UTF-16 uses 16-bit (two-byte) units. A single unit covers about 65 thousand characters and a pair of units covers the rest.
    • UTF-32 uses one 32-bit (four-byte) unit per character. Four bytes could hold 4 billion values, but only the same 1,112,064 characters are valid.
  • UTF supports emojis too!
  • ord() and chr() translate between one character and its number, but they don't produce bytes.
  • To turn a string into bytes and back you need the encode() and decode() functions.
  • The encode() function takes a string and encodes it as raw bytes:
>>> "Hello World".encode('ascii') 
b'Hello World'
>>> "Hello World".encode('utf-8') 
b'Hello World'
>>> "Hello World".encode('big5') 
b'Hello World'
>>> "Hello World".encode('utf-16') 
b'\xff\xfeH\x00e\x00l\x00l\x00o\x00 \x00W\x00o\x00r\x00l\x00d\x00'
>>> "Hello World".encode('utf-32') 
b'\xff\xfe\x00\x00H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00 \x00\x00\x00W\x00\x00\x00o\x00\x00\x00r\x00\x00\x00l\x00\x00\x00d\x00\x00\x00'

Notice that the ASCII, UTF-8 and Big5 encodings are all the same for this text!
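
The three encodings agree here because "Hello World" contains only ASCII characters. The sketch below shows how the byte counts change once other characters are involved. The encoded_sizes helper and the sample characters are assumptions chosen for illustration. The utf-16-le and utf-32-le codec names are used so that the byte-order mark (the leading \xff\xfe seen above) is not counted.

```python
def encoded_sizes(text):
    """Return how many bytes text needs in three UTF encodings."""
    # The -le codecs leave out the byte-order mark, so only the text is counted.
    names = ('utf-8', 'utf-16-le', 'utf-32-le')
    return {name: len(text.encode(name)) for name in names}


for character in ('a', 'é', '中', '😸'):
    print(character, encoded_sizes(character))

# Outside the ASCII range the coding systems stop agreeing.
print('中'.encode('utf-8') == '中'.encode('big5'))  # False
```

UTF-8 grows from one byte to four as the characters move further from ASCII, while UTF-32 always spends four.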

Bytes and Hexadecimal

  • We've used standard strings in Python.
    • Those are enclosed with one of the quote characters we've seen so far.
  • A raw-byte string starts with a b and a familiar quote syntax.
  • For example
    • b'Raw Bytes'
    • b"Raw Bytes"
    • b"""Raw Bytes"""
  • What happens when you have a value that doesn't correspond to a character?
  • You need hexadecimal
  • Hexadecimal is a counting system that's base 16
    • It's the same as decimal, except with the added digits:
      • a = 10
      • b = 11
      • c = 12
      • d = 13
      • e = 14
      • f = 15
  • Counting in hex is convenient because each hex digit is exactly four bits
  • A decimal digit is a non-integer number of bits, so it's hard to convert from binary to decimal and back.
  • You can ask for a particular byte value in a b-string using the \x00 syntax.
    • The two digits past the \x can be any hex value.
>>> b'\xF0\x9F\x98\xB8'.decode('utf-8')
'😸'

It's a smiling cat face! You can assign b-strings to variables.

>>> catface = b'\xF0\x9F\x98\xB8'
>>> catface
b'\xf0\x9f\x98\xb8'
>>> type(catface)
<class 'bytes'>
>>> catface.decode('utf-8')
'😸'
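
To connect the hexadecimal notes to the cat face bytes, here is a small sketch that spells out each byte in decimal, hex and binary. The describe_byte helper is a hypothetical name used only for this illustration.

```python
def describe_byte(value):
    """Return the decimal, hex and binary spellings of one byte value."""
    return value, format(value, '02x'), format(value, '08b')


catface = b'\xf0\x9f\x98\xb8'

# Looping over a bytes object yields integers from 0 through 255.
for value in catface:
    print(describe_byte(value))

# Two hex digits always describe exactly one byte.
print(int('f0', 16))   # 240
print(catface.hex())   # f09f98b8
```

Each hex digit lines up with four of the binary digits, so f0 is 1111 followed by 0000.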
  • Not all bytes are valid in all coding schemes.
  • Python raises an error when you decode bytes that are invalid in the coding scheme you chose.

The old ASCII coding scheme can't decode our cat face:

>>> catface.decode('utf-8')
'😸'
>>> catface.decode('ascii') 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position 0: ordinal not in range(128)
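
A program doesn't have to crash on a decoding error like the one above. The sketch below shows two common ways to cope, assuming you want either a None result or a visible placeholder. The safe_decode function is a made-up helper, while errors='replace' is a standard option of decode().

```python
def safe_decode(raw, encoding):
    """Try to decode raw bytes and return None instead of raising on failure."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


catface = b'\xf0\x9f\x98\xb8'

print(safe_decode(catface, 'utf-8'))
print(safe_decode(catface, 'ascii'))  # None

# errors='replace' swaps every undecodable byte for U+FFFD, the replacement character.
print(catface.decode('ascii', errors='replace'))
```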
  • Sometimes you can decode a byte string but the meaning is wrong.
  • For example, these four bytes only mean a cat face in UTF-8.

Look what happens when you decode the bytes using UTF-16:

>>> catface.decode('utf-16') 

That's not what we intended!
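
One way to see what went wrong is to look at the character numbers instead of the glyphs. This sketch uses the utf-16-le codec name, an assumption made so the byte order is fixed and the result is the same on every machine.

```python
catface = b'\xf0\x9f\x98\xb8'

# UTF-16 reads the four bytes as two 16-bit units, so two characters come out.
wrong = catface.decode('utf-16-le')
print(len(wrong))                     # 2
print([hex(ord(c)) for c in wrong])   # ['0x9ff0', '0xb898']

# UTF-8 reads the same four bytes as a single character.
right = catface.decode('utf-8')
print(len(right), hex(ord(right)))    # 1 0x1f638
```

The bytes never changed. Only the rule for grouping them into characters did.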

  • You can use encode() and decode() to translate between coding schemes.
  • What if we wanted the UTF-16 equivalent of our cat face?
>>> utf16_catface = '😸'.encode('utf-16')
>>> utf16_catface
b'\xff\xfe=\xd88\xde'

Or we could do this:

>>> b'\xf0\x9f\x98\xb8'.decode('utf-8').encode('utf-16')
b'\xff\xfe=\xd88\xde'
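
Since the decode-then-encode pattern comes up often, it can be wrapped in a function. This is a sketch only: transcode is a hypothetical helper name, and utf-16-le is used instead of utf-16 so the result carries no byte-order mark and does not depend on the machine.

```python
def transcode(raw, source, target):
    """Decode raw bytes from one coding scheme and re-encode them in another."""
    return raw.decode(source).encode(target)


utf8_catface = b'\xf0\x9f\x98\xb8'
utf16_catface = transcode(utf8_catface, 'utf-8', 'utf-16-le')
print(utf16_catface.hex())  # 3dd838de

# Going back the other way recovers the original bytes.
print(transcode(utf16_catface, 'utf-16-le', 'utf-8') == utf8_catface)  # True
```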