Wednesday, March 28, 2012

csci133c7.py

also known as csci133cleanup.py

In this tutorial we will write a program that cleans up a string, one of the most classic exercises. Almost every student is eventually handed a novel or some other input text file and asked to do something with the data. So the first step is to open and load the text file and copy only the English letters into a new string. This tutorial looks long because I included the full source code of every version, but each version makes only minor changes to the one before. Read on!
# Version 1 of csci133cleanup.py
# Full implementation of cleanup
wordList = [] # Create a list to store our words
abc = 'abcdefghijklmnopqrstuvwxyz'

with open('novel.text') as book:
    for line in book:
        cleanline = ''
        for character in line.lower():
            if character in abc:
                cleanline += character
            else:
                # Important! We have to append a space!
                cleanline += ' '
        for word in cleanline.split():
            if word not in wordList:
                wordList.append(word)
The first version only cleans up the text, so there is nothing too special about it. But notice that in the else branch we append a space to cleanline. Why? Take a moment to think about it, or try to clean 'Doctor--John' on a piece of paper.

Answer: We need the space to keep neighboring words apart. Take the string Doctor--John. If we did not append a space, the two dashes would simply vanish and we would get 'doctorjohn' as a single word (lowercase, because we call lower() first). Since we want every single word in the file, we want the words separated rather than fused together.
# without the appended space: Doctor--John becomes doctorjohn
# with the appended space: Doctor--John becomes doctor  john (YES!)
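The difference is easy to check in a few lines. This is only a small sketch: the helper name clean and its pad_with_space switch are made up for this illustration and are not part of csci133cleanup.py.

```python
ABC = 'abcdefghijklmnopqrstuvwxyz'

def clean(text, pad_with_space=True):
    """Keep only the letters a-z; optionally turn everything else into a space."""
    cleaned = ''
    for character in text.lower():
        if character in ABC:
            cleaned += character
        elif pad_with_space:
            # Replace the unwanted character with a space so words stay apart
            cleaned += ' '
    return cleaned

print(clean('Doctor--John', pad_with_space=False).split())  # ['doctorjohn']
print(clean('Doctor--John').split())                        # ['doctor', 'john']
```

Because split() with no arguments treats any run of spaces as one separator, the double space left by the two dashes does no harm.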
Of course this is not without its problems. For example, we will be left with a lot of stray 's' entries (an apostrophe becomes a space, so John's turns into john and s), and common words show up again and again. That is why we check whether a word is already in the list (the "if word not in wordList" test) and append it only when it is new. *Depending on your needs, you may want to record the line number of every occurrence instead. See the next example.
# Version 2 of csci133cleanup.py
# Insert the line numbers into the dictionary
wordList = {} # Create a dictionary to store them
abc = 'abcdefghijklmnopqrstuvwxyz'

with open('novel.text') as book:
    lineNumber = 1 # Starting at line 1, set once before the loop
    for line in book:
        cleanline = ''
        for character in line.lower():
            if character in abc:
                cleanline += character
            else:
                # Important! We have to append a space!
                cleanline += ' '
        for word in cleanline.split():
            if word in wordList:
                wordList[word].append(lineNumber)
            else:
                # Store the value as a list that contain 1 item
                wordList[word] = [lineNumber]
        lineNumber += 1
Take a moment to read and compare the code. The very first line is different: we are using a dictionary instead of a list. When we want to check whether a word is already stored, the dictionary's built-in key lookup finds it directly instead of going through the items one by one. The other difference is that we now append the line number to a list kept for each word. There is an interesting detail in the else branch, where a word is stored for the first time:
wordList[word] = [lineNumber]
Notice that we cannot use wordList[word] = lineNumber. Because we are creating the first value for this dictionary key, we create it as a list that contains one integer. I was not aware of this when I was learning Python, and I kept running into an error because I stored a single integer. When I later tried to append to that integer, it did not work, since an integer has no append method.
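Here is a short sketch of that mistake and its fix. The dictionary name word_lines and the sample numbers are invented for the example.

```python
word_lines = {}

# Wrong: the first value is a plain integer, which has no append method.
word_lines['apple'] = 2
try:
    word_lines['apple'].append(25)
    error_message = None
except AttributeError as error:
    error_message = str(error)
print(error_message)  # Python reports that int has no attribute 'append'

# Right: the first value is a one-item list, so later appends work.
word_lines['apple'] = [2]
word_lines['apple'].append(25)
print(word_lines)  # {'apple': [2, 25]}
```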

In the last version we want to search the dictionary we just created. Take a look at the last couple of lines.
# Version 3 of csci133cleanup.py
# This version includes parts 1 - 3
wordList = {}
abc = 'abcdefghijklmnopqrstuvwxyz'

with open('novel.text') as book:
    lineNumber = 1
    for line in book:
        cleanline = ''
        for character in line.lower():
            if character in abc:
                cleanline += character
            else:
                cleanline += ' '
        for word in cleanline.split():
            if word in wordList:
                # do something, such as append line number
                wordList[word].append(lineNumber)
            else:
                wordList[word] = [lineNumber]
        lineNumber += 1

while True:
    word = input('Enter a word here: ')
    if word in wordList:
        print('Found on lines:', wordList[word])
    else:
        print('Not found.')
For reference, once the file is loaded wordList looks something like this:
wordList = {'apple': [2, 25, 55, 100], 'banana': [5, 10, 36, 90], ...}
This is the first time we see a while statement in Python, and its structure is simple. The loop checks its condition first; if the condition is true, it executes all the code inside once and then checks the condition again. If it is still true, it runs again; if not, the loop exits and the program goes on to the next statement. Here the condition is True, which means the loop will run forever, until we kill it with a keyboard interrupt.
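If you would rather not kill the program every time, one possible variation is to leave the loop with break when the user enters a blank line. This is only a sketch: the function lookup_loop, its read parameter, and the blank-line rule are my own additions, not part of the program above. It also lowercases the typed word, since every key in the dictionary is lowercase.

```python
def lookup_loop(word_index, read=input):
    """Ask for words until a blank line is entered; return the answers given."""
    answers = []
    while True:
        word = read('Enter a word (blank to quit): ').strip().lower()
        if word == '':
            break  # leave the loop without needing Control + C
        if word in word_index:
            answers.append((word, word_index[word]))
            print('Found on lines:', word_index[word])
        else:
            answers.append((word, None))
            print('Not found.')
    return answers

# Example call (waits for keyboard input, so it is left commented out):
# lookup_loop({'apple': [2, 25, 55, 100], 'banana': [5, 10, 36, 90]})
```

The read parameter defaults to the built-in input, so the loop behaves the same at the keyboard, but it can also be fed prepared answers when you want to try it without typing.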

Keyboard interrupt hot key: Control + C
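To try the whole idea without a novel file on disk, the same steps can be run on a plain list of strings. This is a sketch under a few assumptions: build_index and the sample sentences are invented here, and it uses Python's built-in enumerate to count lines in place of the manual lineNumber counter.

```python
ABC = 'abcdefghijklmnopqrstuvwxyz'

def build_index(lines):
    """Map each cleaned word to the list of line numbers where it appears."""
    index = {}
    # enumerate(..., start=1) hands out 1, 2, 3, ... alongside each line
    for line_number, line in enumerate(lines, start=1):
        cleanline = ''
        for character in line.lower():
            cleanline += character if character in ABC else ' '
        for word in cleanline.split():
            if word in index:
                index[word].append(line_number)
            else:
                # First sighting: store a one-item list, not a bare integer
                index[word] = [line_number]
    return index

sample = ['Doctor--John ate an apple.', "John's apple was red."]
print(build_index(sample)['apple'])  # [1, 2]
print(build_index(sample)['s'])      # [2]
```

The second print shows the stray 's' left behind by the apostrophe in John's.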
