Wednesday, January 5, 2011

FYP Post 3: Chaining of multiple commands

I have just finished a new function, ie_preprocess, that chains three NLTK commands: it segments the raw text into sentences, splits each sentence into words, and tags each word with its part of speech. Here's the code.

>>> document = "The average compact point-and-shoot camera has a 4x or 5x zoom lens in it. Anyone who's ever tried with one of these cameras to get in closer to a player on the field at a sporting event or snap off a shot of a bird without scaring it away knows that such a short range doesn't cut it. The obvious solution is to get something with a longer zoom, but that means a larger camera, too, that might no longer fit in your pocket. Or could it? With more megapixels waning as a marketable feature, increased optical zooms and/or wide-angle lenses are supplanting that spec on compact cameras. If all you want is to get in a little tighter on the action, the 14-megapixel Sony Cyber-shot DSC-W370's 7x optical zoom or 12-megapixel Panasonic Lumix DMC-ZR3's 8x zoom lens should satisfy. Those who are really tired of their short zooms will probably want to step up to a 10x zoom lens like those on the Casio Exilim EX-FH100 and Sony Cyber-shot DSC-H55. They can be a little uncomfortable for smaller pockets, but can definitely fit in a jacket pocket or bag. Reaching a little farther beyond those is the Panasonic DMC-ZS5, which packs a wide-angle lens with a 12x zoom range into what's still a fairly compact body and the 14x Canon PowerShot SX210 IS. There are a couple things to keep in mind about these cameras. Longer lenses, as well as wider ones, typically cause a bit of distortion, which some models correct for when processing photos. Plus, because the lens glass isn't always top quality, the images from these cameras can be soft. Also, generally speaking, the longer the lens, the slower the camera's performance. We've learned not to expect really fast start-up and shot-to-shot times on models with a 10x zoom or greater."
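>>> import nltk
>>> def ie_preprocess(document):
    # Each step takes the previous step's output as its input.
    sentences = nltk.sent_tokenize(document)                      # text -> list of sentences
    sentences = [nltk.word_tokenize(sent) for sent in sentences]  # sentence -> list of words
    sentences = [nltk.pos_tag(sent) for sent in sentences]        # word -> (word, tag) pair
    return sentences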

>>> ie_preprocess(document)
[[('The', 'DT'), ('average', 'JJ'), ('compact', 'NN'), ('point-and-shoot', 'NN'), ('camera', 'NN'), ('has', 'VBZ'), ('a', 'DT'), ('4x', 'CD'), ('or', 'CC'), ('5x', 'CD'), ('zoom', 'NN'), ('lens', 'NNS'), ('in', 'IN'), ('it', 'PRP'), ('.', '.')], [('Anyone', 'NN'), ('who', 'WP'), ("'s", 'VBZ'), ('ever', 'RB'), ('tried', 'VBN'), ('with', 'IN'), ('one', 'CD'), ('of', 'IN'), ('these', 'DT'), ('cameras', 'NNS'), ('to', 'TO'), ('get', 'VB'), ('in', 'IN'), ('closer', 'JJR'), ('to', 'TO'), ('a', 'DT'), ('player', 'NN'), ('on', 'IN'), ('the', 'DT'), ('field', 'NN'), ('at', 'IN'), ('a', 'DT'), ('sporting', 'NN'), ('event', 'NN'), ('or', 'CC'), ('snap', 'VB'), ('off', 'RP'), ('a', 'DT'), ('shot', 'NN'), ('of', 'IN'), ('a', 'DT'), ('bird', 'JJ'), ('without', 'IN'), ('scaring', 'NN'), ('it', 'PRP'), ('away', 'RB'), ('knows', 'VBZ'), ('that', 'IN'), ('such', 'JJ'), ('a', 'DT'), ('short', 'JJ'), ('range', 'NN'), ('does', 'VBZ'), ("n't", 'RB'), ('cut', 'VB'), ('it', 'PRP'), ('.', '.')], [('The', 'DT'), ('obvious', 'JJ'), ('solution', 'NN'), ('is', 'VBZ'), ('to', 'TO'), ('get', 'VB'), ('something', 'NN'), ('with', 'IN'), ('a', 'DT'), ('longer', 'JJR'), ('zoom', 'NN'), (',', ','), ('but', 'CC'), ('that', 'IN'), ('means', 'NNS'), ('a', 'DT'), ('larger', 'JJR'), ('camera', 'NN'), (',', ','), ('too', 'RB'), (',', ','), ('that', 'IN'), ('might', 'MD'), ('no', 'RB'), ('longer', 'RBR'), ('fit', 'JJ'), ('in', 'IN'), ('your', 'PRP$'), ('pocket', 'NN'), ('.', '.')], [('Or', 'CC'), ('could', 'MD'), ('it', 'PRP'), ('?', '.')], [('With', 'IN'), ('more', 'JJR'), ('megapixels', 'NNS'), ('waning', 'VBG'), ('as', 'IN'), ('a', 'DT'), ('marketable', 'JJ'), ('feature', 'NN'), (',', ','), ('increased', 'VBD'), ('optical', 'JJ'), ('zooms', 'NNS'), ('and/or', 'JJ'), ('wide-angle', 'JJ'), ('lenses', 'NNS'), ('are', 'VBP'), ('supplanting', 'VBG'), ('that', 'IN'), ('spec', 'NN'), ('on', 'IN'), ('compact', 'NN'), ('cameras', 'NNS'), ('.', '.')], [('If', 'IN'), ('all', 'DT'), ('you', 'PRP'), ('want', 'VBP'), 
('is', 'VBZ'), ('to', 'TO'), ('get', 'VB'), ('in', 'IN'), ('a', 'DT'), ('little', 'RB'), ('tighter', 'NN'), ('on', 'IN'), ('the', 'DT'), ('action', 'NN'), (',', ','), ('the', 'DT'), ('14-megapixel', 'JJ'), ('Sony', 'NNP'), ('Cyber-shot', 'JJ'), ('DSC-W370', '-NONE-'), ("'s", 'VBZ'), ('7x', 'CD'), ('optical', 'JJ'), ('zoom', 'NN'), ('or', 'CC'), ('12-megapixel', 'CD'), ('Panasonic', 'NNP'), ('Lumix', 'NNP'), ('DMC-ZR3', 'NNP'), ("'s", 'POS'), ('8x', 'CD'), ('zoom', 'NN'), ('lens', 'NNS'), ('should', 'MD'), ('satisfy', 'VB'), ('.', '.')], [('Those', 'DT'), ('who', 'WP'), ('are', 'VBP'), ('really', 'RB'), ('tired', 'VBN'), ('of', 'IN'), ('their', 'PRP$'), ('short', 'JJ'), ('zooms', 'NNS'), ('will', 'MD'), ('probably', 'RB'), ('want', 'VB'), ('to', 'TO'), ('step', 'VB'), ('up', 'RP'), ('to', 'TO'), ('a', 'DT'), ('10x', 'CD'), ('zoom', 'NN'), ('lens', 'NNS'), ('like', 'IN'), ('those', 'DT'), ('on', 'IN'), ('the', 'DT'), ('Casio', 'NNP'), ('Exilim', 'NNP'), ('EX-FH100', 'NNP'), ('and', 'CC'), ('Sony', 'NNP'), ('Cyber-shot', 'NNP'), ('DSC-H55', 'NNP'), ('.', '.')], [('They', 'PRP'), ('can', 'MD'), ('be', 'VB'), ('a', 'DT'), ('little', 'RB'), ('uncomfortable', 'JJ'), ('for', 'IN'), ('smaller', 'JJR'), ('pockets', 'NNS'), (',', ','), ('but', 'CC'), ('can', 'MD'), ('definitely', 'RB'), ('fit', 'VB'), ('in', 'IN'), ('a', 'DT'), ('jacket', 'NN'), ('pocket', 'NN'), ('or', 'CC'), ('bag', 'NN'), ('.', '.')], [('Reaching', 'VBG'), ('a', 'DT'), ('little', 'RB'), ('farther', 'RBR'), ('beyond', 'IN'), ('those', 'DT'), ('is', 'VBZ'), ('the', 'DT'), ('Panasonic', 'NNP'), ('DMC-ZS5', 'NNP'), (',', ','), ('which', 'WDT'), ('packs', 'NNS'), ('a', 'DT'), ('wide-angle', 'JJ'), ('lens', 'NN'), ('with', 'IN'), ('a', 'DT'), ('12x', 'CD'), ('zoom', 'NN'), ('range', 'NN'), ('into', 'IN'), ('what', 'WP'), ("'s", 'POS'), ('still', 'RB'), ('a', 'DT'), ('fairly', 'RB'), ('compact', 'JJ'), ('body', 'NN'), ('and', 'CC'), ('the', 'DT'), ('14x', 'CD'), ('Canon', 'NNP'), ('PowerShot', 'NNP'), ('SX210', 
'NNP'), ('IS', 'NNP'), ('.', '.')], [('There', 'EX'), ('are', 'VBP'), ('a', 'DT'), ('couple', 'NN'), ('things', 'NNS'), ('to', 'TO'), ('keep', 'VB'), ('in', 'IN'), ('mind', 'NN'), ('about', 'IN'), ('these', 'DT'), ('cameras', 'NNS'), ('.', '.')], [('Longer', 'NNP'), ('lenses', 'NNS'), (',', ','), ('as', 'IN'), ('well', 'RB'), ('as', 'IN'), ('wider', 'NN'), ('ones', 'NNS'), (',', ','), ('typically', 'RB'), ('cause', 'VB'), ('a', 'DT'), ('bit', 'NN'), ('of', 'IN'), ('distortion', 'NN'), (',', ','), ('which', 'WDT'), ('some', 'DT'), ('models', 'NNS'), ('correct', 'VBP'), ('for', 'IN'), ('when', 'WRB'), ('processing', 'VBG'), ('photos', 'NNS'), ('.', '.')], [('Plus', 'NNP'), (',', ','), ('because', 'IN'), ('the', 'DT'), ('lens', 'NNS'), ('glass', 'NN'), ('is', 'VBZ'), ("n't", 'RB'), ('always', 'RB'), ('top', 'VB'), ('quality', 'NN'), (',', ','), ('the', 'DT'), ('images', 'NNS'), ('from', 'IN'), ('these', 'DT'), ('cameras', 'NNS'), ('can', 'MD'), ('be', 'VB'), ('soft', 'VBN'), ('.', '.')], [('Also', 'RB'), (',', ','), ('generally', 'RB'), ('speaking', 'VBG'), (',', ','), ('the', 'DT'), ('longer', 'NN'), ('the', 'DT'), ('lens', 'NN'), (',', ','), ('the', 'DT'), ('slower', 'NN'), ('the', 'DT'), ('camera', 'NN'), ("'s", 'POS'), ('performance', 'NN'), ('.', '.')], [('We', 'PRP'), ("'ve", 'VBP'), ('learned', 'VBN'), ('not', 'RB'), ('to', 'TO'), ('expect', 'VB'), ('really', 'RB'), ('fast', 'JJ'), ('start-up', 'NN'), ('and', 'CC'), ('shot-to-shot', 'JJ'), ('times', 'NNS'), ('on', 'IN'), ('models', 'NNS'), ('with', 'IN'), ('a', 'DT'), ('10x', 'CD'), ('zoom', 'NN'), ('or', 'CC'), ('greater', 'JJR'), ('.', '.')]]
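
The result is a list of sentences, and each sentence is a list of (word, tag) pairs. As a rough sketch of what could be done with that structure next, the snippet below counts the tags and pulls out the nouns. It is only an illustration: the names tagged, tag_counts and nouns are made up for this example, and the sample is the first sentence of the output above copied by hand, so it runs without NLTK.

```python
from collections import Counter

# First sentence of the output above, copied by hand (not re-tagged here).
tagged = [[('The', 'DT'), ('average', 'JJ'), ('compact', 'NN'),
           ('point-and-shoot', 'NN'), ('camera', 'NN'), ('has', 'VBZ'),
           ('a', 'DT'), ('4x', 'CD'), ('or', 'CC'), ('5x', 'CD'),
           ('zoom', 'NN'), ('lens', 'NNS'), ('in', 'IN'), ('it', 'PRP'),
           ('.', '.')]]

# Count how often each tag occurs across all sentences.
tag_counts = Counter(tag for sent in tagged for _, tag in sent)

# Keep every word whose tag starts with 'NN' (NN, NNS, NNP, ...).
nouns = [word for sent in tagged for word, tag in sent if tag.startswith('NN')]

print(tag_counts)
print(nouns)
```

To run this on the whole document, tagged could simply be replaced with the return value of ie_preprocess(document).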
