NLP - Sentiment Analysis - Movie Reviews <a name=top></a>

"This is a failure of epic proportions. You’ve got to be a genius to make a movie this bad." (J. Siegel, review of The Bonfire of the Vanities)


In this notebook, you will revisit sentiment analysis using Python.

Our goal is to develop a sentiment analysis model for movie reviews. The dataset contains 50,000 movie reviews labeled as either positive or negative. With an accurate sentiment model, we can automatically classify new reviews and aggregate review data.

CONTENTS

  1. Dataset Information
  2. Preamble
  3. Data Preparation
  4. Bag of Words Processing
  5. Multinomial Naive Bayes
  6. Performance Evaluation
  7. Vader

DATASET INFORMATION<a name=dataset></a>

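The Large Movie Review Dataset v1.0 was introduced in

Maas, Andrew L., Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning Word Vectors for Sentiment Analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 142-150.

The README distributed with the dataset also cites

Potts, Christopher. 2011. On the negativity of negation. In Nan Li and David Lutz, eds., Proceedings of Semantics and Linguistic Theory 20, 636-659.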

Overview

This dataset contains movie reviews along with their associated binary sentiment polarity labels. It is intended to serve as a benchmark for sentiment classification. This document outlines how the dataset was gathered, and how to use the files provided.

Dataset

The core dataset contains 50,000 reviews split evenly into 25k train and 25k test sets. The overall distribution of labels is balanced (25k pos and 25k neg). We also include an additional 50,000 unlabeled documents for unsupervised learning.

In the entire collection, no more than 30 reviews are allowed for any given movie because reviews for the same movie tend to have correlated ratings. Further, the train and test sets contain disjoint sets of movies, so no significant performance gain can be obtained by memorizing movie-unique terms and their association with observed labels. In the labeled train/test sets, a negative review has a score <= 4 out of 10, and a positive review has a score >= 7 out of 10. Thus reviews with more neutral ratings (5 or 6) are not included in the train/test sets. In the unsupervised set, reviews of any rating are included, and there are equal numbers of reviews with ratings > 5 and <= 5.
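
The labeling rule above can be written down in a few lines. This is only an illustrative sketch: the function name label_from_rating is hypothetical, and the 1/0 encoding for positive/negative is chosen to match the labels used later in this notebook.

```python
def label_from_rating(rating):
    """Map a 1-10 star rating to a binary sentiment label.

    Returns 0 for negative (rating <= 4), 1 for positive (rating >= 7),
    and None for the neutral ratings (5 and 6) that are excluded from
    the labeled train/test sets.
    """
    if not 1 <= rating <= 10:
        raise ValueError("rating must be between 1 and 10, got {}".format(rating))
    if rating <= 4:
        return 0
    if rating >= 7:
        return 1
    return None


# Labels for every possible rating, from 1 to 10
print([label_from_rating(r) for r in range(1, 11)])
# [0, 0, 0, 0, None, None, 1, 1, 1, 1]
```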

Files

There are two top-level directories [train/, test/] corresponding to the training and test sets. Each contains [pos/, neg/] directories for the reviews with binary labels positive and negative. Within these directories, reviews are stored in text files named following the convention [[id]_[rating].txt] where [id] is a unique id and [rating] is the star rating for that review on a 1-10 scale. For example, the file [test/pos/200_8.txt] is the text for a positive-labeled test set example with unique id 200 and star rating 8/10 from IMDb. The [train/unsup/] directory has 0 for all ratings because the ratings are omitted for this portion of the dataset.
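
Because the id and the star rating are encoded in each file name, they can be recovered without opening the file. The sketch below assumes names of the form [id]_[rating].txt as described above; the helper name parse_review_path is hypothetical and is not used elsewhere in this notebook.

```python
import os
import re

# File names follow the convention [id]_[rating].txt, e.g. 200_8.txt
_NAME_RE = re.compile(r"^(\d+)_(\d+)\.txt$")


def parse_review_path(path):
    """Return (review_id, rating) extracted from a review file path."""
    match = _NAME_RE.match(os.path.basename(path))
    if match is None:
        raise ValueError("unexpected review file name: {}".format(path))
    return int(match.group(1)), int(match.group(2))


print(parse_review_path("test/pos/200_8.txt"))
# (200, 8)
```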



PREAMBLE<a name=preamble></a>

As always, we first need to import the appropriate Python modules. For this exercise, we'll use the Natural Language Toolkit (NLTK).

The stemmer and the tokenizer functions are used for text processing. The VADER lexicon is used to score the intensity of the sentiment expressed in a text.

In [16]:
import nltk
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
from nltk.sentiment import SentimentAnalyzer
from nltk.sentiment.util import mark_negation, extract_unigram_feats
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem.lancaster import LancasterStemmer
stemmer = LancasterStemmer()

import glob
data_path = 'Data/aclImdb/'

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from nltk.sentiment.vader import SentimentIntensityAnalyzer
In [17]:
nltk.download('all')
[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | (per-package messages omitted: every package is already up-to-date)
[nltk_data]    | 
[nltk_data]  Done downloading collection all
Out[17]:
True
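
Downloading the whole 'all' collection is slow and takes a lot of disk space. A lighter alternative is to fetch only the resources the imports above appear to rely on. The package list below is an assumption inferred from those imports (punkt for word_tokenize and sent_tokenize, vader_lexicon for SentimentIntensityAnalyzer, movie_reviews for the corpus reader); recent NLTK releases may ask for punkt_tab as well. The helper name download_required is hypothetical.

```python
# Assumed minimal set of NLTK resources for this notebook (not confirmed
# by the original text, which downloads the whole 'all' collection).
REQUIRED_PACKAGES = ("punkt", "vader_lexicon", "movie_reviews")


def download_required(packages=REQUIRED_PACKAGES, download=None):
    """Download each named NLTK resource and report which ones succeeded.

    `download` defaults to nltk.download, which returns True on success.
    It can be swapped for another callable, e.g. for a dry run.
    """
    if download is None:
        import nltk  # imported lazily so the helper can be defined without NLTK
        download = nltk.download
    return {name: bool(download(name)) for name in packages}


# Example (requires NLTK and network access):
# download_required()
```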


DATA PREPARATION<a name=prep></a>

Each review is stored in its own text file. There are four folders, one for each combination of split (training or testing) and label (positive or negative).

In [18]:
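train_docs = []
train_labels = []

# Positive training reviews are labeled 1
pos_file_names = glob.glob('{}train/pos/*.txt'.format(data_path))
for file_name in pos_file_names:
    # Read as UTF-8 so the result does not depend on the platform default,
    # and close each file as soon as it has been read
    with open(file_name, encoding='utf-8') as f:
        train_docs.append(f.read())
    train_labels.append(1)

# Negative training reviews are labeled 0
neg_file_names = glob.glob('{}train/neg/*.txt'.format(data_path))
for file_name in neg_file_names:
    with open(file_name, encoding='utf-8') as f:
        train_docs.append(f.read())
    train_labels.append(0)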
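
The test folders have the same layout, so the loading loop can be wrapped in a reusable function. The sketch below is one possible way to do this, not the notebook's own code: the name load_reviews and its parameters are hypothetical, and the file names are sorted (an added choice) so that the document order is reproducible across machines.

```python
import glob
import os


def load_reviews(data_path, split):
    """Load the reviews of one split ('train' or 'test').

    Returns (docs, labels), where positive reviews are labeled 1 and
    negative reviews 0, with all positives listed first.
    """
    docs, labels = [], []
    for folder, label in (("pos", 1), ("neg", 0)):
        pattern = os.path.join(data_path, split, folder, "*.txt")
        # glob makes no ordering guarantee, so sort for reproducibility
        for file_name in sorted(glob.glob(pattern)):
            with open(file_name, encoding="utf-8") as f:
                docs.append(f.read())
            labels.append(label)
    return docs, labels


# Example, assuming the dataset has been unpacked under Data/aclImdb/:
# train_docs, train_labels = load_reviews('Data/aclImdb/', 'train')
# test_docs, test_labels = load_reviews('Data/aclImdb/', 'test')
```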

The file names of the positive reviews are stored in the list pos_file_names; those of the negative reviews are stored in neg_file_names.

In [19]:
pos_file_names
Out[19]:
['Data/aclImdb/train/pos/0_9.txt',
 'Data/aclImdb/train/pos/10000_8.txt',
 'Data/aclImdb/train/pos/10001_10.txt',
 'Data/aclImdb/train/pos/10002_7.txt',
 'Data/aclImdb/train/pos/10003_8.txt',
 'Data/aclImdb/train/pos/10004_8.txt',
 'Data/aclImdb/train/pos/10005_7.txt',
 'Data/aclImdb/train/pos/10006_7.txt',
 'Data/aclImdb/train/pos/10007_7.txt',
 'Data/aclImdb/train/pos/10008_7.txt',
 'Data/aclImdb/train/pos/10009_9.txt',
 'Data/aclImdb/train/pos/1000_8.txt',
 'Data/aclImdb/train/pos/10010_7.txt',
 'Data/aclImdb/train/pos/10011_9.txt',
 'Data/aclImdb/train/pos/10012_8.txt',
 'Data/aclImdb/train/pos/10013_7.txt',
 'Data/aclImdb/train/pos/10014_8.txt',
 'Data/aclImdb/train/pos/10015_8.txt',
 'Data/aclImdb/train/pos/10016_8.txt',
 'Data/aclImdb/train/pos/10017_9.txt',
 'Data/aclImdb/train/pos/10018_8.txt',
 'Data/aclImdb/train/pos/10019_8.txt',
 'Data/aclImdb/train/pos/1001_8.txt',
 'Data/aclImdb/train/pos/10020_8.txt',
 'Data/aclImdb/train/pos/10021_8.txt',
 'Data/aclImdb/train/pos/10022_7.txt',
 'Data/aclImdb/train/pos/10023_9.txt',
 'Data/aclImdb/train/pos/10024_9.txt',
 'Data/aclImdb/train/pos/10025_9.txt',
 'Data/aclImdb/train/pos/10026_7.txt',
 'Data/aclImdb/train/pos/10027_7.txt',
 'Data/aclImdb/train/pos/10028_10.txt',
 'Data/aclImdb/train/pos/10029_10.txt',
 'Data/aclImdb/train/pos/1002_7.txt',
 'Data/aclImdb/train/pos/10030_10.txt',
 'Data/aclImdb/train/pos/10031_10.txt',
 'Data/aclImdb/train/pos/10032_10.txt',
 'Data/aclImdb/train/pos/10033_10.txt',
 'Data/aclImdb/train/pos/10034_8.txt',
 'Data/aclImdb/train/pos/10035_9.txt',
 'Data/aclImdb/train/pos/10036_8.txt',
 'Data/aclImdb/train/pos/10037_9.txt',
 'Data/aclImdb/train/pos/10038_10.txt',
 'Data/aclImdb/train/pos/10039_10.txt',
 'Data/aclImdb/train/pos/1003_10.txt',
 'Data/aclImdb/train/pos/10040_10.txt',
 'Data/aclImdb/train/pos/10041_10.txt',
 'Data/aclImdb/train/pos/10042_10.txt',
 'Data/aclImdb/train/pos/10043_10.txt',
 'Data/aclImdb/train/pos/10044_9.txt',
 'Data/aclImdb/train/pos/10045_10.txt',
 'Data/aclImdb/train/pos/10046_9.txt',
 'Data/aclImdb/train/pos/10047_10.txt',
 'Data/aclImdb/train/pos/10048_10.txt',
 'Data/aclImdb/train/pos/10049_8.txt',
 'Data/aclImdb/train/pos/1004_7.txt',
 'Data/aclImdb/train/pos/10050_10.txt',
 'Data/aclImdb/train/pos/10051_10.txt',
 'Data/aclImdb/train/pos/10052_10.txt',
 'Data/aclImdb/train/pos/10053_8.txt',
 'Data/aclImdb/train/pos/10054_10.txt',
 'Data/aclImdb/train/pos/10055_7.txt',
 'Data/aclImdb/train/pos/10056_8.txt',
 'Data/aclImdb/train/pos/10057_9.txt',
 'Data/aclImdb/train/pos/10058_7.txt',
 'Data/aclImdb/train/pos/10059_10.txt',
 'Data/aclImdb/train/pos/1005_10.txt',
 'Data/aclImdb/train/pos/10060_9.txt',
 'Data/aclImdb/train/pos/10061_8.txt',
 'Data/aclImdb/train/pos/10062_10.txt',
 'Data/aclImdb/train/pos/10063_9.txt',
 'Data/aclImdb/train/pos/10064_10.txt',
 'Data/aclImdb/train/pos/10065_9.txt',
 'Data/aclImdb/train/pos/10066_10.txt',
 'Data/aclImdb/train/pos/10067_9.txt',
 'Data/aclImdb/train/pos/10068_8.txt',
 'Data/aclImdb/train/pos/10069_8.txt',
 'Data/aclImdb/train/pos/1006_8.txt',
 'Data/aclImdb/train/pos/10070_9.txt',
 'Data/aclImdb/train/pos/10071_9.txt',
 'Data/aclImdb/train/pos/10072_9.txt',
 'Data/aclImdb/train/pos/10073_10.txt',
 'Data/aclImdb/train/pos/10074_9.txt',
 'Data/aclImdb/train/pos/10075_9.txt',
 'Data/aclImdb/train/pos/10076_9.txt',
 'Data/aclImdb/train/pos/10077_10.txt',
 'Data/aclImdb/train/pos/10078_8.txt',
 'Data/aclImdb/train/pos/10079_8.txt',
 'Data/aclImdb/train/pos/1007_10.txt',
 'Data/aclImdb/train/pos/10080_10.txt',
 'Data/aclImdb/train/pos/10081_9.txt',
 'Data/aclImdb/train/pos/10082_10.txt',
 'Data/aclImdb/train/pos/10083_7.txt',
 'Data/aclImdb/train/pos/10084_10.txt',
 'Data/aclImdb/train/pos/10085_10.txt',
 'Data/aclImdb/train/pos/10086_7.txt',
 'Data/aclImdb/train/pos/10087_10.txt',
 'Data/aclImdb/train/pos/10088_10.txt',
 'Data/aclImdb/train/pos/10089_7.txt',
 'Data/aclImdb/train/pos/1008_10.txt',
 'Data/aclImdb/train/pos/10090_8.txt',
 'Data/aclImdb/train/pos/10091_7.txt',
 'Data/aclImdb/train/pos/10092_8.txt',
 'Data/aclImdb/train/pos/10093_7.txt',
 'Data/aclImdb/train/pos/10094_7.txt',
 'Data/aclImdb/train/pos/10095_7.txt',
 'Data/aclImdb/train/pos/10096_7.txt',
 'Data/aclImdb/train/pos/10097_9.txt',
 'Data/aclImdb/train/pos/10098_10.txt',
 'Data/aclImdb/train/pos/10099_10.txt',
 'Data/aclImdb/train/pos/1009_8.txt',
 'Data/aclImdb/train/pos/100_7.txt',
 'Data/aclImdb/train/pos/10100_10.txt',
 'Data/aclImdb/train/pos/10101_8.txt',
 'Data/aclImdb/train/pos/10102_7.txt',
 'Data/aclImdb/train/pos/10103_8.txt',
 'Data/aclImdb/train/pos/10104_10.txt',
 'Data/aclImdb/train/pos/10105_8.txt',
 'Data/aclImdb/train/pos/10106_8.txt',
 'Data/aclImdb/train/pos/10107_8.txt',
 'Data/aclImdb/train/pos/10108_10.txt',
 'Data/aclImdb/train/pos/10109_10.txt',
 'Data/aclImdb/train/pos/1010_10.txt',
 'Data/aclImdb/train/pos/10110_10.txt',
 'Data/aclImdb/train/pos/10111_7.txt',
 'Data/aclImdb/train/pos/10112_7.txt',
 'Data/aclImdb/train/pos/10113_10.txt',
 'Data/aclImdb/train/pos/10114_10.txt',
 'Data/aclImdb/train/pos/10115_10.txt',
 'Data/aclImdb/train/pos/10116_10.txt',
 'Data/aclImdb/train/pos/10117_8.txt',
 'Data/aclImdb/train/pos/10118_7.txt',
 'Data/aclImdb/train/pos/10119_7.txt',
 'Data/aclImdb/train/pos/1011_10.txt',
 'Data/aclImdb/train/pos/10120_7.txt',
 'Data/aclImdb/train/pos/10121_8.txt',
 'Data/aclImdb/train/pos/10122_7.txt',
 'Data/aclImdb/train/pos/10123_10.txt',
 'Data/aclImdb/train/pos/10124_8.txt',
 'Data/aclImdb/train/pos/10125_8.txt',
 'Data/aclImdb/train/pos/10126_10.txt',
 'Data/aclImdb/train/pos/10127_8.txt',
 'Data/aclImdb/train/pos/10128_9.txt',
 'Data/aclImdb/train/pos/10129_7.txt',
 'Data/aclImdb/train/pos/1012_10.txt',
 'Data/aclImdb/train/pos/10130_10.txt',
 'Data/aclImdb/train/pos/10131_10.txt',
 'Data/aclImdb/train/pos/10132_9.txt',
 'Data/aclImdb/train/pos/10133_7.txt',
 'Data/aclImdb/train/pos/10134_7.txt',
 'Data/aclImdb/train/pos/10135_7.txt',
 'Data/aclImdb/train/pos/10136_7.txt',
 'Data/aclImdb/train/pos/10137_7.txt',
 'Data/aclImdb/train/pos/10138_8.txt',
 'Data/aclImdb/train/pos/10139_8.txt',
 'Data/aclImdb/train/pos/1013_9.txt',
 'Data/aclImdb/train/pos/10140_8.txt',
 'Data/aclImdb/train/pos/10141_9.txt',
 'Data/aclImdb/train/pos/10142_8.txt',
 'Data/aclImdb/train/pos/10143_8.txt',
 'Data/aclImdb/train/pos/10144_8.txt',
 'Data/aclImdb/train/pos/10145_8.txt',
 'Data/aclImdb/train/pos/10146_7.txt',
 'Data/aclImdb/train/pos/10147_10.txt',
 'Data/aclImdb/train/pos/10148_10.txt',
 'Data/aclImdb/train/pos/10149_9.txt',
 'Data/aclImdb/train/pos/1014_9.txt',
 'Data/aclImdb/train/pos/10150_9.txt',
 'Data/aclImdb/train/pos/10151_8.txt',
 'Data/aclImdb/train/pos/10152_9.txt',
 'Data/aclImdb/train/pos/10153_9.txt',
 'Data/aclImdb/train/pos/10154_8.txt',
 'Data/aclImdb/train/pos/10155_9.txt',
 'Data/aclImdb/train/pos/10156_10.txt',
 'Data/aclImdb/train/pos/10157_10.txt',
 'Data/aclImdb/train/pos/10158_7.txt',
 'Data/aclImdb/train/pos/10159_7.txt',
 'Data/aclImdb/train/pos/1015_10.txt',
 'Data/aclImdb/train/pos/10160_7.txt',
 'Data/aclImdb/train/pos/10161_9.txt',
 'Data/aclImdb/train/pos/10162_9.txt',
 'Data/aclImdb/train/pos/10163_8.txt',
 'Data/aclImdb/train/pos/10164_7.txt',
 'Data/aclImdb/train/pos/10165_7.txt',
 'Data/aclImdb/train/pos/10166_7.txt',
 'Data/aclImdb/train/pos/10167_7.txt',
 'Data/aclImdb/train/pos/10168_8.txt',
 'Data/aclImdb/train/pos/10169_7.txt',
 'Data/aclImdb/train/pos/1016_8.txt',
 'Data/aclImdb/train/pos/10170_8.txt',
 'Data/aclImdb/train/pos/10171_7.txt',
 'Data/aclImdb/train/pos/10172_8.txt',
 'Data/aclImdb/train/pos/10173_8.txt',
 'Data/aclImdb/train/pos/10174_7.txt',
 'Data/aclImdb/train/pos/10175_10.txt',
 'Data/aclImdb/train/pos/10176_7.txt',
 'Data/aclImdb/train/pos/10177_9.txt',
 'Data/aclImdb/train/pos/10178_10.txt',
 'Data/aclImdb/train/pos/10179_9.txt',
 'Data/aclImdb/train/pos/1017_8.txt',
 'Data/aclImdb/train/pos/10180_7.txt',
 'Data/aclImdb/train/pos/10181_8.txt',
 'Data/aclImdb/train/pos/10182_8.txt',
 'Data/aclImdb/train/pos/10183_7.txt',
 'Data/aclImdb/train/pos/10184_9.txt',
 'Data/aclImdb/train/pos/10185_10.txt',
 'Data/aclImdb/train/pos/10186_8.txt',
 'Data/aclImdb/train/pos/10187_7.txt',
 'Data/aclImdb/train/pos/10188_8.txt',
 'Data/aclImdb/train/pos/10189_7.txt',
 'Data/aclImdb/train/pos/1018_8.txt',
 'Data/aclImdb/train/pos/10190_7.txt',
 'Data/aclImdb/train/pos/10191_10.txt',
 'Data/aclImdb/train/pos/10192_8.txt',
 'Data/aclImdb/train/pos/10193_9.txt',
 'Data/aclImdb/train/pos/10194_10.txt',
 'Data/aclImdb/train/pos/10195_8.txt',
 'Data/aclImdb/train/pos/10196_10.txt',
 'Data/aclImdb/train/pos/10197_7.txt',
 'Data/aclImdb/train/pos/10198_8.txt',
 'Data/aclImdb/train/pos/10199_7.txt',
 'Data/aclImdb/train/pos/1019_10.txt',
 'Data/aclImdb/train/pos/101_8.txt',
 'Data/aclImdb/train/pos/10200_10.txt',
 'Data/aclImdb/train/pos/10201_10.txt',
 'Data/aclImdb/train/pos/10202_10.txt',
 'Data/aclImdb/train/pos/10203_10.txt',
 'Data/aclImdb/train/pos/10204_8.txt',
 'Data/aclImdb/train/pos/10205_10.txt',
 'Data/aclImdb/train/pos/10206_10.txt',
 'Data/aclImdb/train/pos/10207_10.txt',
 'Data/aclImdb/train/pos/10208_7.txt',
 'Data/aclImdb/train/pos/10209_7.txt',
 'Data/aclImdb/train/pos/1020_10.txt',
 'Data/aclImdb/train/pos/10210_7.txt',
 'Data/aclImdb/train/pos/10211_7.txt',
 'Data/aclImdb/train/pos/10212_8.txt',
 'Data/aclImdb/train/pos/10213_8.txt',
 'Data/aclImdb/train/pos/10214_10.txt',
 'Data/aclImdb/train/pos/10215_10.txt',
 'Data/aclImdb/train/pos/10216_8.txt',
 'Data/aclImdb/train/pos/10217_9.txt',
 'Data/aclImdb/train/pos/10218_8.txt',
 'Data/aclImdb/train/pos/10219_10.txt',
 'Data/aclImdb/train/pos/1021_10.txt',
 'Data/aclImdb/train/pos/10220_7.txt',
 'Data/aclImdb/train/pos/10221_8.txt',
 'Data/aclImdb/train/pos/10222_9.txt',
 'Data/aclImdb/train/pos/10223_10.txt',
 'Data/aclImdb/train/pos/10224_10.txt',
 'Data/aclImdb/train/pos/10225_9.txt',
 'Data/aclImdb/train/pos/10226_10.txt',
 'Data/aclImdb/train/pos/10227_10.txt',
 'Data/aclImdb/train/pos/10228_8.txt',
 'Data/aclImdb/train/pos/10229_8.txt',
 'Data/aclImdb/train/pos/1022_10.txt',
 'Data/aclImdb/train/pos/10230_9.txt',
 'Data/aclImdb/train/pos/10231_10.txt',
 'Data/aclImdb/train/pos/10232_10.txt',
 'Data/aclImdb/train/pos/10233_7.txt',
 'Data/aclImdb/train/pos/10234_10.txt',
 'Data/aclImdb/train/pos/10235_8.txt',
 'Data/aclImdb/train/pos/10236_8.txt',
 'Data/aclImdb/train/pos/10237_10.txt',
 'Data/aclImdb/train/pos/10238_10.txt',
 'Data/aclImdb/train/pos/10239_10.txt',
 'Data/aclImdb/train/pos/1023_10.txt',
 'Data/aclImdb/train/pos/10240_8.txt',
 'Data/aclImdb/train/pos/10241_8.txt',
 'Data/aclImdb/train/pos/10242_8.txt',
 'Data/aclImdb/train/pos/10243_10.txt',
 'Data/aclImdb/train/pos/10244_7.txt',
 'Data/aclImdb/train/pos/10245_10.txt',
 'Data/aclImdb/train/pos/10246_10.txt',
 'Data/aclImdb/train/pos/10247_10.txt',
 'Data/aclImdb/train/pos/10248_7.txt',
 'Data/aclImdb/train/pos/10249_7.txt',
 'Data/aclImdb/train/pos/1024_9.txt',
 'Data/aclImdb/train/pos/10250_10.txt',
 'Data/aclImdb/train/pos/10251_10.txt',
 'Data/aclImdb/train/pos/10252_9.txt',
 'Data/aclImdb/train/pos/10253_10.txt',
 'Data/aclImdb/train/pos/10254_8.txt',
 'Data/aclImdb/train/pos/10255_9.txt',
 'Data/aclImdb/train/pos/10256_8.txt',
 'Data/aclImdb/train/pos/10257_8.txt',
 'Data/aclImdb/train/pos/10258_10.txt',
 'Data/aclImdb/train/pos/10259_8.txt',
 'Data/aclImdb/train/pos/1025_8.txt',
 'Data/aclImdb/train/pos/10260_10.txt',
 'Data/aclImdb/train/pos/10261_8.txt',
 'Data/aclImdb/train/pos/10262_10.txt',
 'Data/aclImdb/train/pos/10263_10.txt',
 'Data/aclImdb/train/pos/10264_10.txt',
 'Data/aclImdb/train/pos/10265_9.txt',
 'Data/aclImdb/train/pos/10266_9.txt',
 'Data/aclImdb/train/pos/10267_8.txt',
 'Data/aclImdb/train/pos/10268_9.txt',
 'Data/aclImdb/train/pos/10269_7.txt',
 'Data/aclImdb/train/pos/1026_9.txt',
 'Data/aclImdb/train/pos/10270_9.txt',
 'Data/aclImdb/train/pos/10271_10.txt',
 'Data/aclImdb/train/pos/10272_10.txt',
 'Data/aclImdb/train/pos/10273_8.txt',
 'Data/aclImdb/train/pos/10274_8.txt',
 'Data/aclImdb/train/pos/10275_10.txt',
 'Data/aclImdb/train/pos/10276_10.txt',
 'Data/aclImdb/train/pos/10277_9.txt',
 'Data/aclImdb/train/pos/10278_7.txt',
 'Data/aclImdb/train/pos/10279_8.txt',
 'Data/aclImdb/train/pos/1027_8.txt',
 'Data/aclImdb/train/pos/10280_10.txt',
 'Data/aclImdb/train/pos/10281_7.txt',
 'Data/aclImdb/train/pos/10282_8.txt',
 'Data/aclImdb/train/pos/10283_10.txt',
 'Data/aclImdb/train/pos/10284_9.txt',
 'Data/aclImdb/train/pos/10285_10.txt',
 'Data/aclImdb/train/pos/10286_9.txt',
 'Data/aclImdb/train/pos/10287_8.txt',
 'Data/aclImdb/train/pos/10288_10.txt',
 'Data/aclImdb/train/pos/10289_10.txt',
 'Data/aclImdb/train/pos/1028_10.txt',
 'Data/aclImdb/train/pos/10290_8.txt',
 'Data/aclImdb/train/pos/10291_7.txt',
 'Data/aclImdb/train/pos/10292_7.txt',
 'Data/aclImdb/train/pos/10293_8.txt',
 'Data/aclImdb/train/pos/10294_8.txt',
 'Data/aclImdb/train/pos/10295_7.txt',
 'Data/aclImdb/train/pos/10296_8.txt',
 'Data/aclImdb/train/pos/10297_8.txt',
 'Data/aclImdb/train/pos/10298_9.txt',
 'Data/aclImdb/train/pos/10299_9.txt',
 'Data/aclImdb/train/pos/1029_9.txt',
 'Data/aclImdb/train/pos/102_10.txt',
 'Data/aclImdb/train/pos/10300_10.txt',
 'Data/aclImdb/train/pos/10301_8.txt',
 'Data/aclImdb/train/pos/10302_9.txt',
 'Data/aclImdb/train/pos/10303_7.txt',
 'Data/aclImdb/train/pos/10304_7.txt',
 'Data/aclImdb/train/pos/10305_8.txt',
 'Data/aclImdb/train/pos/10306_8.txt',
 'Data/aclImdb/train/pos/10307_8.txt',
 'Data/aclImdb/train/pos/10308_8.txt',
 'Data/aclImdb/train/pos/10309_7.txt',
 'Data/aclImdb/train/pos/1030_10.txt',
 'Data/aclImdb/train/pos/10310_9.txt',
 'Data/aclImdb/train/pos/10311_9.txt',
 'Data/aclImdb/train/pos/10312_10.txt',
 'Data/aclImdb/train/pos/10313_7.txt',
 'Data/aclImdb/train/pos/10314_8.txt',
 'Data/aclImdb/train/pos/10315_8.txt',
 'Data/aclImdb/train/pos/10316_8.txt',
 'Data/aclImdb/train/pos/10317_7.txt',
 'Data/aclImdb/train/pos/10318_7.txt',
 'Data/aclImdb/train/pos/10319_7.txt',
 'Data/aclImdb/train/pos/1031_10.txt',
 'Data/aclImdb/train/pos/10320_7.txt',
 'Data/aclImdb/train/pos/10321_10.txt',
 'Data/aclImdb/train/pos/10322_7.txt',
 'Data/aclImdb/train/pos/10323_10.txt',
 'Data/aclImdb/train/pos/10324_9.txt',
 'Data/aclImdb/train/pos/10325_10.txt',
 'Data/aclImdb/train/pos/10326_10.txt',
 'Data/aclImdb/train/pos/10327_7.txt',
 'Data/aclImdb/train/pos/10328_8.txt',
 'Data/aclImdb/train/pos/10329_8.txt',
 'Data/aclImdb/train/pos/1032_7.txt',
 'Data/aclImdb/train/pos/10330_8.txt',
 'Data/aclImdb/train/pos/10331_10.txt',
 'Data/aclImdb/train/pos/10332_8.txt',
 'Data/aclImdb/train/pos/10333_8.txt',
 'Data/aclImdb/train/pos/10334_8.txt',
 'Data/aclImdb/train/pos/10335_8.txt',
 'Data/aclImdb/train/pos/10336_8.txt',
 'Data/aclImdb/train/pos/10337_9.txt',
 'Data/aclImdb/train/pos/10338_9.txt',
 'Data/aclImdb/train/pos/10339_7.txt',
 'Data/aclImdb/train/pos/1033_10.txt',
 'Data/aclImdb/train/pos/10340_9.txt',
 'Data/aclImdb/train/pos/10341_7.txt',
 'Data/aclImdb/train/pos/10342_7.txt',
 'Data/aclImdb/train/pos/10343_7.txt',
 'Data/aclImdb/train/pos/10344_7.txt',
 'Data/aclImdb/train/pos/10345_7.txt',
 'Data/aclImdb/train/pos/10346_9.txt',
 'Data/aclImdb/train/pos/10347_9.txt',
 'Data/aclImdb/train/pos/10348_8.txt',
 'Data/aclImdb/train/pos/10349_10.txt',
 'Data/aclImdb/train/pos/1034_7.txt',
 'Data/aclImdb/train/pos/10350_10.txt',
 'Data/aclImdb/train/pos/10351_8.txt',
 'Data/aclImdb/train/pos/10352_10.txt',
 'Data/aclImdb/train/pos/10353_9.txt',
 'Data/aclImdb/train/pos/10354_9.txt',
 'Data/aclImdb/train/pos/10355_9.txt',
 'Data/aclImdb/train/pos/10356_9.txt',
 'Data/aclImdb/train/pos/10357_8.txt',
 'Data/aclImdb/train/pos/10358_9.txt',
 'Data/aclImdb/train/pos/10359_7.txt',
 'Data/aclImdb/train/pos/1035_7.txt',
 'Data/aclImdb/train/pos/10360_8.txt',
 'Data/aclImdb/train/pos/10361_7.txt',
 'Data/aclImdb/train/pos/10362_8.txt',
 'Data/aclImdb/train/pos/10363_9.txt',
 'Data/aclImdb/train/pos/10364_10.txt',
 'Data/aclImdb/train/pos/10365_8.txt',
 'Data/aclImdb/train/pos/10366_10.txt',
 'Data/aclImdb/train/pos/10367_8.txt',
 'Data/aclImdb/train/pos/10368_7.txt',
 'Data/aclImdb/train/pos/10369_8.txt',
 'Data/aclImdb/train/pos/1036_9.txt',
 'Data/aclImdb/train/pos/10370_9.txt',
 'Data/aclImdb/train/pos/10371_8.txt',
 'Data/aclImdb/train/pos/10372_7.txt',
 'Data/aclImdb/train/pos/10373_7.txt',
 'Data/aclImdb/train/pos/10374_8.txt',
 'Data/aclImdb/train/pos/10375_10.txt',
 'Data/aclImdb/train/pos/10376_7.txt',
 'Data/aclImdb/train/pos/10377_9.txt',
 'Data/aclImdb/train/pos/10378_8.txt',
 'Data/aclImdb/train/pos/10379_10.txt',
 'Data/aclImdb/train/pos/1037_8.txt',
 'Data/aclImdb/train/pos/10380_10.txt',
 'Data/aclImdb/train/pos/10381_10.txt',
 'Data/aclImdb/train/pos/10382_10.txt',
 'Data/aclImdb/train/pos/10383_10.txt',
 'Data/aclImdb/train/pos/10384_10.txt',
 'Data/aclImdb/train/pos/10385_10.txt',
 'Data/aclImdb/train/pos/10386_8.txt',
 'Data/aclImdb/train/pos/10387_7.txt',
 'Data/aclImdb/train/pos/10388_7.txt',
 'Data/aclImdb/train/pos/10389_10.txt',
 'Data/aclImdb/train/pos/1038_7.txt',
 'Data/aclImdb/train/pos/10390_10.txt',
 'Data/aclImdb/train/pos/10391_10.txt',
 'Data/aclImdb/train/pos/10392_10.txt',
 'Data/aclImdb/train/pos/10393_9.txt',
 'Data/aclImdb/train/pos/10394_10.txt',
 'Data/aclImdb/train/pos/10395_8.txt',
 'Data/aclImdb/train/pos/10396_8.txt',
 'Data/aclImdb/train/pos/10397_8.txt',
 'Data/aclImdb/train/pos/10398_8.txt',
 'Data/aclImdb/train/pos/10399_10.txt',
 'Data/aclImdb/train/pos/1039_9.txt',
 'Data/aclImdb/train/pos/103_7.txt',
 'Data/aclImdb/train/pos/10400_10.txt',
 'Data/aclImdb/train/pos/10401_10.txt',
 'Data/aclImdb/train/pos/10402_10.txt',
 'Data/aclImdb/train/pos/10403_7.txt',
 'Data/aclImdb/train/pos/10404_9.txt',
 'Data/aclImdb/train/pos/10405_8.txt',
 'Data/aclImdb/train/pos/10406_10.txt',
 'Data/aclImdb/train/pos/10407_8.txt',
 'Data/aclImdb/train/pos/10408_10.txt',
 'Data/aclImdb/train/pos/10409_10.txt',
 'Data/aclImdb/train/pos/1040_10.txt',
 'Data/aclImdb/train/pos/10410_10.txt',
 'Data/aclImdb/train/pos/10411_9.txt',
 'Data/aclImdb/train/pos/10412_8.txt',
 'Data/aclImdb/train/pos/10413_10.txt',
 'Data/aclImdb/train/pos/10414_10.txt',
 'Data/aclImdb/train/pos/10415_7.txt',
 'Data/aclImdb/train/pos/10416_9.txt',
 'Data/aclImdb/train/pos/10417_8.txt',
 'Data/aclImdb/train/pos/10418_9.txt',
 'Data/aclImdb/train/pos/10419_10.txt',
 'Data/aclImdb/train/pos/1041_9.txt',
 'Data/aclImdb/train/pos/10420_10.txt',
 'Data/aclImdb/train/pos/10421_7.txt',
 'Data/aclImdb/train/pos/10422_7.txt',
 'Data/aclImdb/train/pos/10423_9.txt',
 'Data/aclImdb/train/pos/10424_9.txt',
 'Data/aclImdb/train/pos/10425_9.txt',
 'Data/aclImdb/train/pos/10426_9.txt',
 'Data/aclImdb/train/pos/10427_8.txt',
 'Data/aclImdb/train/pos/10428_10.txt',
 'Data/aclImdb/train/pos/10429_10.txt',
 'Data/aclImdb/train/pos/1042_10.txt',
 'Data/aclImdb/train/pos/10430_9.txt',
 'Data/aclImdb/train/pos/10431_10.txt',
 'Data/aclImdb/train/pos/10432_10.txt',
 'Data/aclImdb/train/pos/10433_9.txt',
 'Data/aclImdb/train/pos/10434_10.txt',
 'Data/aclImdb/train/pos/10435_7.txt',
 'Data/aclImdb/train/pos/10436_8.txt',
 'Data/aclImdb/train/pos/10437_7.txt',
 'Data/aclImdb/train/pos/10438_9.txt',
 'Data/aclImdb/train/pos/10439_8.txt',
 'Data/aclImdb/train/pos/1043_10.txt',
 'Data/aclImdb/train/pos/10440_9.txt',
 'Data/aclImdb/train/pos/10441_10.txt',
 'Data/aclImdb/train/pos/10442_10.txt',
 'Data/aclImdb/train/pos/10443_9.txt',
 'Data/aclImdb/train/pos/10444_9.txt',
 'Data/aclImdb/train/pos/10445_10.txt',
 'Data/aclImdb/train/pos/10446_10.txt',
 'Data/aclImdb/train/pos/10447_10.txt',
 'Data/aclImdb/train/pos/10448_10.txt',
 'Data/aclImdb/train/pos/10449_9.txt',
 'Data/aclImdb/train/pos/1044_8.txt',
 'Data/aclImdb/train/pos/10450_10.txt',
 'Data/aclImdb/train/pos/10451_10.txt',
 'Data/aclImdb/train/pos/10452_10.txt',
 'Data/aclImdb/train/pos/10453_10.txt',
 'Data/aclImdb/train/pos/10454_9.txt',
 'Data/aclImdb/train/pos/10455_10.txt',
 'Data/aclImdb/train/pos/10456_10.txt',
 'Data/aclImdb/train/pos/10457_8.txt',
 'Data/aclImdb/train/pos/10458_10.txt',
 'Data/aclImdb/train/pos/10459_9.txt',
 'Data/aclImdb/train/pos/1045_8.txt',
 'Data/aclImdb/train/pos/10460_10.txt',
 'Data/aclImdb/train/pos/10461_9.txt',
 'Data/aclImdb/train/pos/10462_7.txt',
 'Data/aclImdb/train/pos/10463_10.txt',
 'Data/aclImdb/train/pos/10464_7.txt',
 'Data/aclImdb/train/pos/10465_8.txt',
 'Data/aclImdb/train/pos/10466_8.txt',
 'Data/aclImdb/train/pos/10467_10.txt',
 'Data/aclImdb/train/pos/10468_9.txt',
 'Data/aclImdb/train/pos/10469_10.txt',
 'Data/aclImdb/train/pos/1046_10.txt',
 'Data/aclImdb/train/pos/10470_9.txt',
 'Data/aclImdb/train/pos/10471_10.txt',
 'Data/aclImdb/train/pos/10472_7.txt',
 'Data/aclImdb/train/pos/10473_10.txt',
 'Data/aclImdb/train/pos/10474_9.txt',
 'Data/aclImdb/train/pos/10475_8.txt',
 'Data/aclImdb/train/pos/10476_9.txt',
 'Data/aclImdb/train/pos/10477_9.txt',
 'Data/aclImdb/train/pos/10478_8.txt',
 'Data/aclImdb/train/pos/10479_10.txt',
 'Data/aclImdb/train/pos/1047_8.txt',
 'Data/aclImdb/train/pos/10480_10.txt',
 'Data/aclImdb/train/pos/10481_8.txt',
 'Data/aclImdb/train/pos/10482_10.txt',
 'Data/aclImdb/train/pos/10483_8.txt',
 'Data/aclImdb/train/pos/10484_8.txt',
 'Data/aclImdb/train/pos/10485_8.txt',
 'Data/aclImdb/train/pos/10486_7.txt',
 'Data/aclImdb/train/pos/10487_7.txt',
 'Data/aclImdb/train/pos/10488_10.txt',
 'Data/aclImdb/train/pos/10489_10.txt',
 'Data/aclImdb/train/pos/1048_8.txt',
 'Data/aclImdb/train/pos/10490_7.txt',
 'Data/aclImdb/train/pos/10491_7.txt',
 'Data/aclImdb/train/pos/10492_10.txt',
 'Data/aclImdb/train/pos/10493_9.txt',
 'Data/aclImdb/train/pos/10494_10.txt',
 'Data/aclImdb/train/pos/10495_7.txt',
 'Data/aclImdb/train/pos/10496_10.txt',
 'Data/aclImdb/train/pos/10497_8.txt',
 'Data/aclImdb/train/pos/10498_10.txt',
 'Data/aclImdb/train/pos/10499_10.txt',
 'Data/aclImdb/train/pos/1049_7.txt',
 'Data/aclImdb/train/pos/104_10.txt',
 'Data/aclImdb/train/pos/10500_10.txt',
 'Data/aclImdb/train/pos/10501_10.txt',
 'Data/aclImdb/train/pos/10502_9.txt',
 'Data/aclImdb/train/pos/10503_10.txt',
 'Data/aclImdb/train/pos/10504_9.txt',
 'Data/aclImdb/train/pos/10505_10.txt',
 'Data/aclImdb/train/pos/10506_10.txt',
 'Data/aclImdb/train/pos/10507_10.txt',
 'Data/aclImdb/train/pos/10508_10.txt',
 'Data/aclImdb/train/pos/10509_7.txt',
 'Data/aclImdb/train/pos/1050_9.txt',
 'Data/aclImdb/train/pos/10510_7.txt',
 'Data/aclImdb/train/pos/10511_7.txt',
 'Data/aclImdb/train/pos/10512_10.txt',
 'Data/aclImdb/train/pos/10513_7.txt',
 'Data/aclImdb/train/pos/10514_8.txt',
 'Data/aclImdb/train/pos/10515_9.txt',
 'Data/aclImdb/train/pos/10516_7.txt',
 'Data/aclImdb/train/pos/10517_8.txt',
 'Data/aclImdb/train/pos/10518_9.txt',
 'Data/aclImdb/train/pos/10519_9.txt',
 'Data/aclImdb/train/pos/1051_9.txt',
 'Data/aclImdb/train/pos/10520_9.txt',
 'Data/aclImdb/train/pos/10521_9.txt',
 'Data/aclImdb/train/pos/10522_7.txt',
 'Data/aclImdb/train/pos/10523_9.txt',
 'Data/aclImdb/train/pos/10524_10.txt',
 'Data/aclImdb/train/pos/10525_10.txt',
 'Data/aclImdb/train/pos/10526_9.txt',
 'Data/aclImdb/train/pos/10527_10.txt',
 'Data/aclImdb/train/pos/10528_10.txt',
 'Data/aclImdb/train/pos/10529_10.txt',
 'Data/aclImdb/train/pos/1052_8.txt',
 'Data/aclImdb/train/pos/10530_10.txt',
 'Data/aclImdb/train/pos/10531_10.txt',
 'Data/aclImdb/train/pos/10532_8.txt',
 'Data/aclImdb/train/pos/10533_10.txt',
 'Data/aclImdb/train/pos/10534_7.txt',
 'Data/aclImdb/train/pos/10535_7.txt',
 'Data/aclImdb/train/pos/10536_10.txt',
 'Data/aclImdb/train/pos/10537_10.txt',
 'Data/aclImdb/train/pos/10538_8.txt',
 'Data/aclImdb/train/pos/10539_10.txt',
 'Data/aclImdb/train/pos/1053_8.txt',
 'Data/aclImdb/train/pos/10540_10.txt',
 'Data/aclImdb/train/pos/10541_10.txt',
 'Data/aclImdb/train/pos/10542_7.txt',
 'Data/aclImdb/train/pos/10543_8.txt',
 'Data/aclImdb/train/pos/10544_8.txt',
 'Data/aclImdb/train/pos/10545_7.txt',
 'Data/aclImdb/train/pos/10546_9.txt',
 'Data/aclImdb/train/pos/10547_9.txt',
 'Data/aclImdb/train/pos/10548_7.txt',
 'Data/aclImdb/train/pos/10549_9.txt',
 'Data/aclImdb/train/pos/1054_8.txt',
 'Data/aclImdb/train/pos/10550_8.txt',
 'Data/aclImdb/train/pos/10551_7.txt',
 'Data/aclImdb/train/pos/10552_9.txt',
 'Data/aclImdb/train/pos/10553_8.txt',
 'Data/aclImdb/train/pos/10554_7.txt',
 'Data/aclImdb/train/pos/10555_8.txt',
 'Data/aclImdb/train/pos/10556_10.txt',
 'Data/aclImdb/train/pos/10557_9.txt',
 'Data/aclImdb/train/pos/10558_10.txt',
 'Data/aclImdb/train/pos/10559_8.txt',
 'Data/aclImdb/train/pos/1055_10.txt',
 'Data/aclImdb/train/pos/10560_9.txt',
 'Data/aclImdb/train/pos/10561_8.txt',
 'Data/aclImdb/train/pos/10562_9.txt',
 'Data/aclImdb/train/pos/10563_7.txt',
 'Data/aclImdb/train/pos/10564_10.txt',
 'Data/aclImdb/train/pos/10565_9.txt',
 'Data/aclImdb/train/pos/10566_8.txt',
 'Data/aclImdb/train/pos/10567_9.txt',
 'Data/aclImdb/train/pos/10568_10.txt',
 'Data/aclImdb/train/pos/10569_10.txt',
 'Data/aclImdb/train/pos/1056_10.txt',
 'Data/aclImdb/train/pos/10570_8.txt',
 'Data/aclImdb/train/pos/10571_8.txt',
 'Data/aclImdb/train/pos/10572_8.txt',
 'Data/aclImdb/train/pos/10573_10.txt',
 'Data/aclImdb/train/pos/10574_10.txt',
 'Data/aclImdb/train/pos/10575_9.txt',
 'Data/aclImdb/train/pos/10576_7.txt',
 'Data/aclImdb/train/pos/10577_10.txt',
 'Data/aclImdb/train/pos/10578_7.txt',
 'Data/aclImdb/train/pos/10579_10.txt',
 'Data/aclImdb/train/pos/1057_9.txt',
 'Data/aclImdb/train/pos/10580_8.txt',
 'Data/aclImdb/train/pos/10581_10.txt',
 'Data/aclImdb/train/pos/10582_10.txt',
 'Data/aclImdb/train/pos/10583_10.txt',
 'Data/aclImdb/train/pos/10584_10.txt',
 'Data/aclImdb/train/pos/10585_9.txt',
 'Data/aclImdb/train/pos/10586_10.txt',
 'Data/aclImdb/train/pos/10587_8.txt',
 'Data/aclImdb/train/pos/10588_10.txt',
 'Data/aclImdb/train/pos/10589_10.txt',
 'Data/aclImdb/train/pos/1058_10.txt',
 'Data/aclImdb/train/pos/10590_8.txt',
 'Data/aclImdb/train/pos/10591_10.txt',
 'Data/aclImdb/train/pos/10592_8.txt',
 'Data/aclImdb/train/pos/10593_8.txt',
 'Data/aclImdb/train/pos/10594_8.txt',
 'Data/aclImdb/train/pos/10595_10.txt',
 'Data/aclImdb/train/pos/10596_8.txt',
 'Data/aclImdb/train/pos/10597_9.txt',
 'Data/aclImdb/train/pos/10598_8.txt',
 'Data/aclImdb/train/pos/10599_8.txt',
 'Data/aclImdb/train/pos/1059_10.txt',
 'Data/aclImdb/train/pos/105_7.txt',
 'Data/aclImdb/train/pos/10600_9.txt',
 'Data/aclImdb/train/pos/10601_10.txt',
 'Data/aclImdb/train/pos/10602_10.txt',
 'Data/aclImdb/train/pos/10603_10.txt',
 'Data/aclImdb/train/pos/10604_7.txt',
 'Data/aclImdb/train/pos/10605_7.txt',
 'Data/aclImdb/train/pos/10606_10.txt',
 'Data/aclImdb/train/pos/10607_10.txt',
 'Data/aclImdb/train/pos/10608_10.txt',
 'Data/aclImdb/train/pos/10609_10.txt',
 'Data/aclImdb/train/pos/1060_10.txt',
 'Data/aclImdb/train/pos/10610_8.txt',
 'Data/aclImdb/train/pos/10611_8.txt',
 'Data/aclImdb/train/pos/10612_7.txt',
 'Data/aclImdb/train/pos/10613_8.txt',
 'Data/aclImdb/train/pos/10614_7.txt',
 'Data/aclImdb/train/pos/10615_8.txt',
 'Data/aclImdb/train/pos/10616_7.txt',
 'Data/aclImdb/train/pos/10617_8.txt',
 'Data/aclImdb/train/pos/10618_8.txt',
 'Data/aclImdb/train/pos/10619_8.txt',
 'Data/aclImdb/train/pos/1061_10.txt',
 'Data/aclImdb/train/pos/10620_10.txt',
 'Data/aclImdb/train/pos/10621_10.txt',
 'Data/aclImdb/train/pos/10622_10.txt',
 'Data/aclImdb/train/pos/10623_8.txt',
 'Data/aclImdb/train/pos/10624_7.txt',
 'Data/aclImdb/train/pos/10625_7.txt',
 'Data/aclImdb/train/pos/10626_7.txt',
 'Data/aclImdb/train/pos/10627_10.txt',
 'Data/aclImdb/train/pos/10628_7.txt',
 'Data/aclImdb/train/pos/10629_10.txt',
 'Data/aclImdb/train/pos/1062_10.txt',
 'Data/aclImdb/train/pos/10630_8.txt',
 'Data/aclImdb/train/pos/10631_8.txt',
 'Data/aclImdb/train/pos/10632_10.txt',
 'Data/aclImdb/train/pos/10633_9.txt',
 'Data/aclImdb/train/pos/10634_10.txt',
 'Data/aclImdb/train/pos/10635_8.txt',
 'Data/aclImdb/train/pos/10636_8.txt',
 'Data/aclImdb/train/pos/10637_10.txt',
 'Data/aclImdb/train/pos/10638_8.txt',
 'Data/aclImdb/train/pos/10639_7.txt',
 'Data/aclImdb/train/pos/1063_10.txt',
 'Data/aclImdb/train/pos/10640_8.txt',
 'Data/aclImdb/train/pos/10641_7.txt',
 'Data/aclImdb/train/pos/10642_8.txt',
 'Data/aclImdb/train/pos/10643_8.txt',
 'Data/aclImdb/train/pos/10644_8.txt',
 'Data/aclImdb/train/pos/10645_8.txt',
 'Data/aclImdb/train/pos/10646_8.txt',
 'Data/aclImdb/train/pos/10647_8.txt',
 'Data/aclImdb/train/pos/10648_8.txt',
 'Data/aclImdb/train/pos/10649_7.txt',
 'Data/aclImdb/train/pos/1064_10.txt',
 'Data/aclImdb/train/pos/10650_8.txt',
 'Data/aclImdb/train/pos/10651_7.txt',
 'Data/aclImdb/train/pos/10652_9.txt',
 'Data/aclImdb/train/pos/10653_10.txt',
 'Data/aclImdb/train/pos/10654_7.txt',
 'Data/aclImdb/train/pos/10655_9.txt',
 'Data/aclImdb/train/pos/10656_7.txt',
 'Data/aclImdb/train/pos/10657_8.txt',
 'Data/aclImdb/train/pos/10658_10.txt',
 'Data/aclImdb/train/pos/10659_8.txt',
 'Data/aclImdb/train/pos/1065_10.txt',
 'Data/aclImdb/train/pos/10660_10.txt',
 'Data/aclImdb/train/pos/10661_9.txt',
 'Data/aclImdb/train/pos/10662_7.txt',
 'Data/aclImdb/train/pos/10663_8.txt',
 'Data/aclImdb/train/pos/10664_8.txt',
 'Data/aclImdb/train/pos/10665_8.txt',
 'Data/aclImdb/train/pos/10666_8.txt',
 'Data/aclImdb/train/pos/10667_8.txt',
 'Data/aclImdb/train/pos/10668_7.txt',
 'Data/aclImdb/train/pos/10669_10.txt',
 'Data/aclImdb/train/pos/1066_10.txt',
 'Data/aclImdb/train/pos/10670_10.txt',
 'Data/aclImdb/train/pos/10671_10.txt',
 'Data/aclImdb/train/pos/10672_9.txt',
 'Data/aclImdb/train/pos/10673_10.txt',
 'Data/aclImdb/train/pos/10674_8.txt',
 'Data/aclImdb/train/pos/10675_8.txt',
 'Data/aclImdb/train/pos/10676_9.txt',
 'Data/aclImdb/train/pos/10677_8.txt',
 'Data/aclImdb/train/pos/10678_9.txt',
 'Data/aclImdb/train/pos/10679_10.txt',
 'Data/aclImdb/train/pos/1067_7.txt',
 'Data/aclImdb/train/pos/10680_8.txt',
 'Data/aclImdb/train/pos/10681_10.txt',
 'Data/aclImdb/train/pos/10682_10.txt',
 'Data/aclImdb/train/pos/10683_7.txt',
 'Data/aclImdb/train/pos/10684_9.txt',
 'Data/aclImdb/train/pos/10685_7.txt',
 'Data/aclImdb/train/pos/10686_8.txt',
 'Data/aclImdb/train/pos/10687_10.txt',
 'Data/aclImdb/train/pos/10688_9.txt',
 'Data/aclImdb/train/pos/10689_8.txt',
 'Data/aclImdb/train/pos/1068_10.txt',
 'Data/aclImdb/train/pos/10690_10.txt',
 'Data/aclImdb/train/pos/10691_7.txt',
 'Data/aclImdb/train/pos/10692_8.txt',
 'Data/aclImdb/train/pos/10693_8.txt',
 'Data/aclImdb/train/pos/10694_7.txt',
 'Data/aclImdb/train/pos/10695_8.txt',
 'Data/aclImdb/train/pos/10696_7.txt',
 'Data/aclImdb/train/pos/10697_8.txt',
 'Data/aclImdb/train/pos/10698_9.txt',
 'Data/aclImdb/train/pos/10699_9.txt',
 'Data/aclImdb/train/pos/1069_10.txt',
 'Data/aclImdb/train/pos/106_10.txt',
 'Data/aclImdb/train/pos/10700_8.txt',
 'Data/aclImdb/train/pos/10701_10.txt',
 'Data/aclImdb/train/pos/10702_10.txt',
 'Data/aclImdb/train/pos/10703_7.txt',
 'Data/aclImdb/train/pos/10704_10.txt',
 'Data/aclImdb/train/pos/10705_7.txt',
 'Data/aclImdb/train/pos/10706_7.txt',
 'Data/aclImdb/train/pos/10707_8.txt',
 'Data/aclImdb/train/pos/10708_8.txt',
 'Data/aclImdb/train/pos/10709_10.txt',
 'Data/aclImdb/train/pos/1070_8.txt',
 'Data/aclImdb/train/pos/10710_9.txt',
 'Data/aclImdb/train/pos/10711_10.txt',
 'Data/aclImdb/train/pos/10712_8.txt',
 'Data/aclImdb/train/pos/10713_9.txt',
 'Data/aclImdb/train/pos/10714_8.txt',
 'Data/aclImdb/train/pos/10715_8.txt',
 'Data/aclImdb/train/pos/10716_7.txt',
 'Data/aclImdb/train/pos/10717_10.txt',
 'Data/aclImdb/train/pos/10718_10.txt',
 'Data/aclImdb/train/pos/10719_10.txt',
 'Data/aclImdb/train/pos/1071_8.txt',
 'Data/aclImdb/train/pos/10720_9.txt',
 'Data/aclImdb/train/pos/10721_9.txt',
 'Data/aclImdb/train/pos/10722_10.txt',
 'Data/aclImdb/train/pos/10723_8.txt',
 'Data/aclImdb/train/pos/10724_8.txt',
 'Data/aclImdb/train/pos/10725_9.txt',
 'Data/aclImdb/train/pos/10726_7.txt',
 'Data/aclImdb/train/pos/10727_7.txt',
 'Data/aclImdb/train/pos/10728_10.txt',
 'Data/aclImdb/train/pos/10729_8.txt',
 'Data/aclImdb/train/pos/1072_10.txt',
 'Data/aclImdb/train/pos/10730_10.txt',
 'Data/aclImdb/train/pos/10731_7.txt',
 'Data/aclImdb/train/pos/10732_8.txt',
 'Data/aclImdb/train/pos/10733_7.txt',
 'Data/aclImdb/train/pos/10734_10.txt',
 'Data/aclImdb/train/pos/10735_10.txt',
 'Data/aclImdb/train/pos/10736_10.txt',
 'Data/aclImdb/train/pos/10737_10.txt',
 'Data/aclImdb/train/pos/10738_9.txt',
 'Data/aclImdb/train/pos/10739_10.txt',
 'Data/aclImdb/train/pos/1073_9.txt',
 'Data/aclImdb/train/pos/10740_8.txt',
 'Data/aclImdb/train/pos/10741_10.txt',
 'Data/aclImdb/train/pos/10742_9.txt',
 'Data/aclImdb/train/pos/10743_9.txt',
 'Data/aclImdb/train/pos/10744_8.txt',
 'Data/aclImdb/train/pos/10745_10.txt',
 'Data/aclImdb/train/pos/10746_10.txt',
 'Data/aclImdb/train/pos/10747_10.txt',
 'Data/aclImdb/train/pos/10748_10.txt',
 'Data/aclImdb/train/pos/10749_8.txt',
 'Data/aclImdb/train/pos/1074_10.txt',
 'Data/aclImdb/train/pos/10750_8.txt',
 'Data/aclImdb/train/pos/10751_10.txt',
 'Data/aclImdb/train/pos/10752_10.txt',
 'Data/aclImdb/train/pos/10753_10.txt',
 'Data/aclImdb/train/pos/10754_10.txt',
 'Data/aclImdb/train/pos/10755_10.txt',
 'Data/aclImdb/train/pos/10756_8.txt',
 'Data/aclImdb/train/pos/10757_10.txt',
 'Data/aclImdb/train/pos/10758_8.txt',
 'Data/aclImdb/train/pos/10759_9.txt',
 'Data/aclImdb/train/pos/1075_10.txt',
 'Data/aclImdb/train/pos/10760_8.txt',
 'Data/aclImdb/train/pos/10761_10.txt',
 'Data/aclImdb/train/pos/10762_10.txt',
 'Data/aclImdb/train/pos/10763_8.txt',
 'Data/aclImdb/train/pos/10764_9.txt',
 'Data/aclImdb/train/pos/10765_10.txt',
 'Data/aclImdb/train/pos/10766_7.txt',
 'Data/aclImdb/train/pos/10767_10.txt',
 'Data/aclImdb/train/pos/10768_7.txt',
 'Data/aclImdb/train/pos/10769_10.txt',
 'Data/aclImdb/train/pos/1076_8.txt',
 'Data/aclImdb/train/pos/10770_7.txt',
 'Data/aclImdb/train/pos/10771_10.txt',
 'Data/aclImdb/train/pos/10772_10.txt',
 'Data/aclImdb/train/pos/10773_9.txt',
 'Data/aclImdb/train/pos/10774_8.txt',
 'Data/aclImdb/train/pos/10775_8.txt',
 'Data/aclImdb/train/pos/10776_8.txt',
 'Data/aclImdb/train/pos/10777_9.txt',
 'Data/aclImdb/train/pos/10778_8.txt',
 'Data/aclImdb/train/pos/10779_10.txt',
 'Data/aclImdb/train/pos/1077_8.txt',
 'Data/aclImdb/train/pos/10780_10.txt',
 'Data/aclImdb/train/pos/10781_10.txt',
 'Data/aclImdb/train/pos/10782_7.txt',
 'Data/aclImdb/train/pos/10783_10.txt',
 'Data/aclImdb/train/pos/10784_10.txt',
 'Data/aclImdb/train/pos/10785_10.txt',
 'Data/aclImdb/train/pos/10786_10.txt',
 'Data/aclImdb/train/pos/10787_10.txt',
 'Data/aclImdb/train/pos/10788_10.txt',
 'Data/aclImdb/train/pos/10789_10.txt',
 'Data/aclImdb/train/pos/1078_8.txt',
 'Data/aclImdb/train/pos/10790_8.txt',
 'Data/aclImdb/train/pos/10791_9.txt',
 'Data/aclImdb/train/pos/10792_9.txt',
 'Data/aclImdb/train/pos/10793_10.txt',
 'Data/aclImdb/train/pos/10794_10.txt',
 'Data/aclImdb/train/pos/10795_7.txt',
 'Data/aclImdb/train/pos/10796_9.txt',
 'Data/aclImdb/train/pos/10797_8.txt',
 'Data/aclImdb/train/pos/10798_8.txt',
 'Data/aclImdb/train/pos/10799_7.txt',
 'Data/aclImdb/train/pos/1079_7.txt',
 'Data/aclImdb/train/pos/107_10.txt',
 'Data/aclImdb/train/pos/10800_8.txt',
 'Data/aclImdb/train/pos/10801_8.txt',
 'Data/aclImdb/train/pos/10802_8.txt',
 'Data/aclImdb/train/pos/10803_8.txt',
 'Data/aclImdb/train/pos/10804_10.txt',
 'Data/aclImdb/train/pos/10805_10.txt',
 'Data/aclImdb/train/pos/10806_9.txt',
 'Data/aclImdb/train/pos/10807_9.txt',
 'Data/aclImdb/train/pos/10808_10.txt',
 'Data/aclImdb/train/pos/10809_10.txt',
 'Data/aclImdb/train/pos/1080_9.txt',
 'Data/aclImdb/train/pos/10810_8.txt',
 'Data/aclImdb/train/pos/10811_7.txt',
 'Data/aclImdb/train/pos/10812_8.txt',
 'Data/aclImdb/train/pos/10813_10.txt',
 'Data/aclImdb/train/pos/10814_7.txt',
 'Data/aclImdb/train/pos/10815_10.txt',
 'Data/aclImdb/train/pos/10816_10.txt',
 'Data/aclImdb/train/pos/10817_10.txt',
 'Data/aclImdb/train/pos/10818_10.txt',
 'Data/aclImdb/train/pos/10819_10.txt',
 'Data/aclImdb/train/pos/1081_10.txt',
 'Data/aclImdb/train/pos/10820_10.txt',
 'Data/aclImdb/train/pos/10821_8.txt',
 'Data/aclImdb/train/pos/10822_10.txt',
 'Data/aclImdb/train/pos/10823_8.txt',
 'Data/aclImdb/train/pos/10824_10.txt',
 'Data/aclImdb/train/pos/10825_9.txt',
 'Data/aclImdb/train/pos/10826_10.txt',
 'Data/aclImdb/train/pos/10827_10.txt',
 'Data/aclImdb/train/pos/10828_10.txt',
 'Data/aclImdb/train/pos/10829_10.txt',
 'Data/aclImdb/train/pos/1082_10.txt',
 'Data/aclImdb/train/pos/10830_10.txt',
 'Data/aclImdb/train/pos/10831_7.txt',
 'Data/aclImdb/train/pos/10832_10.txt',
 'Data/aclImdb/train/pos/10833_10.txt',
 'Data/aclImdb/train/pos/10834_7.txt',
 'Data/aclImdb/train/pos/10835_10.txt',
 'Data/aclImdb/train/pos/10836_10.txt',
 'Data/aclImdb/train/pos/10837_10.txt',
 'Data/aclImdb/train/pos/10838_10.txt',
 'Data/aclImdb/train/pos/10839_10.txt',
 'Data/aclImdb/train/pos/1083_10.txt',
 'Data/aclImdb/train/pos/10840_9.txt',
 'Data/aclImdb/train/pos/10841_10.txt',
 'Data/aclImdb/train/pos/10842_7.txt',
 'Data/aclImdb/train/pos/10843_7.txt',
 'Data/aclImdb/train/pos/10844_9.txt',
 'Data/aclImdb/train/pos/10845_10.txt',
 'Data/aclImdb/train/pos/10846_9.txt',
 'Data/aclImdb/train/pos/10847_10.txt',
 'Data/aclImdb/train/pos/10848_10.txt',
 'Data/aclImdb/train/pos/10849_10.txt',
 'Data/aclImdb/train/pos/1084_9.txt',
 'Data/aclImdb/train/pos/10850_10.txt',
 'Data/aclImdb/train/pos/10851_9.txt',
 'Data/aclImdb/train/pos/10852_10.txt',
 'Data/aclImdb/train/pos/10853_10.txt',
 'Data/aclImdb/train/pos/10854_10.txt',
 'Data/aclImdb/train/pos/10855_9.txt',
 'Data/aclImdb/train/pos/10856_8.txt',
 'Data/aclImdb/train/pos/10857_8.txt',
 'Data/aclImdb/train/pos/10858_8.txt',
 'Data/aclImdb/train/pos/10859_7.txt',
 'Data/aclImdb/train/pos/1085_7.txt',
 'Data/aclImdb/train/pos/10860_7.txt',
 'Data/aclImdb/train/pos/10861_7.txt',
 'Data/aclImdb/train/pos/10862_9.txt',
 'Data/aclImdb/train/pos/10863_8.txt',
 'Data/aclImdb/train/pos/10864_8.txt',
 'Data/aclImdb/train/pos/10865_7.txt',
 'Data/aclImdb/train/pos/10866_7.txt',
 'Data/aclImdb/train/pos/10867_7.txt',
 'Data/aclImdb/train/pos/10868_8.txt',
 'Data/aclImdb/train/pos/10869_7.txt',
 'Data/aclImdb/train/pos/1086_7.txt',
 'Data/aclImdb/train/pos/10870_8.txt',
 'Data/aclImdb/train/pos/10871_7.txt',
 'Data/aclImdb/train/pos/10872_7.txt',
 'Data/aclImdb/train/pos/10873_8.txt',
 'Data/aclImdb/train/pos/10874_10.txt',
 'Data/aclImdb/train/pos/10875_8.txt',
 'Data/aclImdb/train/pos/10876_7.txt',
 'Data/aclImdb/train/pos/10877_10.txt',
 'Data/aclImdb/train/pos/10878_7.txt',
 'Data/aclImdb/train/pos/10879_10.txt',
 'Data/aclImdb/train/pos/1087_10.txt',
 'Data/aclImdb/train/pos/10880_8.txt',
 'Data/aclImdb/train/pos/10881_7.txt',
 'Data/aclImdb/train/pos/10882_8.txt',
 'Data/aclImdb/train/pos/10883_7.txt',
 'Data/aclImdb/train/pos/10884_8.txt',
 'Data/aclImdb/train/pos/10885_7.txt',
 'Data/aclImdb/train/pos/10886_10.txt',
 'Data/aclImdb/train/pos/10887_7.txt',
 'Data/aclImdb/train/pos/10888_8.txt',
 'Data/aclImdb/train/pos/10889_10.txt',
 'Data/aclImdb/train/pos/1088_9.txt',
 'Data/aclImdb/train/pos/10890_9.txt',
 'Data/aclImdb/train/pos/10891_7.txt',
 'Data/aclImdb/train/pos/10892_7.txt',
 'Data/aclImdb/train/pos/10893_8.txt',
 'Data/aclImdb/train/pos/10894_8.txt',
 'Data/aclImdb/train/pos/10895_7.txt',
 'Data/aclImdb/train/pos/10896_8.txt',
 'Data/aclImdb/train/pos/10897_9.txt',
 'Data/aclImdb/train/pos/10898_7.txt',
 'Data/aclImdb/train/pos/10899_10.txt',
 'Data/aclImdb/train/pos/1089_10.txt',
 'Data/aclImdb/train/pos/108_10.txt',
 ...]

Let's read in a (random) sample negative review, for the movie Haunted Boat, stored in the file 3446_1.txt. The '_1' suffix in the file name tells us that this is a 1-star review. Do you concur?

In [20]:
# Read the whole review; the context manager closes the file afterwards.
with open('{}train/neg/3446_1.txt'.format(data_path)) as f:
    sample_text_neg = f.read()
print(sample_text_neg)
This film on paper looked like it could possibly be good, after watching though i realised that this film was completely terrible!! The plot has no meaning, and i think i counted the best part of 5000 cut scenes each one making the film more annoying boring and ridiculous. I watched this late night pitch black no noise at all just to add to the SCARINESS of it but the truth is the only thing that scared me was the music, what they would call tragic music, they play opera i mean be serious!! This film sums up all of what is not good about this type of film. To be honest ill say no more but watch at your own risk this film is just complete rubbish, ENJOY!!

Now we'll read in a positive review, for a movie called The Night Listener, stored in the file 10015_8.txt. The '_8' suffix marks it as an 8-star review. Can you spot the difference?

In [21]:
# Read the whole review; the context manager closes the file afterwards.
with open('{}train/pos/10015_8.txt'.format(data_path)) as f:
    sample_text_pos = f.read()
print(sample_text_pos)
Popular radio storyteller Gabriel No one(Robin Williams,scraggy and speaking in hushed,hypnotic tones) becomes acquainted and friends with a fourteen-year-old boy from Wisconsin named Pete Logand(Rory Culkin),who has written a book detailing sexual abuse from his parents. To boot,Pete has AIDS and this compels Gabriel further still,since his partner Jess(Bobby Cannavale,good)happens to be a survivor of HIV himself. <br /><br />He also acquaints himself with Pete's guardian,a woman named Donna(Toni Collette,brilliant!)and when Gabriel decides he wants to meet and talk to the two of them in person and goes to Wisconsin,he discovers some secrets he was(naturally)not prepared to find.<br /><br />Based on real events that happened to Armistead Maupin(who co-wrote the screenplay with Terry Anderson)and directed by Patrick Stetner,this film moves a lot faster(90 min.,maybe a few minutes longer)than one might think a movie of this genre would run. That's good in that it keeps the action and storyline lean and clear. It's bad in that it leaves various holes in the plot and doesn't sew-up any of the plot openings or back-story. I'd rather not go into any great detail except to say that,if you are not familiar with Mr.Maupin's works or his personal story,you feel a little bit out of the loop here. Still,the performances by Williams( I would've loved to heard more of his narration,personally),Collette,Cannavale,Culkin and much of the supporting cast(the Waitress at the restaurant Collete's Donna frequents does a great job with what small part she has!)are top-notch and the mood established here--namely,the chilly,lonely dark exteriors of Wisconsin and New York--give a terrific framing for this story. It may have ends that don't tie together particularly well,but it's still a compelling enough story to stick with.
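
The file-name convention used above (a review ID, an underscore, then the star rating) can also be decoded programmatically. The following is a minimal sketch; the helper name rating_from_path is hypothetical and not part of the original notebook, and the example paths simply follow the pattern seen in the file listing.

```python
import os

def rating_from_path(path):
    """Return (review_id, rating) parsed from a name like '3446_1.txt'."""
    # Drop the directories and the '.txt' extension, e.g. '3446_1'.
    stem = os.path.splitext(os.path.basename(path))[0]
    review_id, rating = stem.split('_')
    return int(review_id), int(rating)

print(rating_from_path('Data/aclImdb/train/neg/3446_1.txt'))   # (3446, 1)
print(rating_from_path('Data/aclImdb/train/pos/10015_8.txt'))  # (10015, 8)
```

Given the labeling rule described in the dataset overview, a rating of 4 or lower should only appear under neg/ and a rating of 7 or higher only under pos/.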

Back to top

BAG OF WORDS PROCESSING<a name=bow></a>

We're going to use a bag of words model, so let's explore how we could tokenize (that is, separate) the text into words (the tokens).

First, to split a review into sentences we can use the standard sent_tokenize function from NLTK. For instance, the following piece of code will extract the 4th sentence that the tokenizer recognizes (in Python, indexing starts with 0).

In [22]:
sample_sent = sent_tokenize(sample_text_neg)[3]
print(sample_sent)
This film sums up all of what is not good about this type of film.

We can also try the word_tokenize function to split into words, and do a stemming operation (finding the roots) to normalize word forms.

The following code will stem all the words in the sample_sent sentence from above.

In [23]:
sample_words = [ stemmer.stem(word)
                for word in word_tokenize(sample_sent) ]
print(sample_words)
['thi', 'film', 'sum', 'up', 'al', 'of', 'what', 'is', 'not', 'good', 'about', 'thi', 'typ', 'of', 'film', '.']

One serious problem with a bag of words approach, especially for sentiment analysis, is that the presence of negative/positive words does not imply negative/positive sentiment if the words are negated in the sentence (e.g. "not bad" actually means "good" even though in general an occurrence of "bad" means "bad").

NLTK includes the function mark_negation which takes a tokenized sentence and marks negated words with a '_NEG' suffix. Specifically, mark_negation marks all words that come after a negation word and before the next punctuation mark. Now 'good' becomes the word 'good_NEG' so a bag of words model can pick up on the context of the word.

In [24]:
mark_negation(sample_words)
Out[24]:
['thi',
 'film',
 'sum',
 'up',
 'al',
 'of',
 'what',
 'is',
 'not',
 'good_NEG',
 'about_NEG',
 'thi_NEG',
 'typ_NEG',
 'of_NEG',
 'film_NEG',
 '.']
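
The marking rule can be sketched in a few lines of plain Python. This is a simplified illustration, not NLTK's implementation: the function name, the short negation list and the punctuation set below are assumptions, and NLTK recognizes many more negation words.

```python
# Assumed, deliberately tiny word lists (NLTK's own lists are larger).
NEGATION_WORDS = {"not", "no", "never", "n't"}
CLAUSE_PUNCT = {".", ":", ";", "!", "?"}

def mark_negation_sketch(tokens):
    """Append '_NEG' to every token between a negation word and the
    next clause-level punctuation mark."""
    marked = []
    in_scope = False
    for tok in tokens:
        if tok in CLAUSE_PUNCT:
            in_scope = False          # punctuation closes the scope
            marked.append(tok)
        elif tok.lower() in NEGATION_WORDS:
            in_scope = True           # the negation word itself is not marked
            marked.append(tok)
        elif in_scope:
            marked.append(tok + "_NEG")
        else:
            marked.append(tok)
    return marked

stemmed = ['thi', 'film', 'sum', 'up', 'al', 'of', 'what', 'is', 'not',
           'good', 'about', 'thi', 'typ', 'of', 'film', '.']
print(mark_negation_sketch(stemmed))
```

On the stemmed sample sentence this reproduces the list shown in Out[24].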

Here's our complete tokenizer function.

  1. It tokenizes the text into sentences.
  2. For each tokenized sentence, it tokenizes it into words.
  3. It keeps only those tokens of length >= 2 (which also drops single-character punctuation such as '.' and ',').
  4. It stems the words to only retain the roots.
  5. It marks the words that fall within the scope of a negation.
In [25]:
def tokenizer(text):
    sents = sent_tokenize(text)
    tokens = []
    for sent in sents:
        words = word_tokenize(sent)
        words = [ word for word in words if len(word) >= 2 ]
        words = [ stemmer.stem(word) for word in words ]
        words = mark_negation(words)
        tokens += words
    return tokens

Now that we have a tokenizer, we can use standard feature extraction methods to get feature vectors for each document.

We'll use the scikit-learn module for the rest of the feature extraction and training; it contains the TfidfVectorizer class, which lets us supply a custom tokenizer and returns a TFIDF matrix.

Now we'll use this class to convert all of the training documents into feature vectors in Document-Term Matrix (DTM) format, using the tokenizer defined above (this step can take a few minutes to run).

In [26]:
vectorizer = TfidfVectorizer(min_df=1, tokenizer=tokenizer)
train_matrix = vectorizer.fit_transform(train_docs)
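
To make the TFIDF weights more concrete, here is a pure-Python sketch on a toy corpus. It assumes the smoothed idf formula, ln((1 + n) / (1 + df)) + 1, and the L2 row normalization that scikit-learn documents as the TfidfVectorizer defaults; the helper name and the toy documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf_rows(tokenized_docs):
    """Return one {term: weight} dict per document (toy helper)."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))     # count each term once per document
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1
           for term, df in doc_freq.items()}
    rows = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        weights = {term: tf * idf[term] for term, tf in term_freq.items()}
        # L2-normalize the row; 'or 1.0' guards against empty documents.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

toy_docs = [['good', 'film'], ['bad', 'film'], ['good', 'good', 'plot']]
toy_rows = tfidf_rows(toy_docs)
for row in toy_rows:
    print({term: round(w, 3) for term, w in sorted(row.items())})
```

In the second toy document, 'bad' gets a larger weight than 'film' because it appears in fewer documents.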

We can explore the matrix to get an idea of what it contains. It should contain 25000 documents (as per the introduction), but how many features have been retained?

In [32]:
train_matrix.shape
Out[32]:
(25000, 105350)

A fair amount, as it happens: 25000 documents, 105350 features. We can also print the non-zero entries in a subset of the DTM, but that doesn't give us much information at this stage (we see the non-zero weights, but we don't know which features they correspond to).

In [36]:
print(train_matrix[0:9,0:9])
  (0, 2)	0.0553631491077
  (1, 3)	0.0250716608331
  (2, 2)	0.0341384144135
  (4, 2)	0.100478224693
  (7, 2)	0.141603801927
  (7, 3)	0.0266220723609

Back to top

MULTINOMIAL NAIVE BAYES CLASSIFIER<a name=mnb></a>

Multinomial naive Bayes (MultinomialNB) is one of several classification models available in scikit-learn. To find the best possible classifier we would have to try some of the others as well, but at this stage I just want to show you how the sentiment analysis works.

The fit function takes the feature matrix as well as the vector of labels we made when we read the data files.

In [38]:
model = MultinomialNB().fit(train_matrix, train_labels)
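
To give a feel for what fit and predict_proba compute, here is a minimal pure-Python sketch of multinomial naive Bayes with add-one (Laplace) smoothing on raw token counts. It is a simplification under stated assumptions, not scikit-learn's implementation: the function names and toy documents are hypothetical, and the model above is fitted on TFIDF weights rather than counts.

```python
import math
from collections import Counter, defaultdict

def train_mnb(tokenized_docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per class."""
    vocab = {tok for doc in tokenized_docs for tok in doc}
    class_totals = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(tokenized_docs, labels):
        token_counts[label].update(doc)
    model = {}
    for label, n_docs in class_totals.items():
        denom = sum(token_counts[label].values()) + alpha * len(vocab)
        log_prior = math.log(n_docs / len(labels))
        log_lik = {tok: math.log((token_counts[label][tok] + alpha) / denom)
                   for tok in vocab}
        model[label] = (log_prior, log_lik)
    return model

def predict_proba_mnb(model, doc):
    """Return {label: probability}; tokens unseen in training are ignored."""
    log_scores = {}
    for label, (log_prior, log_lik) in model.items():
        log_scores[label] = log_prior + sum(
            log_lik[tok] for tok in doc if tok in log_lik)
    top = max(log_scores.values())    # subtract the max for numerical stability
    unnorm = {label: math.exp(s - top) for label, s in log_scores.items()}
    total = sum(unnorm.values())
    return {label: v / total for label, v in unnorm.items()}

toy_docs = [['good', 'film'], ['good', 'plot'], ['bad', 'film'], ['bad', 'plot']]
toy_labels = [1, 1, 0, 0]             # 1 = positive, 0 = negative
toy_model = train_mnb(toy_docs, toy_labels)
toy_proba = predict_proba_mnb(toy_model, ['good', 'film'])
print({label: round(p, 2) for label, p in toy_proba.items()})
```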

Now that we've trained a model, we can try it out on a 1-star review (but we pick a review in the testing set to avoid the effects of overfitting).

In [40]:
neg_sample_text = open(
    '{}test/neg/9999_1.txt'.format(data_path)).read()
print(neg_sample_text)
When all we have anymore is pretty much reality TV shows with people making fools of themselves for whatever reason be it too fat or can't sing or cook worth a damn than I know Hollywood has run out of original ideas. I can not recall a time when anything original or intelligent came out on TV in the last 15 years. What is our obsession with watching bums make fools of themselves? I would have thought these types of programs would have run full circle but every year they come up with something new that is more strange then the one before. OK so people in this one need to lose weight...most Americans need to lose weight. I just think we all to some degree enjoy watching people humiliated. Maybe it makes us feel better when we see someone else looking like a jerk. I don't know but I just wish something intelligent would come out that did not insult your intelligence.

The overall sentiment seems fairly negative. Let's see if our model agrees by computing the class probabilities. The columns follow the sorted class labels (model.classes_), so the negative class (0) comes first, then the positive class (1).

In [41]:
neg_sample_vec = vectorizer.transform([neg_sample_text])
model.predict_proba(neg_sample_vec)
Out[41]:
array([[ 0.81292637,  0.18707363]])

That seems fairly conclusive.

Let's do the same thing for a 10-star movie review in the testing set.

In [42]:
pos_sample_text = open(
    '{}test/pos/9999_10.txt'.format(data_path)).read()
print(pos_sample_text)
Although I'm not a golf fan, I attended a sneak preview of this movie and absolutely loved it. The historical settings, the blatant class distinctions, and seeing the good and the bad on both sides of the dividing line held my attention throughout. The actors and their characterizations were all mesmerizing. And I was on the edge of my seat during the golf segments, which were not only dramatic and exciting but easy to follow. Toward the end of this movie, "Seabiscuit" came strongly to mind, although "The Greatest Game Ever Played" is far less complex a story than that film. In both cases, the fact that the events really happened deepened my interest.

We would expect this review to be fairly clearly positive, based on the text alone. What does the model say?

In [43]:
pos_sample_vec = vectorizer.transform([pos_sample_text])
model.predict_proba(pos_sample_vec)
Out[43]:
array([[ 0.33913831,  0.66086169]])

The class probabilities are closer to one another, but the positive sentiment is stronger, which is a good sign.

Even though it's not from a review, let's see how the model would deal with a tricky sentence with a "not" in it.

In [19]:
stuff = vectorizer.transform(
    ['A ten pound laptop is not a good travel companion.'])
model.predict_proba(stuff)
Out[19]:
array([[ 0.5583131,  0.4416869]])

Nice! We don't have high confidence but it's a correct classification.


Back to top

PERFORMANCE EVALUATION<a name=perf></a>

It's not enough to try out the sentiment analysis on 1 or 2 reviews: how well does the model perform on the 25,000 testing cases?

We'll need to load the testing documents before we can compute some evaluation metrics.

In [49]:
test_docs = []
test_labels = []

# Positive reviews are labeled 1.
pos_file_names = glob.glob('{}test/pos/*.txt'.format(data_path))
for file_name in pos_file_names:
    with open(file_name) as review_file:
        test_docs.append(review_file.read())
    test_labels.append(1)
# Negative reviews are labeled 0.
neg_file_names = glob.glob('{}test/neg/*.txt'.format(data_path))
for file_name in neg_file_names:
    with open(file_name) as review_file:
        test_docs.append(review_file.read())
    test_labels.append(0)

We get the feature vectors for the test data and the model's predictions. Note that we use transform on testing data rather than fit_transform.

In [50]:
test_matrix = vectorizer.transform(test_docs)
predicted = model.predict(test_matrix)
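
A toy sketch of why transform is the right call here: the vocabulary (and hence the column layout) is fixed when the vectorizer is fitted on the training data, and test documents are only mapped onto it. The class below is a hypothetical stand-in that counts tokens; it is not scikit-learn's implementation and leaves out the TFIDF weighting.

```python
class ToyCountVectorizer:
    """Toy stand-in for a vectorizer (illustration only)."""

    def fit(self, tokenized_docs):
        # The vocabulary is learned once, from the training documents.
        terms = sorted({tok for doc in tokenized_docs for tok in doc})
        self.vocabulary_ = {term: col for col, term in enumerate(terms)}
        return self

    def transform(self, tokenized_docs):
        # Map documents onto the fixed vocabulary; unseen tokens are dropped.
        rows = []
        for doc in tokenized_docs:
            row = [0] * len(self.vocabulary_)
            for tok in doc:
                col = self.vocabulary_.get(tok)
                if col is not None:
                    row[col] += 1
            rows.append(row)
        return rows

toy_train = [['good', 'film'], ['bad', 'film']]
toy_test = [['good', 'soundtrack']]
toy_vectorizer = ToyCountVectorizer().fit(toy_train)
print(toy_vectorizer.vocabulary_)          # {'bad': 0, 'film': 1, 'good': 2}
print(toy_vectorizer.transform(toy_test))  # [[0, 0, 1]]
```

Refitting on the test documents would produce a different vocabulary, so the columns would no longer line up with the features the model was trained on.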

We look at precision, recall and the F1-score on the testing set.

Precision is the fraction of predicted positives that are actually positive, whereas recall is the fraction of actual positives that the classification model recognizes as such. Ideally, both of these values would be near 1.

The F1-score is the harmonic mean of these quantities.

In [51]:
print(metrics.classification_report(
    test_labels, predicted, target_names=['neg', 'pos']))
             precision    recall  f1-score   support

        neg       0.77      0.87      0.82     12500
        pos       0.85      0.74      0.79     12500

avg / total       0.81      0.81      0.80     25000

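With an average F1-score of about 0.80 on a balanced test set, where random guessing would score about 0.50, the values are actually fairly good!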
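
As a sanity check on the definitions above, the following sketch computes precision, recall and the F1-score from scratch for one class. The function name and the toy labels are made up for illustration; in practice metrics.classification_report already does this work.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

toy_true = [1, 1, 1, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0]
print(precision_recall_f1(toy_true, toy_pred))

# Harmonic mean applied to the rounded 'neg' row of the report above.
neg_f1 = 2 * 0.77 * 0.87 / (0.77 + 0.87)
print(round(neg_f1, 2))  # 0.82
```

The second print recovers the F1-score reported for the neg class from its rounded precision and recall.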


Back to top

VADER <a name=vader></a>

NLTK comes with a pre-built sentiment analyzer called vader. It relies on a fixed sentiment lexicon and a set of rules rather than on a model fitted to our data, so it was put together without reference to our training set, using material that does not necessarily contain positive and negative movie reviews.

We'll see how it performs on the testing set, but first we'll try it on the tricky sentence from above.

In [52]:
sia = SentimentIntensityAnalyzer()
In [53]:
sia.polarity_scores('A ten pound laptop is not a good travel companion.')
Out[53]:
{'compound': -0.3412, 'neg': 0.256, 'neu': 0.744, 'pos': 0.0}

vader wasn't fooled: most of the weight is neutral (neu = 0.744), the rest is negative (neg = 0.256), and the compound score is below zero, so it sees the sentence as neutral or possibly negative, but not as positive.

To evaluate on the test data, we find the prediction for each test document, and pass the predictions to classification_report (remember that vader hasn't been trained on the movie review data).

In [54]:
vader_predicted = []
for doc in test_docs:
    scores = sia.polarity_scores(doc)
    if scores['pos'] > scores['neg']:
        vader_predicted.append(1)
    else:
        vader_predicted.append(0)
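
The rule in the loop above compares the pos and neg proportions, so a document where the two are equal (for instance, one scored as entirely neutral) is labeled negative. A possible alternative, not used in this notebook, is to threshold the compound score instead. The helper names, the 0.0 threshold and the all-neutral scores below are assumptions for illustration; the laptop scores are the ones printed in Out[53].

```python
def label_by_proportions(scores):
    """Rule used in the loop above: positive only if pos outweighs neg."""
    return 1 if scores['pos'] > scores['neg'] else 0

def label_by_compound(scores, threshold=0.0):
    """Hypothetical alternative: positive if compound exceeds the threshold."""
    return 1 if scores['compound'] > threshold else 0

laptop_scores = {'compound': -0.3412, 'neg': 0.256, 'neu': 0.744, 'pos': 0.0}
# Made-up scores for a document with no sentiment-bearing words.
neutral_scores = {'compound': 0.0, 'neg': 0.0, 'neu': 1.0, 'pos': 0.0}

for scores in (laptop_scores, neutral_scores):
    print(label_by_proportions(scores), label_by_compound(scores))
```

Both rules label the laptop sentence as negative; whether one of them gives better metrics on the test set would have to be checked by rerunning the evaluation.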

We get the following performance metrics.

In [55]:
print(metrics.classification_report(
    test_labels, vader_predicted, target_names=['neg', 'pos']))
             precision    recall  f1-score   support

        neg       0.79      0.52      0.63     12500
        pos       0.64      0.86      0.74     12500

avg / total       0.72      0.69      0.68     25000