I got stuck at this point while following the code in the book. Since other readers will probably run into the same problem, I'm writing it down to share.
A quick rant first: I suspect the main reason most people get stuck here is the great GFW, so readers who can already get around it (whether by software or by physically being abroad) will most likely never need this post. If you don't feel like reading on, just use your circumvention skills instead.
Installing feedparser directly from the URL given in the book fails with an error saying setuptools is missing. When I went looking for setuptools, the official advice for Windows is to install it with ez_setup.py, but I couldn't download the official ez_setup.py either. This post gives a working alternative: http://adesquared.wordpress.com/2013/07/07/setting-up-python-and-easy_install-on-windows-7/
Copy that file straight into the C:\Python27 folder and run from the command line: python ez_setup.py install
Then change into the folder containing the feedparser installation files and run: python setup.py install
What the author intends is that articles from the feed http://newyork.craigslist.org/stp/index.rss are labelled class 1, while articles from the feed http://sfbay.craigslist.org/stp/index.rss are labelled class 0.
To get the example code running, you can substitute two RSS feeds that are still reachable.
I used these two:
NASA Image of the Day: http://www.nasa.gov/rss/dyn/image_of_the_day.rss
Yahoo Sports - NBA - Houston Rockets News: http://sports.yahoo.com/nba/teams/hou/rss.xml
In other words, if the algorithm works correctly, every article from the NASA feed should be classified as 1, and every Houston Rockets news item from Yahoo Sports should be classified as 0.
Judging from the book, the feeds the author used carried far more articles: his len(ny['entries']) is 100, whereas the RSS feeds I found only have around 10-20 entries each.
>>> import feedparser
>>> ny = feedparser.parse('http://newyork.craigslist.org/stp/index.rss')
>>> ny['entries']
>>> len(ny['entries'])
100
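For comparison, the same sanity check against the two substitute feeds looks like this (nasa and rockets are just my own variable names, and the entry counts will differ depending on when you pull the feeds):
>>> nasa = feedparser.parse('http://www.nasa.gov/rss/dyn/image_of_the_day.rss')
>>> rockets = feedparser.parse('http://sports.yahoo.com/nba/teams/hou/rss.xml')
>>> len(nasa['entries']), len(rockets['entries'])    # typically only 10-20 each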
Because one of the author's RSS feeds holds 100 articles, he can afford to strip out the 30 most frequent words ("stop words") in the code and still randomly pull 20 articles out as a test set. With the substitute feeds we may only have about 10 articles per feed, yet the code still tries to take 20 of them for testing, which obviously fails. Simply shrink the test set and the code will run; and if the articles contain too few words, removing fewer of these "stop words" can also improve the accuracy.
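Here is a minimal sketch of that adjustment, assuming feed1 and feed0 are the two parsed feeds passed into the localWords() function from the full listing further down; numTest is a name I made up, and holding out roughly 20% of the documents is just one reasonable choice:
    minLen = min(len(feed1['entries']), len(feed0['entries']))
    trainingSet = range(2*minLen); testSet = []
    numTest = max(1, int(0.2 * len(trainingSet)))    # scale the test set instead of hard-coding 20
    for i in range(numTest):
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])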
Instead of the code that removes the most frequent words, you can keep the stop words in a txt file and read them in at run time. For suggestions on which words to filter out, see http://www.ranks.nl/stopwords
For the code below to run, save the stop words into stopword.txt.
My txt file contains the following words, and the results are quite good (a short sketch of loading and applying the file follows the list):
a
about
above
after
again
against
all
am
an
and
any
are
aren't
as
at
be
because
been
before
being
below
between
both
but
by
can't
cannot
could
couldn't
did
didn't
do
does
doesn't
doing
don't
down
during
each
few
for
from
further
had
hadn't
has
hasn't
have
haven't
having
he
he'd
he'll
he's
her
here
here's
hers
herself
him
himself
his
how
how's
i
i'd
i'll
i'm
i've
if
in
into
is
isn't
it
it's
its
itself
let's
me
more
most
mustn't
my
myself
no
nor
not
of
off
on
once
only
or
other
ought
our
ours
ourselves
out
over
own
same
shan't
she
she'd
she'll
she's
should
shouldn't
so
some
such
than
that
that's
the
their
theirs
them
themselves
then
there
there's
these
they
they'd
they'll
they're
they've
this
those
through
to
too
under
until
up
very
was
wasn't
we
we'd
we'll
we're
we've
were
weren't
what
what's
when
when's
where
where's
which
while
who
who's
whom
why
why's
with
won't
would
wouldn't
you
you'd
you'll
you're
you've
your
yours
yourself
yourselves
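Here is a small self-contained sketch of how the file gets used: read stopword.txt, split it into lower-case tokens, and drop those tokens from the vocabulary. The stopWords() and localWords() functions in the full listing below do essentially the same thing; loadStopWords and the toy vocabulary here are just my own illustration:
import re

def loadStopWords(filename='stopword.txt'):
    # read the stop word file and split it into lower-case tokens
    text = open(filename).read()
    return [tok.lower() for tok in re.split(r'\W+', text) if tok]

if __name__ == '__main__':
    stopWordList = loadStopWords()
    vocabList = ['nasa', 'the', 'rocket', 'and', 'launch']    # toy vocabulary for illustration
    vocabList = [w for w in vocabList if w not in stopWordList]
    print vocabList    # 'the' and 'and' are filtered out, the content words stay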
'''
Created on Oct 19, 2010
@author: Peter
'''
from numpy import *

def loadDataSet():
    postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'my', 'dog', 'please'],
                 ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                 ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                 ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                 ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                 ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0,1,0,1,0,1]    #1 is abusive, 0 not
    return postingList,classVec

def createVocabList(dataSet):
    vocabSet = set([])  #create empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document) #union of the two sets
    return list(vocabSet)

def bagOfWords2Vec(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
        else:
            print "the word: %s is not in my Vocabulary!" % word
    return returnVec

def trainNB0(trainMatrix,trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory)/float(numTrainDocs)
    p0Num = ones(numWords); p1Num = ones(numWords)      #change to ones()
    p0Denom = 2.0; p1Denom = 2.0                        #change to 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = log(p1Num/p1Denom)          #change to log()
    p0Vect = log(p0Num/p0Denom)          #change to log()
    return p0Vect,p1Vect,pAbusive

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)    #element-wise mult
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0

def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec

def testingNB():
    print '*** load dataset for training ***'
    listOPosts,listClasses = loadDataSet()
    print 'listOPost:\n',listOPosts
    print 'listClasses:\n',listClasses
    print '\n*** create Vocab List ***'
    myVocabList = createVocabList(listOPosts)
    print 'myVocabList:\n',myVocabList
    print '\n*** Vocab show in post Vector Matrix ***'
    trainMat=[]
    for postinDoc in listOPosts:
        trainMat.append(bagOfWords2Vec(myVocabList, postinDoc))
    print 'train matrix:',trainMat
    print '\n*** train P0V p1V pAb ***'
    p0V,p1V,pAb = trainNB0(array(trainMat),array(listClasses))
    print 'p0V:\n',p0V
    print 'p1V:\n',p1V
    print 'pAb:\n',pAb
    print '\n*** classify ***'
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(bagOfWords2Vec(myVocabList, testEntry))
    print testEntry,'classified as: ',classifyNB(thisDoc,p0V,p1V,pAb)
    testEntry = ['stupid', 'garbage']
    thisDoc = array(bagOfWords2Vec(myVocabList, testEntry))
    print testEntry,'classified as: ',classifyNB(thisDoc,p0V,p1V,pAb)

def textParse(bigString):    #input is big string, output is word list
    import re
    listOfTokens = re.split(r'\W*', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]

def spamTest():
    docList=[]; classList = []; fullText =[]
    for i in range(1,26):
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(open('email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)    #create vocabulary
    trainingSet = range(50); testSet=[]     #create test set
    for i in range(10):
        randIndex = int(random.uniform(0,len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    trainMat=[]; trainClasses = []
    for docIndex in trainingSet:    #train the classifier (get probs) trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V,p1V,pSpam = trainNB0(array(trainMat),array(trainClasses))
    errorCount = 0
    for docIndex in testSet:        #classify the remaining items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(array(wordVector),p0V,p1V,pSpam) != classList[docIndex]:
            errorCount += 1
            print "classification error",docList[docIndex]
    print 'the error rate is: ',float(errorCount)/len(testSet)
    #return vocabList,fullText

def calcMostFreq(vocabList,fullText):
    import operator
    freqDict = {}
    for token in vocabList:
        freqDict[token]=fullText.count(token)
    sortedFreq = sorted(freqDict.iteritems(), key=operator.itemgetter(1), reverse=True)
    return sortedFreq[:30]

def stopWords():
    import re
    wordList = open('stopword.txt').read()    # see http://www.ranks.nl/stopwords
    listOfTokens = re.split(r'\W*', wordList)
    print 'read stop word from \'stopword.txt\':',listOfTokens
    return [tok.lower() for tok in listOfTokens]

def localWords(feed1,feed0):
    import feedparser
    docList=[]; classList = []; fullText =[]
    print 'feed1 entries length: ', len(feed1['entries']), '\nfeed0 entries length: ', len(feed0['entries'])
    minLen = min(len(feed1['entries']),len(feed0['entries']))
    print '\nmin Length: ', minLen
    for i in range(minLen):
        wordList = textParse(feed1['entries'][i]['summary'])
        print '\nfeed1\'s entries[',i,']\'s summary - ','parse text:\n',wordList
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)    #NY is class 1
        wordList = textParse(feed0['entries'][i]['summary'])
        print '\nfeed0\'s entries[',i,']\'s summary - ','parse text:\n',wordList
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)    #create vocabulary
    print '\nVocabList is ',vocabList
    print '\nRemove Stop Word:'
    stopWordList = stopWords()
    for stopWord in stopWordList:
        if stopWord in vocabList:
            vocabList.remove(stopWord)
            print 'Removed: ',stopWord
##    top30Words = calcMostFreq(vocabList,fullText)    #remove top 30 words
##    print '\nTop 30 words: ', top30Words
##    for pairW in top30Words:
##        if pairW[0] in vocabList:
##            vocabList.remove(pairW[0])
##            print '\nRemoved: ',pairW[0]
    trainingSet = range(2*minLen); testSet=[]    #create test set
    print '\n\nBegin to create a test set: \ntrainingSet:',trainingSet,'\ntestSet',testSet
    for i in range(5):
        randIndex = int(random.uniform(0,len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    print 'random select 5 sets as the testSet:\ntrainingSet:',trainingSet,'\ntestSet',testSet
    trainMat=[]; trainClasses = []
    for docIndex in trainingSet:    #train the classifier (get probs) trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    print '\ntrainMat length:',len(trainMat)
    print '\ntrainClasses',trainClasses
    print '\n\ntrainNB0:'
    p0V,p1V,pSpam = trainNB0(array(trainMat),array(trainClasses))
    #print '\np0V:',p0V,'\np1V',p1V,'\npSpam',pSpam
    errorCount = 0
    for docIndex in testSet:        #classify the remaining items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        classifiedClass = classifyNB(array(wordVector),p0V,p1V,pSpam)
        originalClass = classList[docIndex]
        result = classifiedClass != originalClass
        if result:
            errorCount += 1
        print '\n',docList[docIndex],'\nis classified as: ',classifiedClass,', while the original class is: ',originalClass,'. --',not result
    print '\nthe error rate is: ',float(errorCount)/len(testSet)
    return vocabList,p0V,p1V

def testRSS():
    import feedparser
    ny=feedparser.parse('http://www.nasa.gov/rss/dyn/image_of_the_day.rss')
    sf=feedparser.parse('http://sports.yahoo.com/nba/teams/hou/rss.xml')
    vocabList,pSF,pNY = localWords(ny,sf)

def testTopWords():
    import feedparser
    ny=feedparser.parse('http://www.nasa.gov/rss/dyn/image_of_the_day.rss')
    sf=feedparser.parse('http://sports.yahoo.com/nba/teams/hou/rss.xml')
    getTopWords(ny,sf)

def getTopWords(ny,sf):
    import operator
    vocabList,p0V,p1V=localWords(ny,sf)
    topNY=[]; topSF=[]
    for i in range(len(p0V)):
        if p0V[i] > -6.0 : topSF.append((vocabList[i],p0V[i]))
        if p1V[i] > -6.0 : topNY.append((vocabList[i],p1V[i]))
    sortedSF = sorted(topSF, key=lambda pair: pair[1], reverse=True)
    print "SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**"
    for item in sortedSF:
        print item[0]
    sortedNY = sorted(topNY, key=lambda pair: pair[1], reverse=True)
    print "NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**"
    for item in sortedNY:
        print item[0]

def test42():
    print '\n*** Load DataSet ***'
    listOPosts,listClasses = loadDataSet()
    print 'List of posts:\n', listOPosts
    print 'List of Classes:\n', listClasses
    print '\n*** Create Vocab List ***'
    myVocabList = createVocabList(listOPosts)
    print 'Vocab List from posts:\n', myVocabList
    print '\n*** Vocab show in post Vector Matrix ***'
    trainMat=[]
    for postinDoc in listOPosts:
        trainMat.append(bagOfWords2Vec(myVocabList,postinDoc))
    print 'Train Matrix:\n', trainMat
    print '\n*** Train ***'
    p0V,p1V,pAb = trainNB0(trainMat,listClasses)
    print 'p0V:\n',p0V
    print 'p1V:\n',p1V
    print 'pAb:\n',pAb
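To try it from the interpreter, save the listing to a module (I'll assume it is saved as bayes.py) and call the two test helpers; both print a lot of trace output and end with the error rate:
>>> import bayes                 # assuming the listing above is saved as bayes.py
>>> bayes.testRSS()              # trains on the two feeds and prints the error rate
>>> bayes.testTopWords()         # additionally prints the most indicative words for each feed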