I got stuck at this point while following the code in the book. Since other readers will probably run into the same problem, I'm writing it down to share.
A quick rant first: I suspect the main reason most people get stuck here is the great GFW, so readers who can already get around it (whether by software or by physically being abroad) will most likely never need this post. If you don't feel like reading on, just use your circumvention skills instead.
Installing feedparser directly from the URL given in the book fails with an error saying setuptools is missing. When I went looking for setuptools, the official advice for Windows is to install it with ez_setup.py, but I couldn't download the official ez_setup.py either. This post gives a working alternative: http://adesquared.wordpress.com/2013/07/07/setting-up-python-and-easy_install-on-windows-7/
Copy that file straight into the C:\Python27 folder and run from the command line: python ez_setup.py install
Then change into the folder containing the feedparser installation files and run: python setup.py install
What the author intends is that articles from the feed http://newyork.craigslist.org/stp/index.rss are labelled class 1, while articles from the feed http://sfbay.craigslist.org/stp/index.rss are labelled class 0.
To get the example code running, you can substitute two RSS feeds that are still reachable.
I used these two:
NASA Image of the Day: http://www.nasa.gov/rss/dyn/image_of_the_day.rss
Yahoo Sports - NBA - Houston Rockets News: http://sports.yahoo.com/nba/teams/hou/rss.xml
In other words, if the algorithm works correctly, every article from the NASA feed should be classified as 1, and every Houston Rockets news item from Yahoo Sports should be classified as 0.
Judging from the book, the feeds the author used carried far more articles: his len(ny['entries']) is 100, whereas the RSS feeds I found only have around 10-20 entries each.
>>> import feedparser
>>> ny = feedparser.parse('http://newyork.craigslist.org/stp/index.rss')
>>> ny['entries']
>>> len(ny['entries'])
100
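For comparison, the same sanity check against the two substitute feeds looks like this (nasa and rockets are just my own variable names, and the entry counts will differ depending on when you pull the feeds):
>>> nasa = feedparser.parse('http://www.nasa.gov/rss/dyn/image_of_the_day.rss')
>>> rockets = feedparser.parse('http://sports.yahoo.com/nba/teams/hou/rss.xml')
>>> len(nasa['entries']), len(rockets['entries'])    # typically only 10-20 each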
Because one of the author's RSS feeds holds 100 articles, he can afford to strip out the 30 most frequent words ("stop words") in the code and still randomly pull 20 articles out as a test set. With the substitute feeds we may only have about 10 articles per feed, yet the code still tries to take 20 of them for testing, which obviously fails. Simply shrink the test set and the code will run; and if the articles contain too few words, removing fewer of these "stop words" can also improve the accuracy.
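Here is a minimal sketch of that adjustment, assuming feed1 and feed0 are the two parsed feeds passed into the localWords() function from the full listing further down; numTest is a name I made up, and holding out roughly 20% of the documents is just one reasonable choice:
    minLen = min(len(feed1['entries']), len(feed0['entries']))
    trainingSet = range(2*minLen); testSet = []
    numTest = max(1, int(0.2 * len(trainingSet)))    # scale the test set instead of hard-coding 20
    for i in range(numTest):
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])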
Instead of the code that removes the most frequent words, you can keep the stop words in a txt file and read them in at run time. For suggestions on which words to filter out, see http://www.ranks.nl/stopwords
For the code below to run, save the stop words into stopword.txt.
My txt file contains the following words, and the results are quite good (a short sketch of loading and applying the file follows the list):
a
about
above
after
again
against
all
am
an
and
any
are
aren't
as
at
be
because
been
before
being
below
between
both
but
by
can't
cannot
could
couldn't
did
didn't
do
does
doesn't
doing
don't
down
during
each
few
for
from
further
had
hadn't
has
hasn't
have
haven't
having
he
he'd
he'll
he's
her
here
here's
hers
herself
him
himself
his
how
how's
i
i'd
i'll
i'm
i've
if
in
into
is
isn't
it
it's
its
itself
let's
me
more
most
mustn't
my
myself
no
nor
not
of
off
on
once
only
or
other
ought
our
ours
ourselves
out
over
own
same
shan't
she
she'd
she'll
she's
should
shouldn't
so
some
such
than
that
that's
the
their
theirs
them
themselves
then
there
there's
these
they
they'd
they'll
they're
they've
this
those
through
to
too
under
until
up
very
was
wasn't
we
we'd
we'll
we're
we've
were
weren't
what
what's
when
when's
where
where's
which
while
who
who's
whom
why
why's
with
won't
would
wouldn't
you
you'd
you'll
you're
you've
your
yours
yourself
yourselves
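Here is a small self-contained sketch of how the file gets used: read stopword.txt, split it into lower-case tokens, and drop those tokens from the vocabulary. The stopWords() and localWords() functions in the full listing below do essentially the same thing; loadStopWords and the toy vocabulary here are just my own illustration:
import re

def loadStopWords(filename='stopword.txt'):
    # read the stop word file and split it into lower-case tokens
    text = open(filename).read()
    return [tok.lower() for tok in re.split(r'\W+', text) if tok]

if __name__ == '__main__':
    stopWordList = loadStopWords()
    vocabList = ['nasa', 'the', 'rocket', 'and', 'launch']    # toy vocabulary for illustration
    vocabList = [w for w in vocabList if w not in stopWordList]
    print vocabList    # 'the' and 'and' are filtered out, the content words stay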
'''
Created on Oct 19, 2010
@author: Peter
'''
from numpy import *

def loadDataSet():
    postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'my', 'dog', 'please'],
                 ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                 ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                 ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                 ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                 ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0,1,0,1,0,1]    #1 is abusive, 0 not
    return postingList,classVec

def createVocabList(dataSet):
    vocabSet = set([])  #create empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document) #union of the two sets
    return list(vocabSet)

def bagOfWords2Vec(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
        else:
            print "the word: %s is not in my Vocabulary!" % word
    return returnVec

def trainNB0(trainMatrix,trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory)/float(numTrainDocs)
    p0Num = ones(numWords); p1Num = ones(numWords)      #change to ones()
    p0Denom = 2.0; p1Denom = 2.0                        #change to 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = log(p1Num/p1Denom)          #change to log()
    p0Vect = log(p0Num/p0Denom)          #change to log()
    return p0Vect,p1Vect,pAbusive

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)    #element-wise mult
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0

def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec

def testingNB():
    print '*** load dataset for training ***'
    listOPosts,listClasses = loadDataSet()
    print 'listOPost:\n',listOPosts
    print 'listClasses:\n',listClasses
    print '\n*** create Vocab List ***'
    myVocabList = createVocabList(listOPosts)
    print 'myVocabList:\n',myVocabList
    print '\n*** Vocab show in post Vector Matrix ***'
    trainMat=[]
    for postinDoc in listOPosts:
        trainMat.append(bagOfWords2Vec(myVocabList, postinDoc))
    print 'train matrix:',trainMat
    print '\n*** train P0V p1V pAb ***'
    p0V,p1V,pAb = trainNB0(array(trainMat),array(listClasses))
    print 'p0V:\n',p0V
    print 'p1V:\n',p1V
    print 'pAb:\n',pAb
    print '\n*** classify ***'
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(bagOfWords2Vec(myVocabList, testEntry))
    print testEntry,'classified as: ',classifyNB(thisDoc,p0V,p1V,pAb)
    testEntry = ['stupid', 'garbage']
    thisDoc = array(bagOfWords2Vec(myVocabList, testEntry))
    print testEntry,'classified as: ',classifyNB(thisDoc,p0V,p1V,pAb)

def textParse(bigString):    #input is big string, output is word list
    import re
    listOfTokens = re.split(r'\W*', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]

def spamTest():
    docList=[]; classList = []; fullText =[]
    for i in range(1,26):
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(open('email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)    #create vocabulary
    trainingSet = range(50); testSet=[]     #create test set
    for i in range(10):
        randIndex = int(random.uniform(0,len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    trainMat=[]; trainClasses = []
    for docIndex in trainingSet:    #train the classifier (get probs) trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V,p1V,pSpam = trainNB0(array(trainMat),array(trainClasses))
    errorCount = 0
    for docIndex in testSet:        #classify the remaining items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(array(wordVector),p0V,p1V,pSpam) != classList[docIndex]:
            errorCount += 1
            print "classification error",docList[docIndex]
    print 'the error rate is: ',float(errorCount)/len(testSet)
    #return vocabList,fullText

def calcMostFreq(vocabList,fullText):
    import operator
    freqDict = {}
    for token in vocabList:
        freqDict[token]=fullText.count(token)
    sortedFreq = sorted(freqDict.iteritems(), key=operator.itemgetter(1), reverse=True)
    return sortedFreq[:30]

def stopWords():
    import re
    wordList = open('stopword.txt').read()    # see http://www.ranks.nl/stopwords
    listOfTokens = re.split(r'\W*', wordList)
    print 'read stop word from \'stopword.txt\':',listOfTokens
    return [tok.lower() for tok in listOfTokens]

def localWords(feed1,feed0):
    import feedparser
    docList=[]; classList = []; fullText =[]
    print 'feed1 entries length: ', len(feed1['entries']), '\nfeed0 entries length: ', len(feed0['entries'])
    minLen = min(len(feed1['entries']),len(feed0['entries']))
    print '\nmin Length: ', minLen
    for i in range(minLen):
        wordList = textParse(feed1['entries'][i]['summary'])
        print '\nfeed1\'s entries[',i,']\'s summary - ','parse text:\n',wordList
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)    #NY is class 1
        wordList = textParse(feed0['entries'][i]['summary'])
        print '\nfeed0\'s entries[',i,']\'s summary - ','parse text:\n',wordList
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)    #create vocabulary
    print '\nVocabList is ',vocabList
    print '\nRemove Stop Word:'
    stopWordList = stopWords()
    for stopWord in stopWordList:
        if stopWord in vocabList:
            vocabList.remove(stopWord)
            print 'Removed: ',stopWord
##    top30Words = calcMostFreq(vocabList,fullText)    #remove top 30 words
##    print '\nTop 30 words: ', top30Words
##    for pairW in top30Words:
##        if pairW[0] in vocabList:
##            vocabList.remove(pairW[0])
##            print '\nRemoved: ',pairW[0]
    trainingSet = range(2*minLen); testSet=[]    #create test set
    print '\n\nBegin to create a test set: \ntrainingSet:',trainingSet,'\ntestSet',testSet
    for i in range(5):
        randIndex = int(random.uniform(0,len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    print 'random select 5 sets as the testSet:\ntrainingSet:',trainingSet,'\ntestSet',testSet
    trainMat=[]; trainClasses = []
    for docIndex in trainingSet:    #train the classifier (get probs) trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    print '\ntrainMat length:',len(trainMat)
    print '\ntrainClasses',trainClasses
    print '\n\ntrainNB0:'
    p0V,p1V,pSpam = trainNB0(array(trainMat),array(trainClasses))
    #print '\np0V:',p0V,'\np1V',p1V,'\npSpam',pSpam
    errorCount = 0
    for docIndex in testSet:        #classify the remaining items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        classifiedClass = classifyNB(array(wordVector),p0V,p1V,pSpam)
        originalClass = classList[docIndex]
        result = classifiedClass != originalClass
        if result:
            errorCount += 1
        print '\n',docList[docIndex],'\nis classified as: ',classifiedClass,', while the original class is: ',originalClass,'. --',not result
    print '\nthe error rate is: ',float(errorCount)/len(testSet)
    return vocabList,p0V,p1V

def testRSS():
    import feedparser
    ny=feedparser.parse('http://www.nasa.gov/rss/dyn/image_of_the_day.rss')
    sf=feedparser.parse('http://sports.yahoo.com/nba/teams/hou/rss.xml')
    vocabList,pSF,pNY = localWords(ny,sf)

def testTopWords():
    import feedparser
    ny=feedparser.parse('http://www.nasa.gov/rss/dyn/image_of_the_day.rss')
    sf=feedparser.parse('http://sports.yahoo.com/nba/teams/hou/rss.xml')
    getTopWords(ny,sf)

def getTopWords(ny,sf):
    import operator
    vocabList,p0V,p1V=localWords(ny,sf)
    topNY=[]; topSF=[]
    for i in range(len(p0V)):
        if p0V[i] > -6.0 : topSF.append((vocabList[i],p0V[i]))
        if p1V[i] > -6.0 : topNY.append((vocabList[i],p1V[i]))
    sortedSF = sorted(topSF, key=lambda pair: pair[1], reverse=True)
    print "SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**"
    for item in sortedSF:
        print item[0]
    sortedNY = sorted(topNY, key=lambda pair: pair[1], reverse=True)
    print "NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**"
    for item in sortedNY:
        print item[0]

def test42():
    print '\n*** Load DataSet ***'
    listOPosts,listClasses = loadDataSet()
    print 'List of posts:\n', listOPosts
    print 'List of Classes:\n', listClasses
    print '\n*** Create Vocab List ***'
    myVocabList = createVocabList(listOPosts)
    print 'Vocab List from posts:\n', myVocabList
    print '\n*** Vocab show in post Vector Matrix ***'
    trainMat=[]
    for postinDoc in listOPosts:
        trainMat.append(bagOfWords2Vec(myVocabList,postinDoc))
    print 'Train Matrix:\n', trainMat
    print '\n*** Train ***'
    p0V,p1V,pAb = trainNB0(trainMat,listClasses)
    print 'p0V:\n',p0V
    print 'p1V:\n',p1V
    print 'pAb:\n',pAb
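To try it from the interpreter, save the listing to a module (I'll assume it is saved as bayes.py) and call the two test helpers; both print a lot of trace output and end with the error rate:
>>> import bayes                 # assuming the listing above is saved as bayes.py
>>> bayes.testRSS()              # trains on the two feeds and prints the error rate
>>> bayes.testTopWords()         # additionally prints the most indicative words for each feed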