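This article shows, by example, how to scrape Kugou playlists with the PHPCrawl crawler library. It is shared here for reference; the details follow:
After watching some videos about web crawlers, I was itching to scrape something. The meme wars on Facebook have been fierce lately, so I wanted to download all of those images, but I could not find a suitable VPN at the time. Instead, I scraped Kugou's featured songs from the past month, along with their short descriptions, to my local machine. The code is a bit messy and I am not very happy with it, so I did not really want to post it and embarrass myself. On second thought, though, this is my first crawler after all, so... here is the unsightly code below. (Because the amount of data is small, I did not bother with multiprocessing. I did look through the PHPCrawl documentation, though, and found that the library already wraps up every feature I could think of, so it is easy to implement.)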
<?php
header("Content-type:text/html;charset=utf-8");

// It may take a while to crawl a site ...
set_time_limit(10000);

include("libs/PHPCrawler.class.php");

class MyCrawler extends PHPCrawler
{
    // Called by PHPCrawl for every document it receives.
    function handleDocumentInfo($DocInfo)
    {
        // Just detect linebreak for output ("\n" in CLI-mode, otherwise "<br />").
        if (PHP_SAPI == "cli") $lb = "\n";
        else $lb = "<br />";

        $url = $DocInfo->url;

        // Only single-playlist pages are parsed; index pages are just followed.
        $pat = "/http:\/\/www\.kugou\.com\/yy\/special\/single\/\d+\.html/";
        if (preg_match($pat, $url) > 0) {
            $this->parseSonglist($DocInfo);
        }

        flush();
    }

    public function parseSonglist($DocInfo)
    {
        $content = $DocInfo->content;
        $songlistArr = array();
        $songlistArr['raw_url'] = $DocInfo->url;

        // Parse the playlist name. The label "名稱:" ("Name:") must match the page text exactly.
        // The original class [^(<br)] also excluded the letters "b" and "r", which
        // truncated some titles; [^<] stops at the next tag instead.
        $matches = array();
        $pat = "/<span>名稱:<\/span>([^<]+)<br/";
        $ret = preg_match($pat, $content, $matches);
        if ($ret > 0) {
            $songlistArr['title'] = $matches[1];
        } else {
            $songlistArr['title'] = '';
        }

        // Parse the songs: each one is an <a title="..." hidefocus="..."> link.
        $pat = "/<a title=\"([^\"]+)\" hidefocus=\"/";
        $matches = array();
        preg_match_all($pat, $content, $matches);
        $songlistArr['songs'] = array();
        for ($i = 0; $i < count($matches[0]); $i++) {
            $song_title = $matches[1][$i];
            array_push($songlistArr['songs'], array('title' => $song_title));
        }

        echo "<pre>";
        print_r($songlistArr);
        echo "</pre>";
    }
}

$crawler = new MyCrawler();

// URL to crawl
$start_url = "http://www.kugou.com/yy/special/index/1-0-2.html";
$crawler->setURL($start_url);

// Only receive content of files with content-type "text/html"
$crawler->addContentTypeReceiveRule("#text/html#");

// Links to follow: single-playlist pages and the paginated index pages
$crawler->addURLFollowRule("#http://www\.kugou\.com/yy/special/single/\d+\.html$# i");
$crawler->addURLFollowRule("#http://www\.kugou\.com/yy/special/index/\d+-\d+-2\.html$# i");

// Store and send cookie-data like a browser does
$crawler->enableCookieHandling(true);

// No traffic limit (0 = unlimited). For testing, a cap such as 1000 * 1024 bytes
// (about 1 MB) avoids pulling down the whole site.
$crawler->setTrafficLimit(0);

// That's enough, now here we go
$crawler->go();

// At the end, after the process is finished, we print a short
// report (see method getProcessReport() for more information)
$report = $crawler->getProcessReport();

if (PHP_SAPI == "cli") $lb = "\n";
else $lb = "<br />";

echo "Summary:" . $lb;
echo "Links followed: " . $report->links_followed . $lb;
echo "Documents received: " . $report->files_received . $lb;
echo "Bytes received: " . $report->bytes_received . " bytes" . $lb;
echo "Process runtime: " . $report->process_runtime . " sec" . $lb;
?>
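The URL filter and the two extraction patterns can be tried out without PHPCrawl or network access. The following is only a rough sketch: the helper names `isSonglistUrl` and `parseSonglistHtml` and the sample HTML are made up for illustration, and the markup is assumed to resemble what the crawler above expects (a "名稱:" label span followed by the playlist name, and `<a title="..." hidefocus="...">` links for songs). The real Kugou pages may differ.

```php
<?php
// Hypothetical helpers for illustration; they are not part of PHPCrawl.

/** Returns true when $url looks like a single-playlist page. */
function isSonglistUrl(string $url): bool
{
    $pat = '#^http://www\.kugou\.com/yy/special/single/\d+\.html$#i';
    return preg_match($pat, $url) === 1;
}

/** Extracts the playlist name and the song titles from a page's HTML. */
function parseSonglistHtml(string $html): array
{
    $result = array('title' => '', 'songs' => array());

    // Playlist name: the text between the "名稱:" label span and the next tag.
    if (preg_match('/<span>名稱:<\/span>([^<]+)</u', $html, $m) === 1) {
        $result['title'] = trim($m[1]);
    }

    // Song titles: the title attribute of each <a ... hidefocus="..."> link.
    if (preg_match_all('/<a title="([^"]+)" hidefocus="/u', $html, $m) > 0) {
        foreach ($m[1] as $songTitle) {
            $result['songs'][] = array('title' => $songTitle);
        }
    }

    return $result;
}

// Made-up sample markup, assumed to resemble a playlist page.
$sampleHtml = '<p><span>名稱:</span>Sample Playlist<br /></p>'
    . '<a title="Artist A - Song One" hidefocus="true" href="#">Song One</a>'
    . '<a title="Artist B - Song Two" hidefocus="true" href="#">Song Two</a>';

$parsed = parseSonglistHtml($sampleHtml);
print_r($parsed);
```

Keeping the parsing in a plain function like this makes it possible to check the patterns against a saved page before starting a full crawl.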
PS: Here are two handy regular expression tools for reference:
JavaScript regular expression online tester:
http://tools.jb51.net/regex/javascript
Regular expression online generator:
http://tools.jb51.net/regex/create_reg