Crawling Kugou Playlists with the PHPCrawl Crawler Library: A Worked Example

2020-03-22 19:44:16

Source: reposted

Contributed by: a site user

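This article shows how to crawl Kugou playlists with the PHPCrawl crawler library. It covers basic use of PHPCrawl and the regular-expression matching used to extract the data.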

The example below walks through the approach in detail:

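After watching some videos about web crawlers, I was itching to crawl something. The meme war on Facebook has been intense lately, so my first idea was to download every meme, but I could not find a suitable VPN at the time. Instead, I crawled Kugou's featured playlists from the past month, along with their short descriptions, and saved them locally. The code is a bit messy and I am not very happy with it, so I was reluctant to post it. Still, it is my first crawler, so here it is. Because the amount of data is small, I did not bother with multiple processes. That said, after reading the PHPCrawl documentation I found that the library already wraps every feature I could think of, which makes this kind of task easy to implement.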

<?php
header("Content-type:text/html;charset=utf-8");

// It may take a while to crawl a site ...
set_time_limit(10000);

include("libs/PHPCrawler.class.php");

class MyCrawler extends PHPCrawler
{
    // Called by PHPCrawl for every document it receives.
    function handleDocumentInfo($DocInfo)
    {
        $url = $DocInfo->url;

        // Only single-playlist pages are parsed; index pages are just followed.
        $pat = '/http:\/\/www\.kugou\.com\/yy\/special\/single\/\d+\.html/';
        if (preg_match($pat, $url) > 0) {
            $this->parseSonglist($DocInfo);
        }

        flush();
    }

    public function parseSonglist($DocInfo)
    {
        $content = $DocInfo->content;

        $songlistArr = array();
        $songlistArr['raw_url'] = $DocInfo->url;

        // Parse the playlist name. The label text is kept as it appears in this
        // copy of the article; the live page may use the Simplified form "名称：".
        // The original class [^(<br)] excluded the letters "b" and "r" and the
        // parentheses as well, so it is narrowed to "anything up to the next tag".
        $matches = array();
        $pat = '/<span>名稱：<\/span>([^<]+)<br/u';
        $ret = preg_match($pat, $content, $matches);
        if ($ret > 0) {
            $songlistArr['title'] = $matches[1];
        } else {
            $songlistArr['title'] = '';
        }

        // Parse the songs: each one is an <a title="..." hidefocus="..."> link.
        $pat = '/<a title="([^"]+)" hidefocus="/';
        $matches = array();
        preg_match_all($pat, $content, $matches);

        $songlistArr['songs'] = array();
        for ($i = 0; $i < count($matches[0]); $i++) {
            $song_title = $matches[1][$i];
            array_push($songlistArr['songs'], array('title' => $song_title));
        }

        echo "<pre>";
        print_r($songlistArr);
        echo "</pre>";
    }
}

$crawler = new MyCrawler();

// URL to crawl
$start_url = "http://www.kugou.com/yy/special/index/1-0-2.html";
$crawler->setURL($start_url);

// Only receive content of files with content-type "text/html"
$crawler->addContentTypeReceiveRule("#text/html#");

// Links to follow: single-playlist pages and the paginated index pages
$crawler->addURLFollowRule('#http://www\.kugou\.com/yy/special/single/\d+\.html$# i');
$crawler->addURLFollowRule('#http://www\.kugou\.com/yy/special/index/\d+-\d+-2\.html$# i');

// Store and send cookie-data like a browser does
$crawler->enableCookieHandling(true);

// A traffic limit of 0 means the crawl size is unlimited
$crawler->setTrafficLimit(0);

// That's enough, now here we go
$crawler->go();

// At the end, after the process is finished, we print a short
// report (see method getProcessReport() for more information)
$report = $crawler->getProcessReport();

// Line break for output: "\n" in CLI mode, otherwise "<br />"
if (PHP_SAPI == "cli") $lb = "\n";
else $lb = "<br />";

echo "Summary:" . $lb;
echo "Links followed: " . $report->links_followed . $lb;
echo "Documents received: " . $report->files_received . $lb;
echo "Bytes received: " . $report->bytes_received . " bytes" . $lb;
echo "Process runtime: " . $report->process_runtime . " sec" . $lb;
?>
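The two extraction patterns used in parseSonglist() can be tried out without running a crawl. The following is a minimal sketch, not part of the original script: the HTML fragment is a made-up sample that only imitates the markup the patterns expect, and parseSonglistHtml() is a hypothetical helper name. The label "名稱：" is written as it appears in this article; the live Kugou page may use the Simplified form instead. The file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical sample markup; the real Kugou page may differ.
$sampleHtml = '<span>名稱：</span>Best of March<br />'
    . '<a title="Song A" hidefocus="true" href="#">Song A</a>'
    . '<a title="Song B" hidefocus="true" href="#">Song B</a>';

// Hypothetical helper: applies the title and song patterns to an HTML string.
function parseSonglistHtml(string $content): array
{
    $songlist = ['title' => '', 'songs' => []];

    // Playlist name: the text between the label span and the next <br.
    if (preg_match('/<span>名稱：<\/span>([^<]+)<br/u', $content, $m) === 1) {
        $songlist['title'] = trim($m[1]);
    }

    // Songs: the title attribute of every <a title="..." hidefocus="..."> link.
    preg_match_all('/<a title="([^"]+)" hidefocus="/', $content, $m);
    foreach ($m[1] as $songTitle) {
        $songlist['songs'][] = ['title' => $songTitle];
    }

    return $songlist;
}

$result = parseSonglistHtml($sampleHtml);
print_r($result);
```

Testing the patterns on a saved copy of one playlist page this way makes it easier to adjust them if the site's markup changes, before spending time on a full crawl.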

PS: Here are two handy regular-expression tools for reference:

Online JavaScript regular-expression tester:
http://tools.jb51.net/regex/javascript

Online regular-expression generator:
http://tools.jb51.net/regex/create_reg
