PHP擴展curl和正則表達式輕松采集新聞

2020-03-24 17:10:41

字體：大中小

來源：轉載

供稿：網友

采集已經不是什么新名詞了，很多站長為了省事，也局限于人力的缺乏，使用程序來給自己的網站添磚加瓦，比如本人的個人網站www.xxfsw.com也采集了大量的新聞，那么如果實現呢？今天我們運用php來實現這個功能。談到采集，我們不得不說兩個東西，第一個是如何獲取遠程網站的源代碼，這個可以通過php的一個擴展curl來獲取，另一個是如果去匹配你需要的信息，這個的解決辦法是html' target='_blank'>正則表達式。Windows下開啟curl的方法如下：1、拷貝PHP目錄中的libeay32.dll， ssleay32.dll， php5ts.dll， php_curl.dll文件到 system32 目錄。2、修改php.ini：配置好 extension_dir ，去掉 extension = php_curl.dll 前面的分號。3、重起apache。Linux下開啟curl的方法如下：進入安裝原php 的源碼目錄，cd ext
cd curl
phpize
./configure --with-curl =DIR
make就會在PHPDIR/ext/curl /moudles/下生成curl .so的文件。復制curl .so文件到extensions的配置目錄，修改php .ini就好了。然后你就可以利用curl來獲取到指定url的網頁源碼了，這里給大家一個封裝好的函數：
以下為引用的內容：
function getwebcontent($url){
$ch = curl_init();
$timeout = 10;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
curl_setopt ($ch, CURLOPT_FOLLOWLOCATION, 1);
$contents = trim(curl_exec($ch));
curl_close($ch);
return $contents;
}
以下為引用的內容：
p+匹配至少一個含p的字符串
p*陪陪任何包含0個或多個p的字符串
p?匹配任何包含0個或一個p的字符串
p{2}匹配包含2個p的序列的字符串
p{2,3}匹配任何包含2個或3個的字符串
p$匹配任何以p結尾的字符串
^p匹配任何以p開頭的字符串
[^a-zA-Z]匹配任何不包含a-zA-Z的字符串
p.p匹配任何包含p、接下來是任何字符、再接下來有又是p的字符串
^.{2}$匹配任何值包含2個字符的字符串
b (.*)b 匹配任何被 b 包圍的字符串
p(hp)*匹配任何一個包含p,后面是多個或0個hp的字符串
以下為引用的內容：
[:alpha:]同[a-zA-Z]
[:alnum:]同[a-zA-Z0-9]
[:cntrl:]匹配控制字符，比如制表符，反斜杠，退格符
[:digit:]同[0-9]
[:graph:]所有ASCII33~166范圍內可以打印的字符
[:lower:]同[a-z]
[:punct:]標點符號
[:upper:]同[A-Z]
[:space:]空白字符，可以是空格、水平制表符、換行、換頁、回車
[:xdigit:]十六進制符同[a-fA-F0-9]
以下為引用的內容：
?php
header( Content-type: text/html; charset=utf-8
getinfo( http://rss.sina.com.cn/rollnews/news/gn_total.js ,1);
getinfo( http://rss.sina.com.cn/rollnews/news/gj_total.js ,2);
getinfo( http://rss.sina.com.cn/rollnews/news/sh_total.js ,3);
getinfo( http://rss.sina.com.cn/rollnews/sports/sports_total.js ,4);
getinfo( http://rss.sina.com.cn/rollnews/tech/tech1_total.js ,5);
getinfo( http://rss.sina.com.cn/rollnews/finance/finance1_news_total.js ,6);
getinfo( http://rss.sina.com.cn/rollnews/ent/ent_total.js ,7);
getinfo( http://rss.sina.com.cn/rollnews/jczs/jczs_total.js ,8);
function getinfo($infourl,$catid)
{
$pagecontent=getwebcontent($infourl);
preg_match_all( /title:/ (.*?)/ / , $pagecontent, $match);
$titlearr=$match[1];
preg_match_all( /link:/ (.*?)/ / , $pagecontent, $match);
$urlarr=$match[1];
for ($i=1;$i count($urlarr);$i++){
echo go {$titlearr[$i-1]}/n
$title=iconv( gbk , utf-8 ,$titlearr[$i-1]);
$content=iconv( gbk , utf-8 ,getnewscontent($urlarr[$i]));
$content=mysql_escape_string($content);
if(!insertdb($title,$content,$catid)) break;
}
}
function insertdb($title,$content,$catid){
將數據寫入你的庫
}
function getnewscontent($newsurl){
$newscontent=getwebcontent($newsurl);
preg_match_all( / div > $content=preg_replace( / a.*? //a /si , ,$match[1][0]);
$content=preg_replace( / div > $content=preg_replace( / div > $content=str_replace( div > return $content;
}
function getwebcontent($url){
$ch = curl_init();
$timeout = 10;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
curl_setopt ($ch, CURLOPT_FOLLOWLOCATION, 1);
$contents = trim(curl_exec($ch));
curl_close($ch);
return $contents;
}
?
然后如何實現比較實時的同步呢，這可以利用windows下的任務計劃或linux下的crontab 了，定時（比如十分鐘）執行這個程序，這樣，你就不再愁網站沒有內容了，哈哈，另外本人開了個工作室www.beijingjianzhan.com（北京建站），我們開發了一個系統，不僅能夠采集信息，而且能自動地進行再加工，進行偽原創，這樣就更符合搜索引擎的品味了，讓你的網站瘋狂地被收錄吧，另外可以加我的Q376504340討論技術性話題。html教程

鄭重聲明：本文版權歸原作者所有，轉載文章僅為傳播更多信息之目的，如作者信息標記有誤，請第一時間聯系我們修改或刪除，多謝。

上一篇：phpwind推進中小網站生態鏈啟動全國巡回會議

下一篇：php教程：php設計模式介紹之規范模式