Web Scraping with R, Part 1

Source: Internet  Date: 2016-10-07

For collecting and analyzing data.

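[Note] Everything shared here is what I have learned from specialist books. It may include a few tricks, insights, and small lessons from my own use, but it is far less thorough and engaging than what the books offer. While studying, I found that publicly searchable resources on web scraping with R are not as plentiful as for other topics, so I am sharing a little here for those who come after. For that reason I make no claim of originality for this material. Everyone is welcome to discuss, learn together, and point out mistakes, so that we can all improve. EMAIL: [email protected]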

1.WHY R?

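Even non-specialists have probably heard that R currently performs far worse at web scraping than other software. R is not a tool built for the job, and simple applications such as 八爪鱼 (Octoparse) can fully meet the needs of us "occasional users". So why scrape with R? I believe everyone who comes searching for R scraping techniques has their own answer.

A few advantages worth keeping in mind:

#1. R is a software environment with a primarily statistical focus.

#2. It can produce impressive visualizations.

#3. The whole workflow can potentially be kept in one complete set of procedures.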

2.About basics.

We need to prepare ourselves with some basic knowledge of HTML, XML, and the logic of regular expressions and XPath, BUT the operations are executed from WITHIN R!
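
To make "executed from within R" concrete, here is a minimal sketch that applies an XPath query and a regular expression to a small HTML snippet. The snippet, the class names, and the variable names are hypothetical and only for illustration; the calls come from the XML and stringr packages.

```r
library(XML)
library(stringr)

# Hypothetical HTML snippet, used here instead of a live page
html_text <- '<html><body>
  <h1>Box office</h1>
  <p class="film">Film A: 1200 tickets</p>
  <p class="film">Film B: 850 tickets</p>
  <p class="note">Updated daily</p>
</body></html>'

# Parse the string (asText = TRUE means "this is content, not a file or URL")
doc <- htmlParse(html_text, asText = TRUE)

# XPath: select the text of every <p> whose class attribute is "film"
film_lines <- xpathSApply(doc, "//p[@class='film']", xmlValue)

# Regular expression: pull the first run of digits out of each line
tickets <- as.numeric(str_extract(film_lines, "[0-9]+"))

print(film_lines)
print(tickets)
```

XPath picks the nodes and the regular expression cleans up the text inside them, which is roughly the division of labour the rest of these notes rely on.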

3.RECOMMENDATION

http://www.r-datacollection.com

4.A little case study.

# Scrape movie box-office information

library(stringr)

library(XML)

library(maps)

# htmlParse() parses (interprets) an HTML document

# Create a parsed-document object

movie_parsed <- htmlParse("http://58921.com/boxoffice/wangpiao/20161004",
                          encoding = "UTF-8")

# Next step: extract the tables/data

# readHTMLTable() identifies the tables in the document and reads them out

tables<-readHTMLTable(movie_parsed,stringsAsFactors=FALSE)

is.matrix(tables)

is.character(tables)

is.data.frame(tables)

is.list(tables)

# So the result is a list: of the four checks, only is.list() returns TRUE
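
Since the result is a list, the individual tables have to be pulled out of it before they can be used. The sketch below shows one way to do that. It uses a hypothetical inline page in place of the live site so that it runs without a network connection, and all names in it are illustrative.

```r
library(XML)

# Hypothetical page with one table, standing in for the live site
html_text <- '<html><body>
<table>
  <tr><th>film</th><th>tickets</th></tr>
  <tr><td>Film A</td><td>1200</td></tr>
  <tr><td>Film B</td><td>850</td></tr>
</table>
</body></html>'

page <- htmlParse(html_text, asText = TRUE)
demo_tables <- readHTMLTable(page, stringsAsFactors = FALSE)

# The list holds one element per <table> found in the document
length(demo_tables)

# Each element is a data frame; extract one with [[ ]]
first_table <- demo_tables[[1]]
str(first_table)
```

With the real page, the same `[[ ]]` indexing on `tables` should return the box-office table, assuming the page still contains one.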

 

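Because R's support for Chinese is not very good, running into some garbled Chinese text is normal, so we need more advanced text manipulation tools. (In this example, some columns lose their information entirely because the site serves the data in those columns as .png images.)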
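
One common cause of garbled text is an encoding mismatch, where text is read under a different encoding than the one it was written in. The sketch below is a round trip with base R's iconv(), assuming the mismatch is between GBK and UTF-8. GBK is only an illustrative assumption here, not something confirmed for this site, and its availability depends on the platform's iconv.

```r
# "\u7968\u623f" is the Chinese word for "box office", written with
# Unicode escapes so the script itself stays ASCII
original <- "\u7968\u623f"

# Simulate text delivered as GBK: convert the UTF-8 string to GBK bytes
as_gbk <- iconv(original, from = "UTF-8", to = "GBK")

# Repair step: state the real source encoding and convert back to UTF-8
repaired <- iconv(as_gbk, from = "GBK", to = "UTF-8")

# Inspect the declared encoding and check that the round trip is lossless
Encoding(repaired)
repaired == original
```

If iconv() returns NA, the assumed source encoding is wrong or unsupported on that system, and iconvlist() shows which encodings are available.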

5.ABC's of...

For browsing the Web, there is a hidden standard behind the scenes that structures how information is displayed.

#HTML, the Hypertext Markup Language

HTML is not a dedicated data storage format, but HTML documents usually contain the information we are after. In general, HTML is used to shape how information is displayed.

#XML, the Extensible Markup Language

The main purpose of XML is to store data. HTML documents are interpreted and transformed into pretty-looking output by browsers, whereas XML is "just" data wrapped in user-defined tags. The user-defined tags make XML much more flexible for storing data than HTML. Both HTML and XML-style documents offer natural, often hierarchical, structures for data storage.
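
As a small illustration of "data wrapped in user-defined tags", the sketch below parses a hypothetical XML document with the XML package and flattens it into a data frame. The tag names (movies, movie, title, gross) are invented for this example.

```r
library(XML)

# Hypothetical XML document: the tag names are chosen by the author of the
# data, not fixed by a standard as they are in HTML
xml_text <- '<movies>
  <movie><title>Film A</title><gross>12.5</gross></movie>
  <movie><title>Film B</title><gross>8.3</gross></movie>
</movies>'

doc <- xmlParse(xml_text, asText = TRUE)

# The hierarchy can be queried directly with XPath
titles <- xpathSApply(doc, "//movie/title", xmlValue)

# For flat, record-like XML, xmlToDataFrame() turns each child of the root
# into a row and each grandchild tag into a column
movies <- xmlToDataFrame(doc, stringsAsFactors = FALSE)
movies$gross <- as.numeric(movies$gross)

print(titles)
print(movies)
```

The same XPath functions work on both HTML and XML documents once they are parsed, which is why the two formats can be treated together.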

(unfinished......)

 
