How to Scrape HTML Web Pages with Web Scraping

This article explains how to scrape HTML web pages: the basic structure of an HTML page, the steps involved in scraping, and a worked example in Python using requests and BeautifulSoup.


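There are several ways to obtain data from the web:

- Scrape HTML web pages

- Download data files directly, for example csv, txt, or pdf files

- Access data through an application programming interface (API), for example a movie database or Twitter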

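If you choose to scrape web pages, you need to understand the basic structure of an HTML page. You can refer to this page:

Basic structure of HTML

HTML tags: head, body, p, a, form, table, and so on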

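Tags can have attributes. For example, the a tag has an href attribute that specifies the link's target.

class and id are special attributes that HTML uses, through Cascading Style Sheets (CSS), to control the style of each element. An id is an element's unique identifier, while class groups elements so that they can be styled together.

An element can belong to more than one class. The class names are separated by spaces, for example <h3 class="city main">London</h3>

The figure below is an example from W3Schools: the city class sets three properties, the main class sets one, and the London heading uses both the city and main classes, so it is rendered as shown.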
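To make the class and id behavior concrete, here is a minimal sketch that parses a small snippet modeled on the h3 example. The second heading (Paris) and the id value are hypothetical additions for illustration. It assumes beautifulsoup4 is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the h3 example; "Paris" and the id are made up
html_doc = (
    '<h3 class="city main" id="london">London</h3>'
    '<h3 class="city">Paris</h3>'
)
soup = BeautifulSoup(html_doc, "html.parser")

first_h3 = soup.find("h3")

# class can hold several values, so BeautifulSoup returns it as a list
london_classes = first_h3["class"]

# id holds a single value, so it comes back as a plain string
london_id = first_h3["id"]

# The CSS selector ".city" matches every element that carries the city class
city_names = [tag.get_text() for tag in soup.select(".city")]

# ".main" matches only the element that also carries the main class
main_names = [tag.get_text() for tag in soup.select(".main")]

print(london_classes)
print(london_id)
print(city_names)
print(main_names)
```

Because an element keeps all of its classes, the same heading can be matched through either class name.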

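Tags can be referred to by their position relative to one another:

child: a child is a tag inside another tag. For example, the two p tags are children of the div tag.

parent: a parent is a tag that has another tag inside it. For example, the html tag is the parent of the body tag.

sibling: a sibling is a tag that shares the same parent with another tag. For example, in the HTML example the head and body tags are siblings because both are inside html. The two p tags are also siblings because they share the same parent.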
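These relationships map directly onto BeautifulSoup's navigation attributes. The sketch below uses a hypothetical, whitespace-free document so that every child node is a tag. It assumes beautifulsoup4 is installed.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical document written without whitespace between tags,
# so no text nodes appear among the children
html_doc = (
    "<html><head><title>Demo</title></head>"
    "<body><div><p>First</p><p>Second</p></div></body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

div = soup.find("div")

# children: the tags directly inside div
child_names = [child.name for child in div.children if isinstance(child, Tag)]

# parent: the tag that directly contains body
body_parent = soup.body.parent.name

# siblings: tags that share the same parent
first_p = div.find("p")
second_p = first_p.find_next_sibling("p")
head_sibling = soup.head.next_sibling.name

print(child_names)
print(body_parent)
print(second_p.get_text())
print(head_sibling)
```

On real pages the whitespace between tags shows up as extra text nodes, so filtering with isinstance(child, Tag), as done for child_names, is usually needed.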

Four steps to scrape a web page:

Step 1: Install the modules

Install requests and beautifulsoup4, which are used to fetch and parse web pages. Alternatives include scrapy, selenium, and others.

- requests: lets you send HTTP/1.1 requests from Python. To install it, open Terminal (Mac) or Anaconda Command Prompt (Windows).

- BeautifulSoup: a web page parsing library, installed from the same prompt.
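One way to confirm that the modules are available before going further is to check whether Python can locate them. This is a sketch rather than part of the original steps: the helper name missing_packages is hypothetical, and the suggested command assumes pip is the installer in use.

```python
import importlib.util

# Package name on PyPI -> module name used in import statements.
# beautifulsoup4 is imported as bs4.
required = {"requests": "requests", "beautifulsoup4": "bs4"}


def missing_packages(packages):
    """Return the package names whose import module cannot be found."""
    return [
        pip_name
        for pip_name, module_name in packages.items()
        if importlib.util.find_spec(module_name) is None
    ]


missing = missing_packages(required)
if missing:
    # Assumes pip; with Anaconda, conda install may be preferred
    print("Missing packages. Try: pip install " + " ".join(missing))
else:
    print("requests and beautifulsoup4 are available")
```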

Step 2: Use the installed packages to read the page source
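A minimal sketch of this step, assuming the requests package is installed. The helper name fetch_html and the 10-second timeout are illustrative choices, not something the article prescribes.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page source as text, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        # Raise an exception for 4xx and 5xx responses
        response.raise_for_status()
    except requests.RequestException as exc:
        # Covers connection errors, timeouts, invalid URLs, and HTTP errors
        print("Request failed:", exc)
        return None
    return response.text


# Usage (replace the address with the page you want to scrape):
# html_source = fetch_html("https://example.com")
```

Returning None on failure keeps the calling code simple: it only needs to check the result before handing it to a parser.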

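Step 3: Browse the page source to locate the information you want to extract

How you view the page source differs from browser to browser. A few are listed below; each browser's documentation has the details.

- Firefox: right-click the web page and select "View Page Source"

- Safari: see the browser's instructions for viewing page source

- Internet Explorer: see the browser's instructions for viewing page source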

Step 4: Start extracting

Libraries you can use:

- BeautifulSoup: simple; supports CSS selectors but not XPath

- scrapy: supports both CSS selectors and XPath

- Selenium: can scrape dynamic pages (for example, pages that keep loading new content as you scroll)

- lxml, and others

Objects in BeautifulSoup:

- Tag: an XML or HTML tag

- Name: every tag has a name

- Attributes: a tag may have any number of attributes. They are exposed as a dictionary of the form {attribute1_name: attribute1_value, attribute2_name: attribute2_value, ...}. If an attribute has multiple values, the value is stored as a list.

- NavigableString: the text within a tag
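The objects described above can be inspected on a small snippet. The markup here is hypothetical, and the sketch assumes beautifulsoup4 is installed (select_one relies on the CSS selector support that ships with it).

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical markup for illustration
html_doc = '<div><p class="intro lead" id="first">Hello, <b>world</b></p></div>'
soup = BeautifulSoup(html_doc, "html.parser")

# Locate the paragraph with a CSS selector
p = soup.select_one("div > p.intro")

is_tag = isinstance(p, Tag)   # p is a Tag object
tag_name = p.name             # every tag has a name
attributes = p.attrs          # attributes as a dictionary

# The first node inside p is the text that precedes <b>
first_child = p.contents[0]
is_text = isinstance(first_child, NavigableString)

print(tag_name)
print(attributes)
print(repr(str(first_child)))
print(p.get_text())
```

Note that class comes back as a list because it can hold several values, while id is a plain string.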

The code:

# Import the requests and BeautifulSoup packages

# Optional, Jupyter/IPython only: display the value of every expression in a cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# requests sends the HTTP request
import requests

# BeautifulSoup lives in the bs4 package (installed as beautifulsoup4)
from bs4 import BeautifulSoup

# Get web page content

# Send a GET request to the web page.
# The string below is a placeholder, not a valid URL;
# replace it with the address of the page you want to scrape.
page = requests.get("A simple example page")

# Status code 200 indicates success;
# codes of 400 and above indicate a client or server error.
if page.status_code == 200:
    # content gives the response body as bytes
    print(page.content)
    # text gives the response body decoded to a Unicode string
    print(page.text)

# Parse web page content

# Build a BeautifulSoup object from the HTML source using Python's built-in html.parser
soup = BeautifulSoup(page.content, "html.parser")

# The soup object represents the root node of the HTML document tree
print("Soup object:")

# prettify() returns the document as an indented string
print(soup.prettify())

# soup.children returns an iterator over the root's direct children
print("\nsoup children nodes:")
soup_children = soup.children
print(soup_children)

# Convert the iterator to a list
soup_children = list(soup.children)
print("\nlist of children of root:")
print(len(soup_children))

# On this example page, html is the only child of the root node
html = soup_children[0]
print(html)

# Get the head and body tags
html_children = list(html.children)
print("how many children under html? ", len(html_children))
for idx, child in enumerate(html_children):
    print("Child {} is: {}\n".format(idx, child))

# Whitespace between tags counts as a child node, so on this page
# head is the second child of html (index 1)
head = html_children[1]

# Extract all text inside head
print("\nhead text:")
print(head.get_text())

# body is the fourth child of html (index 3)
body = html_children[3]

# Get details of a tag

# Get the first p tag in the div of body
div = list(body.children)[1]
p = list(div.children)[1]

# First, get the data type of p
print("\ndata type:")
print(type(p))

# Get the tag name (a property of the p object)
print("\ntag name: ")
print(p.name)

# A tag's attributes are stored in a dictionary, available as <tag>.attrs;
# each attribute name of the tag is a key
print(p.attrs)

# Get the "class" attribute
print("\ntag class: ")
print(p["class"])

# To check whether 'id' is an attribute of p, test the attrs dictionary
print("id" in p.attrs)

# Get the text of the p tag
print(p.get_text())
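Reaching tags by position, as in html_children[1] or list(div.children)[1], depends on the exact whitespace and layout of the page. As an alternative sketch, the same details can be looked up by tag name with find and find_all. The markup below is hypothetical, written to resemble a simple example page, and is parsed from a string so that no network access is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup resembling a simple example page
html_doc = """
<html>
  <head><title>A simple example page</title></head>
  <body>
    <div>
      <p class="inner-text first-item" id="first">First paragraph.</p>
      <p class="inner-text">Second paragraph.</p>
    </div>
  </body>
</html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# find returns the first matching tag, wherever it sits in the tree
title_text = soup.find("title").get_text()

# find_all returns every matching tag, in document order
paragraphs = soup.find_all("p")
first_p = paragraphs[0]

# has_attr checks for an attribute without raising an error
first_has_id = first_p.has_attr("id")

# get returns None when the attribute is absent
second_id = paragraphs[1].get("id")

texts = [p.get_text() for p in paragraphs]

print(title_text)
print(first_p["class"])
print(first_has_id, second_id)
print(texts)
```

Looking tags up by name keeps working when whitespace or extra nodes shift the child indexes.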

