How to crawl a specific value out of an HTML page with Python HTMLParser

Let's imagine I want to crawl a specific value out of an HTML page, but I have no clear identifier (such as name="abc") for that value. I have to find the value (in this case "dfgd454") through the HTML hierarchy:

<html><body><div id="pagecontent"><div id="container"><div id="content"><div id="tab-description"><div id="attributes"> <div class="attr"> <span class="name">Ugug</span> <span class="value">dfgd454</span> </div>

How can I extract that value with Python's HTMLParser?

It has to be something along the lines of:

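    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            attrD = dict(attrs)
            # .get() avoids a KeyError on divs that have no class attribute
            if attrD.get('class') == 'attr':
                pass  # incomplete: the span's text still has to be captured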

But I know that code is not sufficient...

Thankful for any help; I have googled a lot and have not found a proper solution so far.

Best answer

You could use the BeautifulSoup parser.

    from bs4 import BeautifulSoup

    s = '''<html><body><div id="pagecontent"><div id="container"><div id="content"><div id="tab-description"><div id="attributes"> <div class="attr"> <span class="name">Ugug</span> <span class="value">dfgd454</span> </div>'''

    # Name the parser explicitly; bs4 warns when it has to guess one.
    soup = BeautifulSoup(s, 'html.parser')

    # select() returns a list of matches; [0] takes the first one.
    print(soup.select('div > span.value')[0].text)

This selects every span tag with the class "value" that is a direct child of a div tag; [0] then takes the first match and .text returns its text content.

Output:

dfgd454
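
Since the question asks about HTMLParser specifically, the same lookup can also be done with the standard library alone. The following is only a minimal sketch, assuming Python 3 and assuming the target is always a span with class "value" directly inside a div with class "attr". The names ValueExtractor and VOID_TAGS are made up for this illustration and are not part of any library.

```python
from html.parser import HTMLParser

# Elements that never get an end tag; they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ValueExtractor(HTMLParser):
    """Collect the text of <span class="value"> directly inside <div class="attr">."""

    def __init__(self):
        super().__init__()
        self.stack = []            # (tag, classes) for every element currently open
        self.capture_depth = None  # stack depth of the span being captured, if any
        self.values = []           # one string per matching span

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # A missing or empty class attribute becomes an empty list.
        classes = (dict(attrs).get("class") or "").split()
        parent = self.stack[-1] if self.stack else None
        self.stack.append((tag, classes))
        if (self.capture_depth is None
                and tag == "span" and "value" in classes
                and parent is not None
                and parent[0] == "div" and "attr" in parent[1]):
            self.capture_depth = len(self.stack)
            self.values.append("")

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; this tolerates unclosed elements.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break
        # Stop capturing once the target span itself has been closed.
        if self.capture_depth is not None and len(self.stack) < self.capture_depth:
            self.capture_depth = None

    def handle_data(self, data):
        if self.capture_depth is not None:
            self.values[-1] += data


html_doc = ('<html><body><div id="pagecontent"><div id="container">'
            '<div id="content"><div id="tab-description"><div id="attributes"> '
            '<div class="attr"> <span class="name">Ugug</span> '
            '<span class="value">dfgd454</span> </div>')

parser = ValueExtractor()
parser.feed(html_doc)
parser.close()
print(parser.values[0])
```

The parser keeps a stack of open elements so that handle_starttag can check the parent of each span, which is the piece missing from the handle_starttag fragment in the question. To require more of the hierarchy (for example the div with id "attributes" further up), the same stack could be inspected at deeper positions.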
