How to crawl a specific value out of an HTML page with Python HTMLParser
Let's imagine I want to crawl a specific value out of an HTML page, but I have no clear identifier (such as name="abc") for that value. I have to find the value (in this case "dfgd454") through the HTML hierarchy:

    <html><body><div id="pagecontent"><div id="container"><div id="content"><div id="tab-description"><div id="attributes"> <div class="attr"> <span class="name">Ugug</span> <span class="value">dfgd454</span> </div>

How can I extract that value with Python's HTMLParser?

It has to be something along the lines of:

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            attrD = dict(attrs)
            if attrD['class'] == 'attr':
                pass  # unfinished: this is as far as I got

But I know that code is not sufficient...

Thankful for any help; I have googled a lot and have not found a proper solution so far.
Best answer
You could use the BeautifulSoup parser.

    from bs4 import BeautifulSoup

    s = '''<html><body><div id="pagecontent"><div id="container"><div id="content"><div id="tab-description"><div id="attributes"> <div class="attr"> <span class="name">Ugug</span> <span class="value">dfgd454</span> </div>'''

    # Name the parser explicitly so the result does not depend on which parsers are installed
    soup = BeautifulSoup(s, 'html.parser')
    print(soup.select('div > span.value')[0].text)

The selector matches every span tag with class "value" that is an immediate child of a div tag; [0] takes the first match and .text returns its text.

Output:
    dfgd454
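
Since the question asks specifically about HTMLParser, here is a minimal sketch using only the standard library's html.parser module. It is not part of the answer above. The class name ValueExtractor and its attributes (stack, capturing, values) are made up for illustration. It assumes each class attribute holds exactly one class name, and that no void elements such as <br> sit between the div and the span; real pages may need more care.

```python
from html.parser import HTMLParser


class ValueExtractor(HTMLParser):
    """Collect text of <span class="value"> found directly inside <div class="attr">."""

    def __init__(self):
        super().__init__()
        self.stack = []         # (tag, class) for every element that is currently open
        self.capturing = False  # True while inside a matching span
        self.values = []        # extracted strings, in document order

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get('class')  # .get avoids a KeyError on tags without a class
        parent = self.stack[-1] if self.stack else None
        self.stack.append((tag, cls))
        if (tag, cls) == ('span', 'value') and parent == ('div', 'attr'):
            self.capturing = True
            self.values.append('')

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this tag name
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break
        if tag == 'span':
            self.capturing = False

    def handle_data(self, data):
        if self.capturing:
            self.values[-1] += data


html = '''<html><body><div id="pagecontent"><div id="container"><div id="content"><div id="tab-description"><div id="attributes"> <div class="attr"> <span class="name">Ugug</span> <span class="value">dfgd454</span> </div>'''

parser = ValueExtractor()
parser.feed(html)
parser.close()
print(parser.values[0])
```

The idea is to fill in what the handle_starttag fragment in the question was missing: remember which elements are open, switch on a flag when the wanted span starts, and collect the text in handle_data until the span closes.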