Web scraping a .htm page with preformatted text and no tags

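I'm a newbie working on a web scraping project. I need to get these election results into a dataframe (or Excel) so I can analyze them.

The trickiest part is that the page is a .htm file in which all the data sits as one big text block between "Preformatted Text" (PRE) tags, with no individual tags on the data itself. I am only interested in the parts of the data that are laid out like tables: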

https://www.stlouisco.com/portals/8/docs/document%20library/elections/eresults/el140805/EXEC.htm

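I have been attempting it in Python with BeautifulSoup. However, if you view the source at the URL, you can see why BeautifulSoup isn't getting me very far: the data isn't structured using tags. The structure looks basically like this: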

    <html>
    <pre>
    COUNTY EXECUTIVE PRIMARY ELECTION OFFICIAL FINAL RESULTS
    ST. LOUIS COUNTY, MISSOURI
    RUN DATE:08/18/14 01:20 PM TUESDAY, AUGUST 5, 2014 STATISTICS
    WITH 681 OF 681 PRECINCTS REPORTING
    TOTAL PERCENT TOTAL PERCENT
    01 = REGISTERED VOTERS - TOTAL 661,393 05 = BALLOTS CAST - LIBERTARIAN 1,121 .58
    02 = BALLOTS CAST - TOTAL 192,495 06 = BALLOTS CAST - CONSTITUTION 314 .16
    03 = BALLOTS CAST - DEMOCRATIC 129,918 67.49 07 = BALLOTS CAST - NONPARTISAN 6,225 3.23
    04 = BALLOTS CAST - REPUBLICAN 54,917 28.53 08 = VOTER TURNOUT - TOTAL 29.10
    - - - - - - - - - - - - - - - - - - - - - - - -
    01 02 03 04 05 06 07 08
    - - - - - - - - - - - - - - - - - - - - - - - -
    0101 AP1,2,7,43 1317 . 298 . 214 . 69 . . 3 . . 1 . 11 22.63
    0103 AP3,27 NRW2,8,15,29 1453 . 186 . 179 . . 5 . . 1 . . 0 . . 1 12.80
    0104 AP4 231 . 51 . 34 . . 4 . . 0 . . 0 . 13 22.08
    0105 AP5,18,21,39 1289 . 268 . 198 . 47 . . 4 . . 1 . 18 20.79
    0106 AP6 2 . . 1 . . 0 . . 0 . . 0 . . 0 . . 1 50.00
    0108 AP8,20 586 . 142 . 86 . 44 . . 4 . . 0 . . 8 24.23
    0109 AP9,25 533 . 119 . 85 . 29 . . 2 . . 3 . . 0 22.33
    0110 AP10 1044 . 158 . 114 . 34 . . 2 . . 0 . . 8 15.13
    ...
    2832 WH32,38,44 296 . 51 . 23 . 28 . . 0 . . 0 . . 0 17.23
    2834 WH34,43 2043 . 609 . 267 . 321 . . 1 . . 0 . 20 29.81
    2835 WH35 543 . 173 . 60 . 110 . . 0 . . 0 . . 3 31.86
    ====================================================================================================================================
    (DEMOCRATIC)
    WITH 681 OF 681 REPORTING
    VOTES PERCENT VOTES PERCENT
    COUNTY EXECUTIVE (Vote for ) 1
    01 = CHARLIE A. DOOLEY 39,038 30.52
    02 = STEVE STENGER 84,993 66.46
    03 = RONALD E. LEVY 3,862 3.02
    ------------------
    01 02 03
    ------------------
    0101 AP1,2,7,43 59 134 19
    0103 AP3,27 NRW2,8,15,29 154 18 5
    0104 AP4 7 25 2
    0105 AP5,18,21,39 55 133 9
    0106 AP6 0 0 0
    0108 AP8,20 28 50 7
    0109 AP9,25 21 57 6
    0110 AP10 56 54 1
    0111 AP11,24 53 54 1
    0112 AP12 19 41 1
    0113 AP13 23 46 2
    0114 AP14,15,16 NOR31 25 56 4
    ...
    2819 WH19,20,22 25 162 7
    2825 WH25 17 109 9
    2831 WH31 18 112 7
    2832 WH32,38,44 0 22 1
    2834 WH34,43 31 218 10
    2835 WH35 16 41 3
    ====================================================================================================================================
    (REPUBLICAN)
    WITH 681 OF 681 REPORTING
    VOTES PERCENT
    COUNTY EXECUTIVE (Vote for ) 1
    01 = TONY POUSOSA 16,439 32.10
    02 = RICK STREAM 34,772 67.90
    ------------
    01 02
    ------------
    0101 AP1,2,7,43 24 37
    0103 AP3,27 NRW2,8,15,29 1 4
    0104 AP4 1 3
    0105 AP5,18,21,39 13 28
    0106 AP6 0 0
    0108 AP8,20 16 28
    0109 AP9,25 9 19
    0110 AP10 13 19
    0111 AP11,24 7 32
    ...
    </pre>
    <p>Some closing text that is irrelevant to this project.</p>
    </html>

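I am hoping to use Python to automate this process so I can run it on other, similar webpages of election results.

Here is as far as I've been able to get. I was able to create a list in which each item is one line of the data. I would like to turn it into a data frame with all the extra spaces and periods stripped out, but I'm not sure how to get there from here. I may even be thinking about this from the wrong angle.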

    # STEP 1: Importing the Libraries
    import requests
    from bs4 import BeautifulSoup

    # STEP 2: Collecting and Parsing the webpage
    # Collect the election results page
    page = requests.get('https://www.stlouisco.com/portals/8/docs/document%20library/elections/eresults/el140805/EXEC.htm')
    # Parse the page and create a Beautiful Soup object
    soup = BeautifulSoup(page.text, 'html.parser')

    # STEP 3: Create an object with just the text
    soup2 = soup.text
    # Split the text at each line break \n; this creates a list object
    [x.strip() for x in soup2.split('\n')]

Output:

    [...
     '0212 BON12 1678 . 685 . 376 . 295 . . 1 . . 0 . 13 40.82',
     '0213 BON13,23,26,29 2174 . 796 . 500 . 261 . . 3 . . 2 . 30 36.61',
     '0214 BON14 17 . . 4 . . 0 . . 0 . . 0 . . 0 . . 4 23.53',
     '0215 BON15 1340 . 369 . 224 . 129 . . 2 . . 1 . 13 27.54',
     '0216 BON16 204 . 104 . 68 . 36 . . 0 . . 0 . . 0 50.98',
     '0217 BON17 589 . 93 . 71 . 16 . . 1 . . 0 . . 5 15.79',
     '0218 BON18 195 . 48 . 28 . 17 . . 0 . . 1 . . 2 24.62',
     '0219 BON19 CLA15 1340 . 443 . 255 . 172 . . 5 . . 0 . 11 33.06',
     ...]

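I feel stuck and would greatly appreciate any advice! (And if Python isn't the best way to automate getting this into a dataframe, I welcome that feedback too.) Thank you.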

Best answer

Looking at the example you've cited, you'll need to write a parser since your data is complex and varies across each line (and most likely, each page).

Using this line as an example, hopefully I can explain why:

    0101 AP1,2,7,43 1317 . 298 . 214 . 69 . . 3 . . 1 . 11 22.63

The first part, 0101, is relatively consistent across lines: it appears to be some sort of zero-padded integer index, and it is followed by one space.

The next part (AP1,2,7,43) follows certain rules, but its content varies. For example, the number of comma-separated values differs from line to line, and the value can sometimes contain whitespace (e.g. AP3,27 NRW2,8,15,29). It is followed by a run of whitespace up to the next section, which appears to hold the vote counts.

In these integer columns, each value is followed by whitespace and then a combined separator of a dot and a space. Shorter numbers are padded: if an integer is less than 10, the ". " separator is repeated so that it also fills the hundreds position. The last column, 22.63, is a regular floating-point number with two decimal places.
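To make that anatomy concrete, here is a minimal sketch that pulls one statistics row apart using only the standard library. It assumes, based solely on the sample lines in the question, that every row starts with a four-digit code, that the padding dots appear as standalone tokens, and that the row ends with eight numeric values. The function name parse_stats_row and the n_values parameter are made up for illustration.

```python
import re

# Assumed layout (inferred from the sample lines, not from a format spec):
#   <4-digit code> <precinct name, may contain spaces> <7 integers> <turnout %>
ROW_START = re.compile(r"^\d{4}\s")


def parse_stats_row(line, n_values=8):
    """Return (code, name, values) for a precinct row, or None for other lines."""
    if not ROW_START.match(line):
        return None  # header, separator, or blank line
    # Standalone "." tokens are padding only, so drop them.
    tokens = [t for t in line.split() if t != "."]
    code, rest = tokens[0], tokens[1:]
    if len(rest) <= n_values:
        return None  # not enough tokens for a name plus all the values
    name = " ".join(rest[:-n_values])
    raw = rest[-n_values:]
    counts = [int(v.replace(",", "")) for v in raw[:-1]]
    turnout = float(raw[-1])
    return code, name, counts + [turnout]


print(parse_stats_row("0101 AP1,2,7,43 1317 . 298 . 214 . 69 . . 3 . . 1 . 11 22.63"))
```

Counting values from the right-hand end is what lets the precinct name contain spaces without confusing the split.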

And this does not yet touch the other kinds of lines, each of which has its own rules.

Given the complexity of your dataset, you're better off writing a simple grammar with a tool like pyparsing or PLY to create mini-parsers that automatically extract the information from each line. The extracted values can then be placed in a data structure and saved to a dataframe. A good pyparsing example that applies here shows how to parse street addresses, and the pyparsing project publishes more examples.
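As a rough illustration of what such a grammar could look like, here is a sketch for the statistics rows only. It assumes pyparsing 3.x is installed and that the token rules inferred from the sample lines hold across the page. All of the names (precinct_code, name_token, stats_row, parse_row) are hypothetical, and the other line types would each need their own rule.

```python
from pyparsing import Group, Literal, OneOrMore, Regex, Suppress, Word, ZeroOrMore, nums

# Four-digit, zero-padded precinct index such as "0101".
precinct_code = Word(nums, exact=4)

# Assumption: every name token starts with a capital letter, e.g. "AP1,2,7,43".
name_token = Regex(r"[A-Z][A-Z0-9,]*")
precinct_name = OneOrMore(name_token).set_parse_action(lambda t: " ".join(t))

# Padding dots carry no information, so they are matched and discarded.
filler = Suppress(ZeroOrMore(Literal(".")))

# An integer count; the lookahead keeps it from swallowing the "22" of "22.63".
count = Regex(r"\d+(?![.\d])").set_parse_action(lambda t: int(t[0]))
percent = Regex(r"\d*\.\d+").set_parse_action(lambda t: float(t[0]))

stats_row = precinct_code + precinct_name + Group(OneOrMore(filler + count)) + percent


def parse_row(line):
    """Parse one statistics row into (code, name, counts, turnout)."""
    code, name, counts, turnout = stats_row.parse_string(line, parse_all=True)
    return code, name, list(counts), turnout


print(parse_row("0101 AP1,2,7,43 1317 . 298 . 214 . 69 . . 3 . . 1 . 11 22.63"))
```

Lines that do not fit the grammar raise pyparsing's ParseException, which is a convenient way to skip headers and separators or to notice when another page uses a different layout.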

Notably, all of this could be dealt with by writing custom text manipulation functions and code, but given that you intend to automate things, a parser is your best bet since it will be reusable and more adaptable.
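For comparison, the custom text-manipulation route might look like the sketch below. It is standard library only and works on the text already extracted from the PRE block. It assumes sections are separated by long runs of "=" characters and that precinct-name tokens always contain at least one letter. The helper names split_sections and parse_rows are hypothetical.

```python
import re

ROW = re.compile(r"^(\d{4})\s+(.*)$")      # four-digit code, then the rest of the row
NUMBER = re.compile(r"^[\d,]*\.?\d+$")     # 1317, 1,121, 22.63 or .58


def split_sections(text):
    """Split the page text into blocks at the long '=====' separator lines."""
    return [s for s in re.split(r"\n?={20,}\n?", text) if s.strip()]


def to_number(token):
    """Convert a numeric token to float if it has a decimal point, else int."""
    token = token.replace(",", "")
    return float(token) if "." in token else int(token)


def parse_rows(section):
    """Return [code, name, value, ...] for every precinct row in one section."""
    rows = []
    for line in section.splitlines():
        match = ROW.match(line.strip())
        if not match:
            continue  # headers, candidate keys, dashed rules
        tokens = [t for t in match.group(2).split() if t != "."]
        # Walk back from the end: trailing numeric tokens are the values,
        # everything before them is the precinct name.
        i = len(tokens)
        while i > 0 and NUMBER.match(tokens[i - 1]):
            i -= 1
        rows.append([match.group(1), " ".join(tokens[:i])] + [to_number(t) for t in tokens[i:]])
    return rows
```

Because the number of value columns is detected per row, the same function handles the eight-column statistics table and the narrower per-party tables. Each list of rows could then be passed to pandas.DataFrame or written out with the csv module.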
