
Parsing HTML tables with Python 3

The internship is unbelievably boring. I can't focus on reading, and clicking along with the instructor through those brain-dead ancient programs is out of the question. I remembered asking a junior classmate a while back to help parse the course-schedule page, which went nowhere, so I decided to hack on it myself for fun. Not that the result seems particularly useful…

Requirements

Parse an HTML page with Python 3 and output the tables as a JSON structure tree.

Reference code:

https://github.com/schmijos/html-table-parser-python3

This is parsing code written by a developer abroad. I tested it and it works, but only for simple tables; complex tables are not handled.

The problem

A picture first:

image-20191024132057056.png

The original program does not account for merged cells, so the parse result for a complex table ends up misaligned. We therefore need to extend it to handle cells that span rows or columns.
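To make the misalignment concrete, here is a made-up example (the timetable data is invented) of what a naive parse of a table with a rowspan cell looks like, versus the padded version the fix below produces:

```python
# A 3x3 timetable whose first body cell has rowspan="2".
naive = [
    ['Period', 'Mon', 'Tue'],
    ['Morning', 'Math', 'English'],  # 'Morning' carries rowspan="2"
    ['PE', 'Art'],                   # naive parse: row is one cell short,
]                                    # so 'PE' slides into the wrong column

padded = [
    ['Period', 'Mon', 'Tue'],
    ['Morning', 'Math', 'English'],
    ['-', 'PE', 'Art'],              # '-' marks the spanned position
]

assert not all(len(row) == 3 for row in naive)
assert all(len(row) == 3 for row in padded)
```

With the placeholder in place, every row has the same length and column indices stay meaningful.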

So let's build on top of that code~ (reinventing the wheel myself is never going to happen)

Code

The code is fairly short, so I'll paste it in full; my changes are all marked in the comments.

parser.py

# -----------------------------------------------------------------------------
# Name:        html_table_parser
# Purpose:     Simple class for parsing an (x)html string to extract tables.
#              Written in python3
#
# Author:      Josua Schmid
#
# Created:     05.03.2014
# Copyright:   (c) Josua Schmid 2014
# Licence:     AGPLv3
#
# ChangeLog:   Add logic to handle rowspan and colspan  2019.10.24
# Author:      guiu
# -----------------------------------------------------------------------------

from html.parser import HTMLParser


class HTMLTableParser(HTMLParser):
    """ This class serves as a html table parser. It is able to parse multiple
    tables which you feed in. You can access the result per .tables field.
    """
    def __init__(
        self,
        decode_html_entities=False,
        data_separator=' ',
    ):

        HTMLParser.__init__(self)

        self._parse_html_entities = decode_html_entities
        self._data_separator = data_separator

        self._in_td = False
        self._in_th = False
        self._current_table = []
        self._current_row = []
        self._current_cell = []
        self.tables = []

        # Add two flags to track pending rowspan / colspan values
        self.row_flag = 0
        self.col_flag = 0

    def handle_starttag(self, tag, attrs):
        """ We need to remember the opening point for the content of interest.
        The other tags (<table>, <tr>) are only handled at the closing point.
        """
        if tag == 'td':
            # Check whether this cell spans rows or columns, and record the
            # span value in the flags. Note: for rowspan, only the first
            # column is handled here, because on the student-record page only
            # the first column uses rowspan; the other columns are a bit more
            # complex, so they are left for later.
            for i in attrs:
                if i[0] == 'rowspan' and i[1]:
                    self.row_flag = int(i[1])
                if i[0] == 'colspan' and i[1]:
                    self.col_flag = int(i[1])

            self._in_td = True
        if tag == 'th':
            self._in_th = True

    def handle_data(self, data):
        """ This is where we save content to a cell """
        if self._in_td or self._in_th:
            self._current_cell.append(data.strip())

    def handle_charref(self, name):
        """ Handle HTML encoded characters """

        if self._parse_html_entities:
            self.handle_data(self.unescape('&#{};'.format(name)))

    def handle_endtag(self, tag):
        """ Here we exit the tags. If the closing tag is </tr>, we know that we
        can save our currently parsed cells to the current table as a row and
        prepare for a new row. If the closing tag is </table>, we save the
        current table and prepare for a new one.
        """
        if tag == 'td':
            self._in_td = False
        elif tag == 'th':
            self._in_th = False

        if tag in ['td', 'th']:
            final_cell = self._data_separator.join(self._current_cell).strip()
            self._current_row.append(final_cell)

            # When a cell spans columns, pad the adjacent columns with '-'
            # to mark the span
            if self.col_flag:
                for i in range(self.col_flag - 1):
                    self._current_row.append('-')
                self.col_flag = 0

            self._current_cell = []

        elif tag == 'tr':
            self._current_table.append(self._current_row)
            self._current_row = []

            # The first column spans rows, so from the second spanned row
            # onward, start the row with a '-' placeholder
            if self.row_flag - 1 >= 1:
                self._current_row = ['-']
                self.row_flag = self.row_flag - 1

        elif tag == 'table':
            self.tables.append(self._current_table)
            self._current_table = []
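One detail worth noting about the rowspan/colspan check above: html.parser hands handle_starttag its attributes as a list of lowercased (name, value) string tuples, which is why the loop compares i[0] against 'rowspan'/'colspan' and converts i[1] with int(). A minimal standalone probe (the AttrPeek class name is mine, for illustration only):

```python
from html.parser import HTMLParser

class AttrPeek(HTMLParser):
    """Tiny probe that records the attrs passed for each <td>."""
    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.seen.append(attrs)

p = AttrPeek()
p.feed('<td ROWSPAN="2" colspan="3">x</td>')
print(p.seen)  # [[('rowspan', '2'), ('colspan', '3')]] — names lowercased
```

Even an uppercase ROWSPAN in the source HTML arrives lowercased, so a single lowercase comparison covers all cases.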

example_of_usage.py:

# -----------------------------------------------------------------------------
# Created:     04.03.2014
# Copyright:   (c) Josua Schmid 2014
# Licence:     AGPLv3
#
# Sample script for parsing HTML tables
# -----------------------------------------------------------------------------

import urllib.request
from pprint import pprint
from html_table_parser import HTMLTableParser


def url_get_contents(url):
    """ Opens a website and read its binary contents (HTTP Response Body) """
    req = urllib.request.Request(url=url)
    f = urllib.request.urlopen(req)
    return f.read()


def main():
    # url = 'http://guiu.xyz/StuProductionSchedule.html'
    # xhtml = url_get_contents(url).decode('utf-8')

    with open('./StuProductionSchedule.html', encoding='utf-8') as f:
        xhtml = f.read()

    p = HTMLTableParser()
    p.feed(xhtml)
    pprint(p.tables)


if __name__ == '__main__':
    main()

The author also provides an option to output CSV:

python3 ./html_table_converter -u http://guiu.xyz/StuProductionSchedule.html -o metaltrain

Result:

image-20191024133822154.png

Works perfectly.

The next step is simply converting the list the code outputs into a JSON file. Class is over, so I'll put that off for now.
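That last step can be sketched in a few lines with the standard json module (the tables value below is made-up stand-in data for what p.tables holds after parsing):

```python
import json

# Stand-in for p.tables: a list of tables, each a list of rows of cells.
tables = [[['Name', 'Score'], ['Alice', '90'], ['-', '85']]]

# ensure_ascii=False keeps any non-ASCII cell text readable in the file.
with open('tables.json', 'w', encoding='utf-8') as f:
    json.dump(tables, f, ensure_ascii=False, indent=2)
```

Reading the file back with json.load() yields the same nested lists, so the structure round-trips cleanly.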