An XPath copied directly from Chrome fails to extract the corresponding content in Python
The XPath copied from Chrome contains an extra tbody!!
- Problem: the copied XPath matches nothing in lxml (the same happens in Scrapy).

Cause
- Browsers have error correction for non-standard HTML documents, while lxml parses the raw page source. Note: that means the source code (view-source), not what the developer tools panel shows;
- The last table does not contain a tbody. The browser fills in the tbody automatically, but lxml does not, so your XPath finds nothing
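The difference is easy to demonstrate: parse a table written without a tbody (as it appears in the raw source) and query it both ways. A minimal sketch, with made-up sample HTML:

```python
from lxml import etree

# A table with no tbody, exactly as it often appears in view-source
html = '<table><tr><td><a href="#">Beijing</a></td></tr></table>'
root = etree.HTML(html)

# Chrome's DevTools would show a <tbody> here, but lxml keeps the source as-is
print(root.xpath('//table/tbody/tr/td/a/text()'))  # [] -- the tbody path fails
print(root.xpath('//table/tr/td/a/text()'))        # ['Beijing']
```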
An example of the problem~
```python
# -*- coding: utf-8 -*-
from lxml import etree
import requests

url = 'http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2014/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36'}
html = requests.get(url, headers=headers)
html.encoding = 'GBK'
selector = etree.HTML(html.text)
content = selector.xpath('//html/body/table[2]/tbody/tr[1]/td/table/tbody/tr[2]/td/table/tbody/tr/td/table/tbody/tr/td/a/text()')
for each in content:
    print(each)
```

Other problems in the code!
The XPath in the code is far too long, which makes it error-prone!
It can be rewritten as:

```python
from lxml import etree
import requests

url = 'http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2014/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36'}
html = requests.get(url, headers=headers)
html.encoding = 'GBK'
selector = etree.HTML(html.text)
# Anchor on a distinctive class instead of a long absolute path
nodes = selector.xpath('//tr[@class="provincetr"]/node()')
for each in nodes:
    print(each.xpath('string()'))
```
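If you do want to keep an XPath already copied from Chrome, another option is simply to strip the tbody steps before handing it to lxml. A minimal sketch (the helper name `strip_tbody` is mine, not from the original post):

```python
def strip_tbody(xpath):
    """Remove the /tbody/ steps that Chrome's 'Copy XPath' inserts."""
    return xpath.replace('/tbody/', '/')

copied = '//html/body/table[2]/tbody/tr[1]/td/table/tbody/tr/td/a/text()'
print(strip_tbody(copied))
# //html/body/table[2]/tr[1]/td/table/tr/td/a/text()
```

Note this simple string replacement does not handle a `/tbody` at the very end of an expression; a class- or id-based XPath like the one above is still the more robust fix.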