Convenient snippets for writing code

Scraper initialization template


Purpose: boilerplate for initializing a requests-based scraper script.

import re
import urllib
import requests
from bs4 import BeautifulSoup

url = ""
headers = {}
proxies = {
    # 'http': 'socks5://xxxx:5555',
    # 'https': 'socks5://xxxx:5555'
}

result = requests.get(url, proxies=proxies, headers=headers)
soup = BeautifulSoup(result.content.decode('utf-8'), 'lxml')
target = soup.find('div')
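Once the page is fetched, BeautifulSoup lookups work the same way on any HTML string, so the parsing step can be tried offline. A minimal sketch — the HTML fragment and the `post` class name are made up for illustration, and the stdlib `html.parser` is used here to avoid the `lxml` dependency:

```python
from bs4 import BeautifulSoup

# a made-up HTML fragment standing in for result.content
html = "<div class='post'><p>hello</p></div>"
soup = BeautifulSoup(html, 'html.parser')  # stdlib parser; no lxml needed
target = soup.find('div', class_='post')
print(target.p.text)  # hello
```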

Fixing # in a URL


Purpose: prevent special characters such as # in the target URL from interfering with the scraper.

def sharp_fix(url):
    """
    The sharp sign (#) can cause trouble in a URL.
    param: url
    """
    if url.find('#') >= 0:
        strs = url.split('#')
        if is_chinese(strs[1]):
            fix = urllib.parse.quote(strs[1])
            return strs[0] + '%23' + fix
        return url
    return url
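The encoding step `sharp_fix` relies on can be checked in isolation: `urllib.parse.quote` percent-encodes the UTF-8 bytes of non-ASCII text. A quick sketch (the sample fragment is just an illustration):

```python
from urllib.parse import quote

# percent-encode a Chinese URL fragment: each UTF-8 byte becomes %XX
encoded = quote('中文')
print(encoded)  # %E4%B8%AD%E6%96%87

# '#' is also escaped by quote, matching the explicit '%23' used in sharp_fix
print(quote('#'))  # %23
```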

Checking whether a string contains Chinese


Purpose: determine whether a string contains any Chinese characters.

def is_chinese(string):
    """
    Check whether the string contains any Chinese characters.
    param: string
    """
    for ch in string:
        if u'\u4e00' <= ch <= u'\u9fff':
            return True

    return False
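The same check can be written more compactly with `any()`; `has_cjk` below is a hypothetical name for this one-line variant, testing the same CJK Unified Ideographs range:

```python
def has_cjk(s):
    # True if any character falls in the CJK Unified Ideographs block
    return any('\u4e00' <= ch <= '\u9fff' for ch in s)

print(has_cjk('hello'))  # False
print(has_cjk('你好'))   # True
```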

File (txt) operation starter template


Purpose: read and write text files. Here, the contents of every .txt file in the current directory are read, processed by a given function, and written to a new text file.

import re
import os


def do_something(contents):
    # placeholder: process a line and return the result
    return contents


if __name__ == "__main__":
    file_list = os.listdir()
    target = 'target.txt'

    with open(target, 'a+', encoding='UTF-8') as source:  # a+ w+ rb
        for file_name in file_list:
            if file_name.endswith('.txt') and file_name != target:
                with open(file_name, 'r', encoding='UTF-8') as file:
                    print(file_name + ' done')
                    for contents in file.readlines():
                        source.write(do_something(contents))
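The read-process-write loop above can be exercised end to end without touching the working directory. A self-contained sketch using `tempfile`, where the uppercase transform is just a stand-in for `do_something`:

```python
import os
import tempfile


def do_something(line):
    # stand-in transform: uppercase the line (illustration only)
    return line.upper()


with tempfile.TemporaryDirectory() as tmp:
    # create two sample input files
    for name, text in [('a.txt', 'one\n'), ('b.txt', 'two\n')]:
        with open(os.path.join(tmp, name), 'w', encoding='UTF-8') as f:
            f.write(text)

    target = 'target.txt'
    with open(os.path.join(tmp, target), 'a+', encoding='UTF-8') as source:
        for file_name in sorted(os.listdir(tmp)):
            if file_name.endswith('.txt') and file_name != target:
                with open(os.path.join(tmp, file_name), 'r', encoding='UTF-8') as file:
                    for contents in file.readlines():
                        source.write(do_something(contents))

    with open(os.path.join(tmp, target), 'r', encoding='UTF-8') as f:
        result = f.read()

print(result)  # ONE and TWO on separate lines
```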