Convenient snippets for writing code

Scraper initialization template


Purpose: boilerplate for initializing a requests-based scraper script.

import re
import urllib
import requests
from bs4 import BeautifulSoup

url = ""
headers = {}
proxies = {
    # 'http': 'socks5://xxxx:5555',
    # 'https': 'socks5://xxxx:5555'
}

result = requests.get(url, proxies=proxies, headers=headers)
soup = BeautifulSoup(result.content.decode('utf-8'), 'lxml')
target = soup.find('div')
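Once the page is fetched, BeautifulSoup lookups work the same way on any HTML string, so the parsing step can be tried offline. A minimal sketch — the HTML fragment and the `post` class name are made up for illustration, and the stdlib `html.parser` is used here to avoid the `lxml` dependency:

```python
from bs4 import BeautifulSoup

# a made-up HTML fragment standing in for result.content
html = "<div class='post'><p>hello</p></div>"
soup = BeautifulSoup(html, 'html.parser')  # stdlib parser; no lxml needed
target = soup.find('div', class_='post')
print(target.p.text)  # hello
```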

Fixing # in a URL


Purpose: prevent special characters such as # in the target URL from interfering with the scraper.

def sharp_fix(url):
    """
    The sharp sign (#) can cause trouble in a URL.
    param: url
    """
    if url.find('#') >= 0:
        strs = url.split('#')
        if is_chinese(strs[1]):
            fix = urllib.parse.quote(strs[1])
            return strs[0] + '%23' + fix
        return url
    return url
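The encoding step `sharp_fix` relies on can be checked in isolation: `urllib.parse.quote` percent-encodes the UTF-8 bytes of non-ASCII text. A quick sketch (the sample fragment is just an illustration):

```python
from urllib.parse import quote

# percent-encode a Chinese URL fragment: each UTF-8 byte becomes %XX
encoded = quote('中文')
print(encoded)  # %E4%B8%AD%E6%96%87

# '#' is also escaped by quote, matching the explicit '%23' used in sharp_fix
print(quote('#'))  # %23
```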

Checking whether a string contains Chinese


Purpose: determine whether a string contains any Chinese characters.

def is_chinese(string):
    """
    Check whether the string contains any Chinese characters.
    param: string
    """
    for ch in string:
        if u'\u4e00' <= ch <= u'\u9fff':
            return True

    return False
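The same check can be written more compactly with `any()`; `has_cjk` below is a hypothetical name for this one-line variant, testing the same CJK Unified Ideographs range:

```python
def has_cjk(s):
    # True if any character falls in the CJK Unified Ideographs block
    return any('\u4e00' <= ch <= '\u9fff' for ch in s)

print(has_cjk('hello'))  # False
print(has_cjk('你好'))   # True
```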

File (txt) operation starter template


Purpose: read and write text files. Here, the contents of every .txt file in the current directory are read, processed by a given function, and written to a new text file.

import re
import os


def do_something(contents):
    # placeholder: process a line and return the result
    return contents


if __name__ == "__main__":
    file_list = os.listdir()
    target = 'target.txt'

    with open(target, 'a+', encoding='UTF-8') as source:  # a+ w+ rb
        for file_name in file_list:
            if file_name.endswith('.txt') and file_name != target:
                with open(file_name, 'r', encoding='UTF-8') as file:
                    print(file_name + ' done')
                    for contents in file.readlines():
                        source.write(do_something(contents))
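The read-process-write loop above can be exercised end to end without touching the working directory. A self-contained sketch using `tempfile`, where the uppercase transform is just a stand-in for `do_something`:

```python
import os
import tempfile


def do_something(line):
    # stand-in transform: uppercase the line (illustration only)
    return line.upper()


with tempfile.TemporaryDirectory() as tmp:
    # create two sample input files
    for name, text in [('a.txt', 'one\n'), ('b.txt', 'two\n')]:
        with open(os.path.join(tmp, name), 'w', encoding='UTF-8') as f:
            f.write(text)

    target = 'target.txt'
    with open(os.path.join(tmp, target), 'a+', encoding='UTF-8') as source:
        for file_name in sorted(os.listdir(tmp)):
            if file_name.endswith('.txt') and file_name != target:
                with open(os.path.join(tmp, file_name), 'r', encoding='UTF-8') as file:
                    for contents in file.readlines():
                        source.write(do_something(contents))

    with open(os.path.join(tmp, target), 'r', encoding='UTF-8') as f:
        result = f.read()

print(result)  # ONE and TWO on separate lines
```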