Daily Python Learning Notes, Part 1

Posted by chinaljr on March 30, 2018

Storing Objects

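import pickle

def _readobj(path):
    # Pickle data is binary, so the file is opened with "rb".
    with open(path, "rb") as file_obj:
        ans = pickle.load(file_obj)
    return ans

def _saveobj(path, obj):
    # Likewise, writing uses "wb".
    with open(path, "wb") as f:
        pickle.dump(obj, f)

# Note the modes: use the binary "wb" / "rb", not the text "w" / "r".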
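A minimal round-trip sketch of the save/load pattern above, assuming a temporary directory is acceptable for the demo. The names `save_obj`, `read_obj`, and `demo.pkl` are made up for illustration.

```python
import os
import pickle
import tempfile


def save_obj(path, obj):
    # Pickle writes bytes, so the file must be opened in binary mode ("wb").
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def read_obj(path):
    # Reading needs binary mode as well ("rb").
    with open(path, "rb") as f:
        return pickle.load(f)


data = {"words": ["a", "b"], "count": 2}
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.pkl")  # hypothetical file name
    save_obj(path, data)
    restored = read_obj(path)

print(restored == data)  # True
```

Only unpickle files you trust: `pickle.load` can execute arbitrary code while loading.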

scipy.sparse.csr.csr_matrix

File Read Errors

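f = open("XX.txt", "r").read()
# UnicodeDecodeError: 'gbk' codec can't decode byte 0xad in position 3: illegal multibyte sequence
# Without an encoding argument, open() uses the platform default (GBK here), so pass the file's real encoding:
f = open("XX.txt", "r", encoding="utf-8").read()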
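A small sketch of the same fix in a self-contained form, assuming the file really is UTF-8. The file name `demo.txt` and the sample text are hypothetical. It also shows that the same bytes cannot be read correctly as GBK.

```python
import os
import tempfile

text = "中文文本"
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.txt")  # hypothetical file name

    # Write with an explicit encoding so the bytes on disk are known.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Read back with the same encoding; the with-block also closes the file.
    with open(path, "r", encoding="utf-8") as f:
        content = f.read()

    # The raw bytes, to try a mismatched codec on.
    with open(path, "rb") as f:
        raw = f.read()

# Decoding UTF-8 bytes as GBK either raises or produces different characters.
try:
    gbk_ok = raw.decode("gbk") == text
except UnicodeDecodeError:
    gbk_ok = False

print(content == text, gbk_ok)  # True False
```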

BeautifulSoup

It splits HTML code into pieces by tag.

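There are far too many identical copies of this tutorial online, and it is impossible to tell which one is the original, damn. I just picked one that was posted relatively early. Tutorial site

Personally I find it a bit... not very concise. I won't "borrow" the tutorial itself for this blog; I'll just write down a few things to watch out for.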

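  • .descendants means all descendants (children, grandchildren, and so on), damn it! If I had known that, why not just write code that deletes every <.*> in the text!!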
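The "delete every <.*>" idea from the note above can be sketched with a regular expression. This is a rough sketch, not a real HTML parser: it assumes simple markup with no comments, scripts, or `>` characters inside attribute values. The helper name `strip_tags` is made up for illustration.

```python
import re


def strip_tags(html):
    # [^>]* instead of .* keeps each match inside a single tag;
    # a greedy <.*> would swallow everything from the first "<" to the last ">".
    no_tags = re.sub(r"<[^>]*>", " ", html)
    # Collapse the runs of whitespace left behind by the removed tags.
    return re.sub(r"\s+", " ", no_tags).strip()


sample = "<div><p>Hello <b>world</b></p><p>again</p></div>"
print(strip_tags(sample))  # Hello world again
```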

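I used it while learning Chinese text classification. My attempt is below. At first I wasn't sure how to check a node's type, so the code is somewhat clumsy.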

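import re
from bs4 import BeautifulSoup
from bs4.element import Tag

def shit(x):
    # A Tag can hold further nodes, so recurse into its children.
    if isinstance(x, Tag):
        res = ""
        for t in x:
            res = res + " " + shit(t)
        return res
    else:
        # Text node: drop parenthesized remarks.
        # Assumption: the four patterns cover every pairing of
        # full-width （） and half-width () parentheses.
        tmp = str(x)
        tmp = re.sub(r'（.*）', " ", tmp)
        tmp = re.sub(r'（.*\)', " ", tmp)
        tmp = re.sub(r'\(.*）', " ", tmp)
        return re.sub(r'\(.*\)', " ", tmp)

def dehtml(text):
    soup = BeautifulSoup(text, "html.parser")
    # html.parser does not add a missing <body>, so fall back to the whole document.
    root = soup.body if soup.body is not None else soup
    ans = ""
    for child in root.children:
        ans = ans + " " + shit(child)
    ans = ans.replace("\n", " ")
    ans = ans.replace("\t", " ")
    ans = ans.replace("-", " ")
    ans = ans.replace("\r", "。")
    ans = ans.replace("\u3000", "。")
    ans = ans.replace("  ", " ")
    return ans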

re.sub

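re.sub(pattern, replacement, source)

ans = re.sub(r'（.*）', " ", ans)   # full-width parentheses (assumed; an unescaped '(.*)' is a capture group that matches everything)
ans = re.sub(r'\(.*\)', " ", ans)  # half-width parentheses must be escaped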
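One thing worth knowing about the `\(.*\)` pattern is that `.*` is greedy. The sketch below, using a made-up sample string, compares it with the non-greedy `.*?`.

```python
import re

text = "keep (drop one) middle (drop two) end"

# Greedy: .* runs from the first "(" to the last ")", so "middle" is removed too.
greedy = re.sub(r"\(.*\)", " ", text)

# Non-greedy: .*? stops at the nearest ")", so each group is removed on its own.
lazy = re.sub(r"\(.*?\)", " ", text)

print(greedy.split())  # ['keep', 'end']
print(lazy.split())    # ['keep', 'middle', 'end']
```

Which one is right depends on the text: greedy matching is only safe when a string holds at most one parenthesized group.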


enumerate

enumerate(List)
list(enumerate(['a', 'b', 'c', 'd', 'e']))
# [(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd'), (4, 'e')]
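A short sketch of the usual ways to use `enumerate`, with a made-up list. The `start` argument is a standard parameter of the built-in.

```python
letters = ["a", "b", "c"]

# enumerate returns a lazy iterator of (index, item) pairs; list() materializes it.
pairs = list(enumerate(letters))

# start changes the first index, which is handy for 1-based numbering.
from_one = list(enumerate(letters, start=1))

# The most common use: unpack the pair directly in a for loop.
for index, letter in enumerate(letters):
    print(index, letter)
```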

map

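D_List = list(map(function, S_List))
Applies function to every element of S_List and builds a new D_List. In Python 3, map returns a lazy iterator, so it has to be wrapped in list() to get an actual list.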
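A minimal sketch of `map` in Python 3, using `len` as the function and a made-up word list. It shows that the result is an iterator that can only be consumed once.

```python
words = ["apple", "fig", "kiwi"]

lengths = map(len, words)    # a lazy iterator, nothing is computed yet
length_list = list(lengths)  # consumes the iterator and builds the list
leftover = list(lengths)     # the iterator is now exhausted

# A list comprehension gives the same result and is often easier to read.
same = [len(w) for w in words]

print(length_list)  # [5, 3, 4]
print(leftover)     # []
```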

str.replace

ans.replace(xx,xx)

This does absolutely nothing!

ans = ans.replace(xx,xx)

This is the right way. Strings are immutable, so replace only returns the result and leaves the original untouched.
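A small sketch of the pitfall, with a made-up string. The optional third argument of `str.replace` limits the number of replacements.

```python
s = "a-b-c"

s.replace("-", " ")  # the new string is created and immediately discarded
unchanged = s        # s still holds the original text

t = s.replace("-", " ")              # keep the result by assigning it
first_only = s.replace("-", " ", 1)  # replace only the first occurrence

print(unchanged)   # a-b-c
print(t)           # a b c
print(first_only)  # a b-c
```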

\r and \n

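A \r written to a file comes back as \n when the file is read in text mode. With the default newline=None, universal newlines mode translates \r and \r\n into \n on reading. Open the file with newline="" to keep the \r.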
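A sketch of where the change actually happens, assuming Python 3 text-mode files. The file name `newline_demo.txt` is hypothetical. The `\r` is stored as written; it is the default read that turns it into `\n`.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "newline_demo.txt")  # hypothetical file name

    # newline="" disables newline translation, so "\r" is written unchanged.
    with open(path, "w", newline="") as f:
        f.write("a\rb")

    # Default text mode: universal newlines turns "\r" into "\n" on reading.
    with open(path, "r") as f:
        translated = f.read()

    # newline="" on reading returns the characters exactly as stored.
    with open(path, "r", newline="") as f:
        untouched = f.read()

print(repr(translated))  # 'a\nb'
print(repr(untouched))   # 'a\rb'
```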