Analyzing WeChat Chat History with Python

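On New Year's Eve 2021 I was in no mood to study, and my girlfriend had to work overtime and couldn't spend the evening with me. I figured I should at least do something meaningful. We had been together for half a year, so on a whim I wrote a simple program to analyze our WeChat chat history from those six months. Maybe it would turn up some unexpected surprises :heart::heart::heart: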

There are roughly two steps: first, extract the WeChat chat history and convert it to a txt file; second, analyze that txt file.

Extracting the WeChat chat history

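My first idea was to use the chat backup feature built into the WeChat PC client, but the backup files it produces are encrypted and can't be opened, so I took the following approach instead (it only works for iPhone):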

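Back up the iPhone with iTunes on a PC, choosing to back up to this computer; be careful not to select encryption
Use the WechatExport-iOS tool on GitHub (https://github.com/stomakun/WechatExport-iOS ) to export the WeChat chat history as txt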

Text analysis of the chat history

First I counted the number of messages each of us sent:

# -*- coding: utf-8 -*-
# @Author  : Hugo Wang
# @Time    : 2021/12/31 17:02
# @IDE     : PyCharm
# @Function:

import re

me_sent_count = 0
she_sent_count = 0

with open("FutabaRio_.txt", encoding='utf-8') as record:
    # This assumes each exported line starts with the sender's display name
    for text in record:
        if re.match("老婆～✨", text):
            she_sent_count += 1
        elif re.match("槐十", text):
            me_sent_count += 1

print(me_sent_count, she_sent_count)
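
The same tally generalizes to any number of senders with collections.Counter. The following is a minimal sketch, not the script that produced the numbers in this post. It assumes, as the code above does, that every message line in the export begins with the sender's display name. The function name count_by_sender and the sample lines are made up for illustration.

```python
from collections import Counter


def count_by_sender(lines, senders):
    """Count how many lines start with each of the given sender names.

    Assumption: each message line begins with the sender's display name.
    """
    counts = Counter({name: 0 for name in senders})
    for line in lines:
        for name in senders:
            if line.startswith(name):
                counts[name] += 1
                break  # a line is attributed to at most one sender
    return counts


# Hypothetical sample lines, for illustration only
sample = [
    "槐十 2021-12-31 17:00:00 hello",
    "老婆～✨ 2021-12-31 17:01:00 hi",
    "槐十 2021-12-31 17:02:00 happy new year",
    "a continuation line with no sender prefix",
]
result = count_by_sender(sample, ["槐十", "老婆～✨"])
print(result["槐十"], result["老婆～✨"])
```

An open file object can be passed as `lines` directly, since iterating over a text file yields its lines one by one.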

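Then I moved on to analyzing the chat content. I used the jieba library for Chinese word segmentation, then counted word frequencies and drew a word cloud: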

# -*- coding: utf-8 -*-
# @Author  : Hugo Wang
# @Time    : 2021/12/31 17:23
# @IDE     : PyCharm
# @Function:

import jieba
import collections
import re
from wordcloud import WordCloud
import matplotlib.pyplot as plt


with open('***.txt', encoding='utf-8') as f:  # ***.txt is the chat history text file
    data = f.read()

new_data = re.findall('[\u4e00-\u9fa5]+', data, re.S)
new_data = " ".join(new_data)

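# Segment the text into words
# Note: cut_all=True is jieba's full mode, which returns every possible word, including overlapping ones
seg_list_exact = jieba.cut(new_data, cut_all=True)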

result_list = []
with open('stop_words.txt', encoding='utf-8') as f:
    con = f.readlines()
    stop_words = set()
    for i in con:
        i = i.replace("\n", "")   # strip the trailing \n from each line read
        stop_words.add(i)

for word in seg_list_exact:
    # Drop stop words and single-character words
    if word not in stop_words and len(word) > 1:
        result_list.append(word)
print(result_list)

# Count word frequencies after filtering
word_counts = collections.Counter(result_list)
# Get the 200 most frequent words
word_counts_top200 = word_counts.most_common(200)
print(word_counts_top200)

# Draw the word cloud
my_cloud = WordCloud(
    background_color='white',
    width=900, height=600,
    # max_words=100,
    font_path='WenQuanWeiMiHei-1.ttf',
    # max_font_size=99,
    # min_font_size=16,
    # random_state=50
).generate_from_frequencies(word_counts)

# Show the generated word cloud image
plt.imshow(my_cloud, interpolation='bilinear')
# Hide the axes in the word cloud figure
plt.axis('off')
plt.show()
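
The filtering and counting steps can be tried on their own, without jieba or wordcloud installed. The sketch below takes an already segmented token list (made up here), applies the same rule as above (drop stop words and single-character words), and returns the top-n frequencies as a dict, which is the kind of mapping that generate_from_frequencies accepts. The helper name top_frequencies is hypothetical.

```python
from collections import Counter


def top_frequencies(tokens, stop_words, n):
    """Return the n most frequent tokens as a {word: count} dict.

    Stop words and single-character tokens are dropped first.
    """
    kept = [w for w in tokens if w not in stop_words and len(w) > 1]
    return dict(Counter(kept).most_common(n))


# Made-up tokens standing in for the output of a segmenter
tokens = ["老婆", "今天", "的", "老婆", "吃饭", "然后", "今天", "老婆", "好"]
stop_words = {"然后"}

freqs = top_frequencies(tokens, stop_words, 2)
print(freqs)
```

Limiting the dict to the top n before drawing keeps the printed list and the picture in sync, whereas passing the full counter leaves the cut-off to the word cloud's own max_words setting.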

Results

Messages I sent: 49668
Messages she sent: 41673

(Isn't this what an evenly matched love looks like?)

Word cloud:

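It seems to have given a few things away. Looking at it, I can practically remember what we talked about every day. Also, I really am a wife-comes-first kind of guy (doge)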

The results still contain a lot of conjunctions and modal particles, which can be added to the stop words.
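
One possible way to do that, sketched under assumptions: leave the stop-word file as it is and merge in an extra set of filler words at load time. The function name load_stop_words and the extra words listed are hypothetical examples, not the list used for this post. The demo writes a temporary file so that it runs on its own.

```python
import os
import tempfile


def load_stop_words(path, extra=()):
    """Read one stop word per line from path and merge in extra words."""
    with open(path, encoding="utf-8") as f:
        words = {line.strip() for line in f if line.strip()}
    return words | set(extra)


# Demo with a temporary stop-word file (contents are made up)
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False
) as tmp:
    tmp.write("我们\n你们\n")
    tmp_path = tmp.name

stop_words = load_stop_words(tmp_path, extra={"然后", "但是", "哈哈"})
os.remove(tmp_path)

tokens = ["然后", "我们", "吃饭", "哈哈", "电影"]
filtered = [w for w in tokens if w not in stop_words]
print(filtered)
```

Keeping the extra words in code makes it easy to experiment; once the list settles, they can be appended to stop_words.txt instead.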