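Capturing video data from a Douyin account with a customized whistle plugin

Today I was asked to pull the like, comment, and share counts for every video on our company's Douyin account, presumably because the end of the year is approaching.

Capture workflow

  1. Start the proxy tool.

  2. Configure the proxy on the phone.

  3. Open the Douyin app, go to the target account, and start capturing. Keep swiping up to load more data until every video has been listed.

  4. Capture complete.

Tool preparation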

$ npm i -g whistle whistle.autosave mysql --registry=https://registry.npmmirror.com

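# Switch the registry to the Taobao mirror (npmmirror)
$ npm config set registry https://registry.npmmirror.com
# Switch the registry back to the official source
$ npm config set registry https://registry.npmjs.org

  • whistle: captures network traffic.
  • whistle.autosave: an official whistle plugin that by default writes response data to a file. Here it is modified to keep only the required fields and store them in a database.
  • mysql: the MySQL client package for Node.js, used to write the data to the database.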

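Analyzing the captured data

Start the whistle capture tool with w2 run.

Before capturing anything, identify the URLs that carry the data and the response fields of interest.

Video list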
  • https://api3-core-c-hl.amemv.com/aweme/v1/aweme/post/
  • https://api5-core-c-hl.amemv.com/aweme/v1/aweme/post/
Video share URL
  • https://www.iesdouyin.com/share/video/${aweme_id}

Fields involved

resBody: the response body, in JSON format.

  • resBody.aweme_list[i].statistics.digg_count: number of likes
  • resBody.aweme_list[i].aweme_id: video ID
  • resBody.aweme_list[i].desc: video description
  • resBody.aweme_list[i].statistics.comment_count: number of comments
  • resBody.aweme_list[i].statistics.share_count: number of shares
  • resBody.aweme_list[i].create_time: publish time
  • resBody.aweme_list[i].duration: video duration
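
The field list above maps naturally onto one flat row per video. The following is a minimal sketch, assuming the response has exactly the shape listed above. `extractRows` and the row keys are hypothetical names chosen for illustration, and defaulting missing counters to 0 is an assumption, not something the API is known to require.

```javascript
// Hypothetical helper: flatten a parsed /aweme/v1/aweme/post/ response into rows.
function extractRows(resBody) {
  const list = (resBody && resBody.aweme_list) || [];
  return list.map((item) => {
    // Assumption: treat a missing statistics object or counter as 0.
    const stats = item.statistics || {};
    return {
      sid: item.aweme_id,
      name: item.desc,
      digg: stats.digg_count || 0,
      comment: stats.comment_count || 0,
      share: stats.share_count || 0,
      create_time: item.create_time,
      duration: item.duration,
      // Share URL pattern taken from the "Video share URL" entry above.
      share_url: `https://www.iesdouyin.com/share/video/${item.aweme_id}`,
    };
  });
}

// Usage with a made-up payload (the values are illustrative, not real data).
const sample = {
  aweme_list: [
    {
      aweme_id: '123',
      desc: 'demo',
      create_time: 1610000000,
      duration: 15000,
      statistics: { digg_count: 10, comment_count: 2, share_count: 1 },
    },
  ],
};
console.log(extractRows(sample));
```

Keeping the extraction in a pure function like this makes it possible to check the field mapping without a running proxy or database.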

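Adding a match rule

The analysis above shows that the host varies, while the rest of the URI follows a fixed format.

In a browser, open http://127.0.0.1:8899/#plugins and add the filter /aweme\/v1\/aweme\/post/. From then on, only requests that match the filter are passed to the whistle.autosave plugin.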
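
To see why a path-only pattern is enough, the sketch below runs the same regular expression against the URLs listed earlier. It only illustrates the regular expression itself, not how the plugin parses its filter text; `isPostListRequest` is a hypothetical helper name.

```javascript
// The filter pattern from the text; it ignores the host and matches on the path only.
const POST_LIST_PATTERN = /aweme\/v1\/aweme\/post/;

// Hypothetical helper, used here only to illustrate the match.
function isPostListRequest(url) {
  return POST_LIST_PATTERN.test(url);
}

// Both video-list hosts match; the share URL does not.
console.log(isPostListRequest('https://api3-core-c-hl.amemv.com/aweme/v1/aweme/post/')); // true
console.log(isPostListRequest('https://api5-core-c-hl.amemv.com/aweme/v1/aweme/post/')); // true
console.log(isPostListRequest('https://www.iesdouyin.com/share/video/123'));             // false
```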

Creating the table

CREATE TABLE `douyin` (
  `id` bigint unsigned NOT NULL AUTO_INCREMENT,
  `sid` bigint unsigned NOT NULL DEFAULT '0' COMMENT 'video ID',
  `name` varchar(1000) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL DEFAULT '' COMMENT 'video description',
  `digg` int unsigned NOT NULL DEFAULT '0' COMMENT 'number of likes',
  `comment` int unsigned NOT NULL DEFAULT '0' COMMENT 'number of comments',
  `share` int unsigned NOT NULL DEFAULT '0' COMMENT 'number of shares',
  `duration` int unsigned DEFAULT '0' COMMENT 'video duration',
  `create_time` int unsigned DEFAULT '0' COMMENT 'publish time',
  `update_time` int unsigned DEFAULT '0' COMMENT 'last update time',
  PRIMARY KEY (`id`) USING BTREE,
  UNIQUE KEY `id` (`sid`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci;

Modifying the whistle.autosave plugin

$ cd $(npm root -g)/whistle.autosave/lib
$ vim resStatsServer.js

Edit resStatsServer.js and change the code as follows:

const fs = require('fs');
const path = require('path');
const mysql = require('mysql');
const { check: checkFilter, update: updateFilter } = require('./filter');

// MAX_LENGTH, noop, timer and writeSessions are leftovers from the original
// plugin (file-based saving); they are not used by the database logic below.
const MAX_LENGTH = 10;
const noop = () => {};

// Connection settings for the local MySQL instance; adjust to your environment.
const connection = mysql.createConnection({
  host: '127.0.0.1',
  user: 'root',
  password: '1234',
  port: 3306,
  database: 'test',
  charset: 'UTF8MB4_BIN'
});

// Format a Date as "Y-M-D H:m:s" (no zero padding); used only for console output.
const formatDate = function (now) {
  const year = now.getFullYear();
  const month = now.getMonth() + 1;
  const date = now.getDate();
  const hour = now.getHours();
  const minute = now.getMinutes();
  const second = now.getSeconds();

  return year + "-" + month + "-" + date + " " + hour + ":" + minute + ":" + second;
};

connection.connect();

// Insert a new row, or refresh the counters when the video ID (sid) already exists.
const insertSql = 'INSERT INTO douyin(sid,name,digg,comment,share,create_time,duration,update_time) VALUES(?,?,?,?,?,?,?,?) ON DUPLICATE KEY UPDATE digg = ?, comment = ?, share = ?, create_time = ?, update_time = ?';

module.exports = (server, { storage }) => {
  let sessions = [];
  let timer;
  const writeSessions = (dir) => {
    try {
      const text = JSON.stringify(sessions.slice(), null, ' ');
      sessions = [];
      //dir = path.resolve(dir, `${Date.now()}.txt`);
      dir = path.resolve(dir, `浪涨小青岛.txt`);
      fs.writeFile(dir, text, (err) => {
        if (err) {
          // Retry once if the first write fails.
          fs.writeFile(dir, text, noop);
        }
      });
    } catch (e) {}
  };

  updateFilter(storage.getProperty('filterText'));
  server.on('request', (req) => {
    // Skip everything unless the plugin is active, a directory is configured,
    // and the URL matches the filter.
    const active = storage.getProperty('active');
    if (!active) {
      return;
    }
    const dir = storage.getProperty('sessionsDir');
    if (!dir || typeof dir !== 'string') {
      sessions = [];
      return;
    }
    if (!checkFilter(req.originalReq.url)) {
      return;
    }
    req.getSession((s) => {
      if (!s || !s.res) {
        return;
      }

      // The body may be empty or not valid JSON; skip such responses.
      let resBody;
      try {
        resBody = JSON.parse(s.res.body);
      } catch (e) {
        console.log('[PARSE ERROR]', e.message);
        return;
      }
      const list = (resBody && resBody.aweme_list) || [];
      // Current time as a Unix timestamp in seconds.
      const currentTime = Math.floor(Date.now() / 1000);

      for (let i = 0; i < list.length; i++) {
        const item = list[i];
        console.log(
          item.aweme_id,
          item.create_time,
          item.desc,
          item.statistics.digg_count,
          item.statistics.comment_count,
          item.statistics.share_count,
          // create_time is in seconds; Date expects milliseconds.
          formatDate(new Date(item.create_time * 1000))
        );

        const insertSqlData = [
          // insert
          item.aweme_id,
          item.desc,
          item.statistics.digg_count,
          item.statistics.comment_count,
          item.statistics.share_count,
          item.create_time,
          item.duration,
          currentTime,
          // update
          item.statistics.digg_count,
          item.statistics.comment_count,
          item.statistics.share_count,
          item.create_time,
          currentTime,
        ];
        connection.query(insertSql, insertSqlData, function (err, result) {
          if (err) {
            console.log('[SQL ERROR]', err.message);
            return;
          }

          console.log('INSERT OR UPDATE:', result.insertId, result.affectedRows);
        });
      }
    });
  });
};
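
The statement above passes the counters twice, once for the insert and once for the update. As an alternative sketch (not the approach used in this post), MySQL's `VALUES(col)` can refer back to the inserted value inside `ON DUPLICATE KEY UPDATE`, so each value is passed only once. `buildUpsert`, `COLUMNS` and `UPDATED` are hypothetical names; note that `VALUES()` in this position is deprecated in recent MySQL 8.0 releases, although it still works.

```javascript
// Column order shared by the INSERT column list and the parameter array.
const COLUMNS = ['sid', 'name', 'digg', 'comment', 'share', 'create_time', 'duration', 'update_time'];
// Columns refreshed when the video ID (sid) already exists.
const UPDATED = ['digg', 'comment', 'share', 'create_time', 'update_time'];

// Wrap an identifier in backticks so reserved-looking names such as `comment` are safe.
const quote = (name) => '`' + name + '`';

// Hypothetical helper: build the SQL text and its parameter array from one row object.
function buildUpsert(row) {
  const columnList = COLUMNS.map(quote).join(',');
  const placeholders = COLUMNS.map(() => '?').join(',');
  const updates = UPDATED.map((c) => quote(c) + ' = VALUES(' + quote(c) + ')').join(', ');
  return {
    sql: 'INSERT INTO douyin(' + columnList + ') VALUES(' + placeholders + ') ON DUPLICATE KEY UPDATE ' + updates,
    params: COLUMNS.map((c) => row[c]),
  };
}

// Usage with made-up values; the result could be passed to connection.query(sql, params, callback).
const upsert = buildUpsert({
  sid: '123', name: 'demo', digg: 10, comment: 2, share: 1,
  create_time: 1610000000, duration: 15000, update_time: 1610400000,
});
console.log(upsert.sql);
console.log(upsert.params);
```

Deriving both the placeholders and the parameters from one column list keeps the two in step, which is the main thing that can go wrong with the hand-written thirteen-parameter version.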

Starting the capture

$ w2 run

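Chanmama (蝉妈妈)

A colleague had done something similar before, but worked from data that had already been processed by a third-party platform. I asked for a copy of that data to compare the third-party figures against the source data:

  1. Check whether Chanmama's 90-day data is missing any videos
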
SELECT
  *
FROM
  douyin
WHERE
  -- start of the 90-day window counted back from today (2021-01-12): (2020-10-14 00:00:00)
  sid NOT IN ( SELECT aweme_id FROM dylist WHERE NAME = 'xxx' AND `publish_time` > 1602604800 )
  -- publish time of the last video in today's Chanmama capture: (2021-01-12 05:33:55)
  AND create_time <= 1610400835 AND create_time > 1602604800
ORDER BY
  create_time DESC;

-- 184 videos are missing in total; the most recent missing one was published at 2020-12-17 09:30:48
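
The two Unix timestamps in the query can be reproduced from the dates in its comments. A small sketch, assuming those dates are in UTC+8 (China Standard Time); `toUnixSeconds` and `windowStart` are hypothetical helper names.

```javascript
// Assumption: the dates quoted in the SQL comments are in UTC+8.
const UTC_OFFSET_HOURS = 8;
const SECONDS_PER_DAY = 86400;

// Convert "YYYY-MM-DD HH:mm:ss" (interpreted as UTC+8) to Unix seconds.
function toUnixSeconds(text) {
  const [datePart, timePart = '00:00:00'] = text.split(' ');
  const [year, month, day] = datePart.split('-').map(Number);
  const [hour, minute, second] = timePart.split(':').map(Number);
  // Date.UTC normalizes out-of-range hours, so subtracting the offset is safe.
  return Date.UTC(year, month - 1, day, hour - UTC_OFFSET_HOURS, minute, second) / 1000;
}

// Midnight (UTC+8) `days` days before the given "YYYY-MM-DD" date, in Unix seconds.
function windowStart(today, days) {
  return toUnixSeconds(today + ' 00:00:00') - days * SECONDS_PER_DAY;
}

console.log(windowStart('2021-01-12', 90));          // 1602604800
console.log(toUnixSeconds('2021-01-12 05:33:55'));   // 1610400835
```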


  2. Compute the differences

-- select count(*) from dylist where name = 'xxx' and `publish_time` > 1602604800;
-- 4419
-- select count(*) from douyin where create_time <= 1610400835 AND create_time > 1602604800;
-- 4603

SELECT
  a.NAME,
  a.digg AS "likes (A)",
  a.`comment` AS "comments (A)",
  a.SHARE AS "shares (A)",
  b.digg_count AS "likes (B, Chanmama)",
  b.comment_count AS "comments (B, Chanmama)",
  b.share_count AS "shares (B, Chanmama)",
  ( CONVERT ( a.digg, SIGNED ) - CONVERT ( b.digg_count, SIGNED ) ) AS "like diff (A-B)",
  ( CONVERT ( a.comment, SIGNED ) - CONVERT ( b.comment_count, SIGNED ) ) AS "comment diff (A-B)",
  ( CONVERT ( a.share, SIGNED ) - CONVERT ( b.share_count, SIGNED ) ) AS "share diff (A-B)",
  FROM_UNIXTIME( a.create_time, '%Y-%m-%d %H' ) AS "published (hour)"
FROM
  douyin AS a
  RIGHT JOIN dylist AS b ON a.sid = b.aweme_id
WHERE
  b.NAME = 'xxx'
  -- publish time of the last video in today's Chanmama capture: (2021-01-12 05:33:55)
  AND a.create_time <= 1610400835
  AND a.create_time > 1602604800
  -- start of the 90-day window counted back from today (2021-01-12): (2020-10-14 00:00:00)
  AND b.`publish_time` > 1602604800
ORDER BY a.create_time DESC;
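
The same comparison can be sketched outside the database, which is handy for spot checks on exported rows. This is a minimal illustration, assuming rows shaped like the `douyin` and `dylist` columns used in the queries above; `compareCaptures` is a hypothetical name and the sample values are made up.

```javascript
// Hypothetical helper: compare our capture (douyin rows) with the third-party rows (dylist).
function compareCaptures(ours, theirs) {
  // Index the third-party rows by video ID; IDs are compared as strings.
  const theirsById = new Map(theirs.map((row) => [String(row.aweme_id), row]));
  const missing = []; // video IDs we captured that the third party does not have
  const diffs = [];   // per-video counter differences (ours minus theirs)

  for (const a of ours) {
    const b = theirsById.get(String(a.sid));
    if (!b) {
      missing.push(a.sid);
      continue;
    }
    diffs.push({
      sid: a.sid,
      digg: a.digg - b.digg_count,
      comment: a.comment - b.comment_count,
      share: a.share - b.share_count,
    });
  }
  return { missing, diffs };
}

// Usage with made-up rows.
const comparison = compareCaptures(
  [
    { sid: '1', digg: 100, comment: 10, share: 5 },
    { sid: '2', digg: 50, comment: 3, share: 1 },
  ],
  [{ aweme_id: '1', digg_count: 90, comment_count: 10, share_count: 7 }]
);
console.log(comparison);
```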
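
Summary
  1. The third-party platform does not only cover videos from the last 90 days. It may simply have recorded each video once and displayed the data it had at that point; when my colleague pulled the data it appears to have run a single update pass, without refreshing the videos whose numbers had since diverged.
  2. The third-party platform is missing videos.
  3. The third-party platform may no longer process older videos, which makes its like and comment counts inaccurate.

Afterword

Some videos can be fetched directly by URL, but only for a limited time (after a while the URL stops returning data).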

  • https://api3-core-c-hl.amemv.com/aweme/v1/aweme/post/?page_from=1&user_id=3293759164133159&publish_video_strategy_type=2

  • https://api3-core-c-hl.amemv.com/aweme/v1/aweme/post/?page_from=2&user_id=68310389333&publish_video_strategy_type=2

  • https://api3-core-c-hl.amemv.com/aweme/v1/aweme/post/?count=21&publish_video_strategy_type=2&page_from=2&user_id=566969479994331
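
The URLs above share a small set of query parameters. The sketch below assembles such a URL from its parts. The parameter names are copied from the URLs above, but their exact meaning is not documented here and is an assumption; `buildPostListUrl` is a hypothetical helper, and the request may need additional headers or parameters that were part of the original capture.

```javascript
// Hypothetical helper: assemble a video-list URL from the parameters seen above.
function buildPostListUrl({ host = 'api3-core-c-hl.amemv.com', userId, pageFrom = 1, count }) {
  const params = new URLSearchParams();
  params.set('page_from', String(pageFrom));
  params.set('user_id', String(userId));
  // Value copied as-is from the captured URLs; its meaning is unknown.
  params.set('publish_video_strategy_type', '2');
  if (count !== undefined) {
    params.set('count', String(count));
  }
  return `https://${host}/aweme/v1/aweme/post/?${params.toString()}`;
}

// Pass the user ID as a string to avoid any precision issues with large IDs.
console.log(buildPostListUrl({ userId: '68310389333', pageFrom: 2 }));
```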

References