网站首页 > 技术教程正文

文本分析了4000万条Stack Overflow讨论帖，这些是程序员最推荐的编程书(附代码)

xnh888 2024-09-22 17:28:24 技术教程 51 ℃ 0 评论

大数据文摘作品

作者：Vlad Wetzel

编译：xixi、吴双、Aileen

程序员们都看什么书？他们会向别人推荐哪些书？

本文作者分析了Stack Overflow上的4000万条问答，找出了程序员们最常讨论的书，同时非常慷慨地公开了数据分析代码。让我们来看看作者是怎么说的吧。

寻找下一本值得读的编程书是一件很难，而且有风险的事情。

作为一个开发者，你的时间是很宝贵的，而看书会花费大量的时间。这时间其实你本可以用来去编程，或者是去休息，但你却决定将其用来读书以提高自己的能力。

所以，你应该选择读哪本书呢？我和同事们经常讨论看书的问题，我发现我们对于书的看法相差很远。

幸运的是，Stack Exchange（程序员最常用的IT技术问答网站Stack Overflow的母公司）发布了他们的问答数据。用这些数据，我找出了Stack Overflow上4000万条问答里，被讨论最多的编程书籍，一共5720本。

在这篇文章里，我将详细介绍数据获取及分析过程，附有代码。

“被推荐次数最多的书是Working Effectively with Legacy Code《修改代码的艺术》，其次是Design Pattern: Elements of Reusable Object-Oriented Software《设计模式：可复用面向对象软件的基础》。

虽然它们的名字听起来枯燥无味，但内容的质量还是很高的。你可以在每种标签下将这些书依据推荐量排序，如JavaScript, C, Graphics等等。这显然不是书籍推荐的终极方案，但是如果你准备开始编程或者提升你的知识，这是一个很好的开端。”
——来自Lifehacker.com的评论

获取和输入数据

我从archive.org抓取了Stack Exchange的数据。（https://archive.org/details/stackexchange）

从最开始我就意识到用最常用的方式（如 myxml := pg_read_file(‘path/to/my_file.xml’)）输入48GB的XML文件到一个新建立的数据库（PostgreSQL）是不可能的，因为我没有48GB的RAM在我的服务器上，所以我决定用SAX程序。

所有的值都被储存在<row>这个标签之间，我用Python来提取这些值：

def startElement(self, name, attributes): 
if name == ‘row’: 
self.cur.execute(“INSERT INTO posts (Id, Post_Type_Id, Parent_Id, Accepted_Answer_Id, Creation_Date, Score, View_Count, Body, Owner_User_Id, Last_Editor_User_Id, Last_Editor_Display_Name, Last_Edit_Date, Last_Activity_Date, Community_Owned_Date, Closed_Date, Title, Tags, Answer_Count, Comment_Count, Favorite_Count) VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)”, 
( 
(attributes[‘Id’] if ‘Id’ in attributes else None), 
(attributes[‘PostTypeId’] if ‘PostTypeId’ in attributes else None), 
(attributes[‘ParentID’] if ‘ParentID’ in attributes else None), 
(attributes[‘AcceptedAnswerId’] if ‘AcceptedAnswerId’ in attributes else None), 
(attributes[‘CreationDate’] if ‘CreationDate’ in attributes else None), 
(attributes[‘Score’] if ‘Score’ in attributes else None), 
(attributes[‘ViewCount’] if ‘ViewCount’ in attributes else None), 
(attributes[‘Body’] if ‘Body’ in attributes else None), 
(attributes[‘OwnerUserId’] if ‘OwnerUserId’ in attributes else None), 
(attributes[‘LastEditorUserId’] if ‘LastEditorUserId’ in attributes else None), 
(attributes[‘LastEditorDisplayName’] if ‘LastEditorDisplayName’ in attributes else None), 
(attributes[‘LastEditDate’] if ‘LastEditDate’ in attributes else None), 
(attributes[‘LastActivityDate’] if ‘LastActivityDate’ in attributes else None), 
(attributes[‘CommunityOwnedDate’] if ‘CommunityOwnedDate’ in attributes else None), 
(attributes[‘ClosedDate’] if ‘ClosedDate’ in attributes else None), 
(attributes[‘Title’] if ‘Title’ in attributes else None), 
(attributes[‘Tags’] if ‘Tags’ in attributes else None), 
(attributes[‘AnswerCount’] if ‘AnswerCount’ in attributes else None), 
(attributes[‘CommentCount’] if ‘CommentCount’ in attributes else None), 
(attributes[‘FavoriteCount’] if ‘FavoriteCount’ in attributes else None) 
) 
);

在数据输入进行了三天之后（有将近一半的XML在这段时间内已经被导入了），我发现我犯了一个错误：我把“ParentId”写成了“ParentID”。

但这个时候，我不想再多等一周，所以把处理器从AMD E-350 (2 x 1.35GHz)换成了Intel G2020 (2 x 2.90GHz)，但这并没能加速进度。

下一个决定——批量输入：

class docHandler(xml.sax.ContentHandler): 
def __init__(self, cusor): 
self.cusor = cusor; 
self.queue = 0; 
self.output = StringIO; 
def startElement(self, name, attributes): 
if name == ‘row’: 
self.output.write( 
attributes[‘Id’] + '\t` + 
(attributes[‘PostTypeId’] if ‘PostTypeId’ in attributes else '\\N') + '\t' + 
(attributes[‘ParentId’] if ‘ParentId’ in attributes else '\\N') + '\t' + 
(attributes[‘AcceptedAnswerId’] if ‘AcceptedAnswerId’ in attributes else '\\N') + '\t' + 
(attributes[‘CreationDate’] if ‘CreationDate’ in attributes else '\\N') + '\t' + 
(attributes[‘Score’] if ‘Score’ in attributes else '\\N') + '\t' + 
(attributes[‘ViewCount’] if ‘ViewCount’ in attributes else '\\N') + '\t' + 
(attributes[‘Body’].replace('\\', '\\\\').replace('\n', '\\\n').replace('\r', '\\\r').replace('\t', '\\\t') if ‘Body’ in attributes else '\\N') + '\t' + 
(attributes[‘OwnerUserId’] if ‘OwnerUserId’ in attributes else '\\N') + '\t' + 
(attributes[‘LastEditorUserId’] if ‘LastEditorUserId’ in attributes else '\\N') + '\t' + 
(attributes[‘LastEditorDisplayName’].replace('\n', '\\n') if ‘LastEditorDisplayName’ in attributes else '\\N') + '\t' + 
(attributes[‘LastEditDate’] if ‘LastEditDate’ in attributes else '\\N') + '\t' + 
(attributes[‘LastActivityDate’] if ‘LastActivityDate’ in attributes else '\\N') + '\t' + 
(attributes[‘CommunityOwnedDate’] if ‘CommunityOwnedDate’ in attributes else '\\N') + '\t' + 
(attributes[‘ClosedDate’] if ‘ClosedDate’ in attributes else '\\N') + '\t' + 
(attributes[‘Title’].replace('\\', '\\\\').replace('\n', '\\\n').replace('\r', '\\\r').replace('\t', '\\\t') if ‘Title’ in attributes else '\\N') + '\t' + 
(attributes[‘Tags’].replace('\n', '\\n') if ‘Tags’ in attributes else '\\N') + '\t' + 
(attributes[‘AnswerCount’] if ‘AnswerCount’ in attributes else '\\N') + '\t' + 
(attributes[‘CommentCount’] if ‘CommentCount’ in attributes else '\\N') + '\t' + 
(attributes[‘FavoriteCount’] if ‘FavoriteCount’ in attributes else '\\N') + '\n' 
); 
self.

StringIO让你可以用一个文件作为变量来执行copy_from这个函数，这个函数可以执行COPY（复制）命令。用这个方法，执行所有的输入过程只需要一个晚上。

好，是时候创建索引了。理论上，GiST Indexes会比GIN慢，但它占用更少的空间，所以我决定用GiST。又过了一天，我得到了70GB的加了索引的数据。

在试了一些测试语句后，我发现处理它们会花费大量的时间。至于原因，是因为Disk IO需要等待。使用SSD GOODRAM C40 120Gb会有很大提升，尽管它并不是目前最快的SSD。

我创建了一组新的PostgreSQL族群：

initdb -D /media/ssd/postgresq/data

然后确认改变路径到我的config服务器（我之前用Manjaro OS）：

vim /usr/lib/systemd/system/postgresql.service

Environment=PGROOT=/media/ssd/postgres

PIDFile=/media/ssd/postgres/data/postmaster.pid

重新加载config并且启动postgreSQL：

systemctl daemon-reload postgresql systemctl startpostgresql

这次输入数据用了几个小时，但我用了GIN（来添加索引）。索引在SSD上占用了20GB的空间，但是简单的查询仅花费不到一分钟的时间。

从数据库提取书籍

数据全部输入之后，我开始查找提到这些书的帖子，然后通过SQL把它们复制到另一张表：

CREATE TABLE books_posts AS SELECT * FROM posts WHERE body LIKE ‘%book%’”;

下一步是找的对应帖子的连接：

CREATE TABLE http_books AS SELECT * posts WHERE body LIKE ‘%http%’”;

但这时候我发现StakOverflow代理的所有链接都如下所示:

rads.stackowerflow.com/[$isbn]/

于是，我建立了另一个表来保存这些连接和帖子：

CREATE TABLE rads_posts AS SELECT * FROM posts WHERE body LIKE ‘%http://rads.stackowerflow.com%'";

我使用常用的方式来提取所有的ISBN（国际标准书号），并通过下图方式提取StackOverflow的标签到另外一个表：

regexp_split_to_table

当我有了最受欢迎的标签并且做了统计后，我发现不同标签的前20本提及次数最多的书都比较相似。

我的下一步：改善标签。

方法是：在找到每个标签对应的前20本提及次数最多的书之后，排除掉之前已经处理过的书。因为这是一次性工作，我决定用PostgreSQL数组，编程语言如下：

SELECT * 
, ARRAY(SELECT UNNEST(isbns) EXCEPT SELECT UNNEST(to_exclude )) 
, ARRAY_UPPER(ARRAY(SELECT UNNEST(isbns) EXCEPT SELECT UNNEST(to_exclude )), 1) 
FROM ( 
SELECT * 
, ARRAY[‘isbn1’, ‘isbn2’, ‘isbn3’] AS to_exclude 
FROM ( 
SELECT 
tag 
, ARRAY_AGG(DISTINCT isbn) AS isbns 
, COUNT(DISTINCT isbn) 
FROM ( 
SELECT * 
FROM ( 
SELECT 
it.* 
, t.popularity 
FROM isbn_tags AS it 
LEFT OUTER JOIN isbns AS i on i.isbn = it.isbn 
LEFT OUTER JOIN tags AS t on t.tag = it.tag 
WHERE it.tag in ( 
SELECT tag 
FROM tags 
ORDER BY popularity DESC 
LIMIT 1 OFFSET 0 
) 
ORDER BY post_count DESC LIMIT 20 
) AS t1 
UNION ALL 
SELECT * 
FROM ( 
SELECT 
it.* 
, t.popularity 
FROM isbn_tags AS it 
LEFT OUTER JOIN isbns AS i on i.isbn = it.isbn 
LEFT OUTER JOIN tags AS t on t.tag = it.tag 
WHERE it.tag in ( 
SELECT tag 
FROM tags 
ORDER BY popularity DESC 
LIMIT 1 OFFSET 1 
) 
ORDER BY post_count 
DESC LIMIT 20 
) AS t2 
UNION ALL 
SELECT * 
FROM ( 
SELECT 
it.* 
, t.popularity 
FROM isbn_tags AS it 
LEFT OUTER JOIN isbns AS i on i.isbn = it.isbn 
LEFT OUTER JOIN tags AS t on t.tag = it.tag 
WHERE it.tag in ( 
SELECT tag 
FROM tags 
ORDER BY popularity DESC 
LIMIT 1 OFFSET 2 
) 
ORDER BY post_count DESC 
LIMIT 20 
) AS t3 
... 
UNION ALL 
SELECT * 
FROM ( 
SELECT 
it.* 
, t.popularity 
FROM isbn_tags AS it 
LEFT OUTER JOIN isbns AS i on i.isbn = it.isbn 
LEFT OUTER JOIN tags AS t on t.tag = it.tag 
WHERE it.tag in ( 
SELECT tag 
FROM tags 
ORDER BY popularity DESC 
LIMIT 1 OFFSET 78 
) 
ORDER BY post_count DESC 
LIMIT 20 
) AS t79 
) AS tt 
GROUP BY tag 
ORDER BY max(popularity) DESC 
) AS ttt 
) AS tttt 
ORDER BY ARRAY_upper(ARRAY(SELECT UNNEST(arr) EXCEPT SELECT UNNEST(la)), 1) DESC;

既然已经有了所需要的数据，我开始着手建立网站。

建立网站

因为我不是一个网页开发人员，更不是一个网络用户界面专家，所以我决定创建一个基于默认主题的十分简单的单页面app。

我创建了“标签查找”的选项，然后提取最受欢迎的标签，使每次查找都可以点击相应选项来搜索。

我用长条图来可视化搜索结果。尝试了Hightcharts和D3（分别为两个JavaScript数据可视化图表库），但是他们只能起到展示作用，在用户响应方面还存在一些问题，而且配置起来很复杂。所以我决定用SVG创建自己的响应式图表，为了使图表可响应，必须针对不同的屏幕旋转方向对其进行重绘。

var w = $('#plot').width; 
var bars = "";var imgs = ""; 
var texts = ""; 
var rx = 10; 
var tx = 25; 
var max = Math.floor(w / 60); 
var maxPop = 0; 
for(var i =0; i < max; i ++){ 
if(i > books.length - 1 ){ 
break; 
} 
obj = books[i]; 
if(maxPop < Number(obj.pop)) { 
maxPop = Number(obj.pop); 
} 
} 
for(var i =0; i < max; i ++){ 
if(i > books.length - 1){ 
break; 
} 
obj = books[i]; 
h = Math.floor((180 / maxPop ) * obj.pop); 
dt = 0; 
if(('' + obj.pop + '').length == 1){ 
dt = 5; 
} 
if(('' + obj.pop + '').length == 3){ 
dt = -3; 
} 
var scrollTo = 'onclick="scrollTo(\''+ obj.id +'\'); return false;" "'; 
bars += '<rect id="rect'+ obj.id +'" class="cla" x="'+ rx +'" y="' + (180 - h + 30) + '" width="50" height="' + h + '" ' + scrollTo + '>'; 
bars += '<title>' + obj.name+ '</title>'; 
bars += '</rect>'; 
imgs += '<image height="70" x="'+ rx +'" y="220" href="img/ol/jpeg/' + obj.id + '.jpeg" onmouseout="unhoverbar('+ obj.id +');" onmouseover="hoverbar('+ obj.id +');" width="50" ' + scrollTo + '>'; 
imgs += '<title>' + obj.name+ '</title>'; 
imgs += '</image>'; 
texts += '<text x="'+ (tx + dt) +'" y="'+ (180 - h + 20) +'" class="bar-label" style="font-size: 16px;" ' + scrollTo + '>' + obj.pop + '</text>'; 
rx += 60; 
tx += 60; 
} 
$('#plot').html( 
' <svg width="100%" height="300" aria-labelledby="title desc" role="img">' 
+ ' <defs> ' 
+ ' <style type="text/css"><![CDATA[' 
+ ' .cla {' 
+ ' fill: #337ab7;' 
+ ' }' 
+ ' .cla:hover {' 
+ ' fill: #5bc0de;' 
+ ' }' 
+ ' ]]></style>' 
+ ' </defs>' 
+ ' <g class="bar">' 
+ bars 
+ ' </g>' 
+ ' <g class="bar-images">' 
+ imgs 
+ ' </g>' 
+ ' <g class="bar-text">' 
+ texts 
+ ' </g>' 
+ '</svg>');

网页服务失败

Nginx 还是 Apache？

当我发布了 dev-books.com这个网站之后，它有了大量的点击。而Apache却不能让超过500个访问者同时访问网站，于是我迅速部署并将网站服务器调整为Nginx。说实在的，我对于能有800个访问者同时访问这个网站感到非常惊喜！

如果你有任何问题，可以通过twitter（https://twitter.com/VLPLabs ）和Facebook（https://www.facebook.com/VLP-Labs-727090070789985/）联系作者。

原文链接：

https://medium.freecodecamp.org/i-analyzed-every-book-ever-mentioned-on-stack-overflow-here-are-the-most-popular-ones-eee0891f1786

【今日机器学习概念】

Have a Great Definition

志愿者介绍

上一篇： web技术进阶:学习高并发技术书籍推介，看完从入门到跑路
下一篇： Nginx缓存原理及机制（nginx304缓存）

网站首页 > 技术教程正文

文本分析了4000万条Stack Overflow讨论帖，这些是程序员最推荐的编程书(附代码)

猜你喜欢

本文暂时没有评论，来添加一个吧(●'◡'●)

取消回复欢迎你发表评论:

网站首页 > 技术教程 正文

文本分析了4000万条Stack Overflow讨论帖，这些是程序员最推荐的编程书(附代码)

猜你喜欢

本文暂时没有评论，来添加一个吧(●'◡'●)

取消回复欢迎 你 发表评论:

网站首页 > 技术教程正文

取消回复欢迎你发表评论: