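Big-data sentiment mining, told through charts. Start with a brand-image comparison of Hillary and Trump on social media over the past month:

Three takeaways:
1. Trump's buzz is more than double Hillary's; he is the center of the conversation (circle size indicates buzz).
2. The public's feelings about Trump run hotter than about Hillary, as shown by passion intensity on the Y axis.
3. Neither is well liked overall, and Trump is disliked more, as shown by Net Sentiment (a measure of praise versus criticism) on the X axis. Both sit below the freezing point; neither has a good image on social media.

To survey automatically how the trend and the two images shifted over the month, we can cut the data into two or three segments and watch one rise as the other falls. First, split in two:

See? Over the past month, as the presidential debates and the scandals were aired and publicized, Trump's media image deteriorated markedly: his circle moved from right to left (right on the X axis means a high rating, love/like; left means a low rating, hate/dislike). His rating started out clearly better than Hillary's and ended up worse. Meanwhile Hillary's social media image improved somewhat, her circle moving from left to right. Both stayed below the freezing point throughout, with more griping than praise, but just a month ago it was Hillary who was less welcome: not that the public liked Trump more, but that it loathed Hillary more.

This brand comparison chart conveys four dimensions of information:
1. net sentiment (rating): the X axis
2. passion intensity: the Y axis
3. buzz: the size of the circle
4. time: the two circles per candidate are a coarse cut along the time dimension

Conveying four dimensions on a two-dimensional page is indeed not easy. If the time dimension seems too coarse, split it in three:

Three circles each, with color depth indicating how recent the period is. What trend do we see when a short month is cut into three? Hillary's three circles run from the lower left to the upper right (the visualization design still falls short: different brands should get different colors). Hillary's rating was first good, then bad, and finally returned to the middle. Trump, over the same periods, was first in the middle, then slightly better, and finally fell into the abyss.

The above uses our own brand comparison chart (US-patented) to follow the candidates' rising and falling images.

Where does the social media data come from? Mostly Twitter:

Here is the sentiment summary for the month:

This really is big data: in a one-month random sample of social media, the two candidates have nearly 200 million mentions and a total of 3.6 trillion impressions. Trump takes 70 percent, Hillary only 30. Trump, like Bingbing, is a king of buzz.

Overall net sentiment: Trump -20%, Hillary -18%.

Selected social media data about Trump:

Bill Clinton disgraced the office with the very behavior you find appalling in Trump. In closing, yes, maybe Trump does suffer from a severe case of CWS. Instead, in this alternate NY Times universe, Trump’s campaign was falling apart. Russian media often praise Trump for his business acumen. This letter is the reason why Trump is so popular Trump won I'm proud of Trump for taking a stand for what's right. Kudos to Trump for speaking THE TRUTH! Trump won I’m glad I’m too tired to write Trump/Putin fuckfic. #trump won Trump is the reason Trump will lose this election. Trump is blamed for inciting violence. Breaking that system was the reason people wanted Trump. I hate Donald Trump for ruining my party. 32201754 Trump is literally blamed by Clinton supporters for being too friendly with Russia. Another heated moment came when Trump delivered an aside in reponse to a Clinton one-liner. @dka_gannongal I think Donald Trump is a hoax created by the Chinese....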
Skeptical_Inquirer The drawing makes Trump look too normal. I'm proud of Donald Trump for answering that honestly! Donald grossing me out with his mouth features @smerconish @realdonaldtrump Controlling his sniffles seems to have left Trump extraordinarily exhausted Trump all the way people trump trump trump Trump wins Think that posting crap on BB is making Trump look ridiculous. I was proud of Trump for making America great again tonight. MIL is FURIOUS at Trump for betraying her! @realdonaldTrump Trump Cartel Trump Cartel America is already great, thanks to President Obama. Kudos to Mr Trump for providing the jobs!! The main reason to vote for Trump is JOBS! Yes donal trump has angered many of us with his WORDS. Trump pissed off a lot of Canadians with his wall comments. Losing this election will make Trump the biggest loser the world has ever seen. Billy Bush's career is merely collateral damage caused by Trump's wrenching migration. So blame Donald for opening that door. The most important reason I am voting for Trump is Clinton is a crook. Trump has been criticized for being overly complimentary of Putin. Kudos to Trump for reaching out to Latinos with some Spanish. Those statements make Trump's latest moment even creepier. I'm mad at FBN for parroting the anti-Trump talking points. Kudos to Trump for ignoring Barack today @realDonaldTrump Trump has been criticized for being overly complimentary of Putin. OT How Donald Trump's rhetoric has turned his precious brand toxic via The Independent. It's these kinds of remarks that make Trump supporters look like incredible idiots. Trump is blamed for inciting ethnic tensions. Trump is the only reason the GOP is competitive in this race. Its why Republicans are furious at Trump for saying the voting process is rigged. Billy Bush’s career is merely collateral damage caused by Trump’s wrenching migration. Donald Trump is the dumbest, worst presidential candidate your country has EVER produced. 
I am so disappointed in Colby Keller for supporting Trump. Billy Bush’s career is merely collateral damage caused by Trump’s wrenching migration. In swing states, Trump continues to struggle. Trump wins Co-host Jedediah Bila agreed, saying that the move makes Trump look desperate. Trump wins Trump attacks Clinton for being bisexual! TRUMP win Pence also praised Trump for apologizing following the tape’s disclosure. In swing states, Trump continues to struggle. the reason Trump is so dangerous to the establishment is he is unapologetically alpha.

Selected social media sample data about Hillary:

Hillary deserves worse than jail. Congratulations to Hillary her campaign staff for wining three Presidential debates. I HATE @chicanochamberofcommerce FOR INTRODUCING THAT HILLARY GIF INTO MY LIFE As it turns out, Hillary creeped out a number of people with her grin. Hillary trumped Trump Trump won! Hillary lost Hillary violated the Special Access Program (SAP) for disclosing about the nuclear weapons!! I trust Flint water more than Hillary Hillary continued to baffle us with her bovine feces. NEUROLOGISTS HATE HILLARY FOR USING THIS TRADE SECRET DRUG!!!!... CONGRATULATIONS TO HILLARY CLINTON FOR WINNING THE PRESIDENCY Supreme Court: Hillary is our only choice for keeping LGBT rights. kudos to hillary for remaining sane, I'd have killed him by now How is he blaming Hillary for sexually assaulting women. He's such a shithead The only reason I'm voting for Hillary is that Donald is the only other choice Hillary creeps me out with that weird smirk. Hillary is annoying asf with all of her laughing I credit Hillary for the Cubs waking up When you listen to Hillary talk it is really stupid On the other hand, Hillary Clinton has a thorough knowledge by virtue of her tenure as Secretary of State. Americans deserve better than Hillary Certain family members are also upset with me for speaking out against Hillary.
Hillary is hated by all her security detail for being so abusive Hillary beat trump The only reason to vote for Hillary is she's a woman. Certain family members are also upset with me for speaking out against Hillary. I am glad you seem to be against Hillary as well Joe Pepe. Hillary scares me with her acions. Unfortunately Wikileaks is the monster created by Hillary democrats. I'm just glad you're down with evil Hillary. Hillary was not mad at Bill for what he did. She was mad he got caught. Just like she is not ashamed of what she did she is angry she got caught. These stories are falling apart like Hillary on 9/11 Iam so glad he is finally admitting this about Hillary Clinton. Why hate a man for doing nothing like Hillary Clinton Hillary molested me with a cigar while Bill watched. You are upset with Hillary for doing the same as all her predecessors. I feel like Hillary Clinton is God's punishment on America for its sins. Trumps beats Hillary You seem so proud of Hillary for laughing at rape victims. Of course Putin is going to hate Hillary for publicly announcing false accusations. Russia is pissed off at Hillary for blaming the for wikileaks! Hillary will not win. Good faith is stronger than evil. Trump wins🇺🇸 I am proud of Hillary for standing up for what is good in the USA. Hillarys plans are worse than Obama Hillary is the nightmare the people have created. Funny how the Hillary supporters are trashing Trump for saying the same thing. 🇺🇸🇺🇸🇺🇸🇺🇸🇺🇸🇺🇸 I am so proud of the USA for making Hillary Clinton president. Hillary, you're a hoax created by the Chinese Trump trumps Hillary During the debate, Trump praised Hillary for having the will to fight. Trump is better person than Hillary Donald TRUMPED Hillary Kudos to Hillary for her accomplishments. He also praised Hillary for handling the situation with dignity. During the debate, Trump praised Hillary for having the will to fight. People like Hillary in senate is the reason this country is going downhill. 
Hillary did worse than expectations. Trump will prosecute Hillary for her crimes, TRUMP will! Have to praise Hillary for keeping her focus. a landslide victory for Hillary will restore confidence in American democracy vindicated I was so proud of Hillary tonight for acting like a tough, independent woman. I dislike Hillary Clinton, as I think she is a corrupt, corporate shill. Hillary did worse than Timmy Kaine Im so glad he finally brought Benghazi against Hillary Hillary, thank you for confirmation that the Wikileaks documents are authentic and you did that tonight when you accused the Russians of hacking your servers! We the people deserve better than you! Supreme Court justices is the only reason why I'd vote for Hillary. Massive kudos to Hillary for keeping her cool with that beast behind her. Congrats to Hillary for actually answering the questions. She's spot on. #debate

【Related】 Big data mining shows clear social rating decline of Trump 【On sentiment mining】 Complete index of Morning Blossoms Gathered at Noon (《朝华午拾》) 【The "About" series on Liwei's NLP】 【Pinned: Index of Liwei's NLP posts】 【Liwei's NLP channel】
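The time-sliced comparison described in this post (cutting one month into two or three segments and tracking buzz and net sentiment per segment) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `slice_trend`, the `(timestamp, polarity)` input shape, and the toy data are all hypothetical, and net sentiment is taken to be (positive - negative) / (positive + negative), which the post does not spell out. Passion intensity is left out because the post gives no definition for it.

```python
from datetime import datetime, timedelta

def slice_trend(mentions, start, end, k):
    """Split [start, end) into k equal segments and summarize each one.

    mentions: iterable of (timestamp, polarity), polarity +1, -1 or 0.
    Returns one dict per segment with buzz (mention count) and net
    sentiment in percent (None when a segment has no polar mentions).
    """
    width = (end - start) / k
    buckets = [{"buzz": 0, "pos": 0, "neg": 0} for _ in range(k)]
    for ts, polarity in mentions:
        if not (start <= ts < end):
            continue  # outside the observation window
        i = min(int((ts - start) / width), k - 1)
        buckets[i]["buzz"] += 1
        if polarity > 0:
            buckets[i]["pos"] += 1
        elif polarity < 0:
            buckets[i]["neg"] += 1
    result = []
    for b in buckets:
        polar = b["pos"] + b["neg"]
        # Assumed definition of net sentiment, not confirmed by the post.
        net = 100.0 * (b["pos"] - b["neg"]) / polar if polar else None
        result.append({"buzz": b["buzz"], "net_sentiment": net})
    return result

# Toy data, invented for illustration only (day offset, polarity).
start = datetime(2016, 1, 1)
end = start + timedelta(days=30)
toy = [(start + timedelta(days=d), p)
       for d, p in [(1, 1), (2, 1), (3, -1), (20, -1), (25, -1), (28, 0), (40, 1)]]
halves = slice_trend(toy, start, end, 2)
thirds = slice_trend(toy, start, end, 3)
```

Each segment then corresponds to one circle in the chart: `buzz` would drive the circle size and `net_sentiment` its position on the X axis.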
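Living big data, shown live in real time. Especially for the policy debates during each party's presidential primaries, and for the several general-election debates between the two parties' candidates, real-time sentiment monitoring over social media big data (mainly Twitter) beats traditional polling by a wide margin: it reflects public opinion promptly, accurately and objectively, and its data points outnumber those of traditional polls by several orders of magnitude.

At the link below, click a candidate's portrait to watch the sentiment change moment by moment: http://bit.ly/1LiSXrg

#NBDebate This is our live social media monitoring for the debate. We did it before during the last election, and the results made remarkably good sense.

Did Obama win last night's debate? Automatic sentiment detection tells you: http://blog.sciencenet.cn/blog-362400-623922.html

Right now, at least in the real-time sentiment of the past hour, Hillary trails the other two Democratic candidates by far. Click the three candidates' portraits to see each one's sentiment index, net-sentiment, which reflects their popularity. http://www.netbase.com//democraticdebates2016/candidates_competitive_view.html

Sentiment index over the past hour (10/13/2015, 5pm):
Hillary: -22 http://www.netbase.com//democraticdebates2016/hillaryclinton_livepulse.html
Joe Biden: +39 http://www.netbase.com//democraticdebates2016/joebiden_livepulse.html
Bernie Sanders: +53 http://www.netbase.com//democraticdebates2016/berniesanders_livepulse.html

Minus 22 degrees; how did it get this bad? I had been hoping she would become the first woman president in US history, deepen socialist-style universal health insurance, and push immigration reform to make skilled immigration easier.

【Related posts】 Did Obama win last night's debate? Automatic sentiment detection tells you; Everyone may be wrong but NLP is not, and polls may be wrong but big data will not be (2015-10-15) 【Pinned: Index of Liwei's NLP posts on the ScienceNet blog (regularly updated)】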
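BIG DATA MINING. Our multilingual mining of global social media on Apple's recent iPhone 6 launch. The data are so big that they involve 18 million mentions in various languages and 1.7 million unique authors in a data set covering only the last month. Rarely in human history has one product launch generated so much interest and attention, with as many as 91 billion impressions (eyeball potential), in such a short period, despite no obvious revolution in features apart from the form factor (and perhaps Apple Pay). Truly amazing.

Now this is what you call big data: 18 million mentions across global social media in under a month, a potential-eyeball indicator of 91 billion (impressions is the upper bound on how many views the mentions could have reached), and more than 1.7 million people joining the discussion:

18,061,000 Mentions
91,012,338,159 Potential Impressions
46% Net Sentiment
1,509,698 Positive
561,523 Negative
1,737,978 Unique Authors

With this much data to mine, only a machine can do the job. Let's see what it digs up.

First the one-month trend: buzz peaked on September 7, and praise poured in, with net sentiment climbing from an already decent 27 all the way to 57 at present, averaging 46. Clearly a successful product launch.

Top keywords:

Top topics:

Now the emotions of the online public. Since the dawn of time, what product has caused such a stir on its first day on the market, across different peoples and languages worldwide, and drawn so much emotional commentary? Those who love it love it to death, those who are annoyed are furious, and the envious grind their teeth.

To buy or not to buy depends on the pocket. To recommend or to gripe: the battle lines are clear. Reportedly, because wealthy mainland buyers cannot wait, the iPhone 6 in Hong Kong has been bid up to sky-high prices. What they buy is not just a consumer product but a means of flaunting wealth and status.

Pros and cons, for anyone to judge:

Discussed on all five continents, the Arctic Ocean excepted:

Data distribution, with Twitter leading:

Multilingual sample comments:

【Related】 Will the Apple smartwatch revolutionize wearable devices?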
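The post reports 1,509,698 positive and 561,523 negative mentions next to a net sentiment of 46%. Those figures are consistent with net sentiment being the difference between positive and negative mentions divided by their sum. The sketch below checks that reading; the formula is inferred from the published numbers, not taken from any product documentation, and the function name `net_sentiment` is hypothetical.

```python
def net_sentiment(positive: int, negative: int) -> float:
    """Net sentiment in percent, assuming (pos - neg) / (pos + neg).

    The formula is an inference from the figures in the post, not a
    documented definition.
    """
    total = positive + negative
    if total == 0:
        return 0.0  # no polar mentions: treat as neutral
    return 100.0 * (positive - negative) / total

# Figures quoted in the iPhone 6 summary.
iphone6 = net_sentiment(1_509_698, 561_523)
print(round(iphone6))  # 46
```

Note that neutral mentions do not enter this ratio: only about 2.07 million of the 18 million mentions carry a polarity in the reported counts.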
I recently ran an automatic survey on Walmart with our own product. Overall the Walmart brand seems quite well liked: comments are mainly positive, and net sentiment reaches +48, which is quite good. There are accusations and complaints too, mainly about a few negative incidents (fox meat passed off as beef, quality certificates issued carelessly so that substandard products reached the shelves, and so on). Drilling down revealed something surprising: the kind words are mostly spontaneous comments from ordinary users, while the negative information that was mined comes almost entirely from reports by state news organizations (CCTV and others). Social media mining is meant to be an automatic opinion poll, a way to learn what customers think of brands and products. Formal news carries an element of institutional or state messaging and ought to be separated out. At present, however, that separation is not done well: news from influential traditional media is reposted over and over in social media and gets mixed in with public opinion.

Some further analysis and findings:

1. The existing data are not very large (400k mentions a year), but the results make sense and the data quality is decent.

2. From the geo stats, we know most of the data on Walmart come from China (dark color) rather than from overseas sources.

3. From the domain stats, the data include Sina Weibo ( weibo.com ) and Tencent Weibo ( t.qq.com ), although the data flow from these two important microblog sources is not stable at this point. The domain stats also show that the major domains are all from China. I know that Walmart is a very influential brand in China and has many stores in Chinese cities.

4. The net sentiment of 48% is fairly high, which is reflected in the emotions stats (data quality very good): emotional terms in big green font include 放心 (peace of mind), 喜欢 (like), 乐 (happy), 支持 / 推 (support), 很好 (very good), 不错 (not bad), 成功 (success), etc. The negative emotional words (in small red font) are few, including 差劲 (bad), 抱怨 (complain), 不喜欢 (dislike), 垃圾 (garbage), 很一般 (very so-so, meaning not as good as expected).

5. In the pros/cons word cloud, the likes include money-saving (省钱/便宜) and first-class service (服务一流); more interesting insights come from the dislikes, including (1) fake beef (the fox meat incident, 狐狸肉事件); (2) recall (召回, of some product?); (3) cheating (欺诈); (4) scandal (丑闻), etc.

6. To drill down and see which negative incidents led to the above dislikes, the Walmart_con_sample shows some related sound bites that look like negative news about specific incidents: the first sound bite reports CCTV news on Walmart's fake alcohol and fake meat (fox meat) incidents; the second reports fox meat passed off as beef and donkey meat, and chicken passed off as beef in burgers sold at its Sam's Club; the third reports three Walmart incidents at different times and its apologies, including cheap frozen meat passed off as organic green food, cheap fox meat passed off as beef, and a lack of quality control in importing low-quality products for sale, with 200 permits issued within 7 years for disqualified products to be put on the shelf.

7. Note that the above sound bites were selectively collected to show that our system can indeed capture detailed negative incidents about the brand in the media. When I drill down, there are quite a few duplicates in our sound bites (one piece of bad news gets reposted everywhere); moreover, the negative comments come mainly not from social media users but from news (state-run news that gets posted in social media too).

8. Unlike the overwhelmingly positive terms in the emotions word cloud and the summary, the behavior word cloud shows more, or bigger, negative behavior terms than positive ones. This is understandable given the heavily reported incidents shown in the sample sound bites above. Eye-catching negative behavior terms include "revealed" (被曝), "taken to court" / "being sued" (告上法庭), "closed" (关闭), "have to take off the shelf" (下架), etc.

9. From the above negative behavior terms, I drilled down to see more details in the sample sound bites below, which are similar to the sample discussed in 6. Both sound bites come from negative news about Walmart that originated in traditional news outlets and spread all over the Internet.
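Findings 7 and 9 describe one piece of news being reposted everywhere and mixing with genuine user comments. One simple way to approximate the separation is to collapse near-identical texts and treat heavily repeated ones as likely reposts. The sketch below is a heuristic of my own for illustration, not how the system described here works: `normalize`, `split_reposts`, the threshold of 3, and the toy texts are all hypothetical.

```python
import re
from collections import Counter

def normalize(text):
    """Collapse whitespace and case so trivially different reposts match."""
    return re.sub(r"\s+", " ", text).strip().lower()

def split_reposts(texts, threshold=3):
    """Separate heavily repeated texts (likely reposted news) from the rest.

    Returns (reposted, organic): a dict of normalized text -> repeat count
    for texts seen at least `threshold` times, and a list of the others.
    """
    counts = Counter(normalize(t) for t in texts)
    reposted = {t: n for t, n in counts.items() if n >= threshold}
    organic = [t for t, n in counts.items() if n < threshold]
    return reposted, organic

# Toy sound bites, invented for illustration only.
toy = [
    "Brand X recalled a product",
    "brand x  recalled a product",
    "Brand X recalled a product ",
    "I like shopping at Brand X",
    "Service at Brand X was great",
]
reposted, organic = split_reposts(toy)
```

Exact matching after normalization only catches verbatim reposts; news that is paraphrased or truncated when shared would need fuzzier matching, which is outside this sketch.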
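Negative coverage of American multinationals by Chinese news media has little to do with public opinion; it is often driven by the broader climate of international relations. Back when Google was being squeezed, Google Search was saddled with a trumped-up charge of lax policing of pornography, while the rampant and shockingly explicit pornography on domestic search, video and many other sites was ignored. Whoever wants to condemn will never lack a charge.

Nor is that all. I recently heard that because China and the US accuse each other of stealing intelligence over the network, relations in the IT industry have soured, to the point that Google, Apple and other companies are squeezed further in China, and even Google Scholar, a sharp information tool for scholarship, has been blocked. What a shame: when the city gate catches fire, the fish in the moat suffer.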
UPDATE: The time and place of Liwei's April Fools' Day talk in Beijing are now confirmed. Thanks to Prof. Sun of the Chinese Information Processing Society for the invitation and arrangements, and to the senior Prof. Dong Zhendong for his suggestion and recommendation:

The location is: Room 334, 3rd floor, building 5, Institute of Software, Chinese Academy of Sciences, No. Zhongguancun South 4th Street. 10:00~12:00. It is best to take the subway; the nearest station on Line 13 is 知春路.

Although I will be passing through Beijing on April 1, this is not an April Fools' joke :=). The exact venue and event details will be updated here once confirmed.

Sentiment Mining from Chinese Social Media in Big Data Age
by Wei Li, Ph.D. Computational Linguistics

In this information age of big data, social media such as WeiBo (Micro-Blog, or the Chinese Twitter) are more and more influential. The popularity of mobile devices such as smartphones makes it possible for anyone to share his or her observations, experiences, opinions and sentiments any time, anywhere, in social networks such as WeiXin (or WeChat). The social media big data from WeiBo, WeiXin, customer review sites, blogs and forums are like a gold mine of intelligence, yet to be mined. They are in the form of natural language (Chinese in this case) and contain intelligence on public opinions and consumer sentiments about any topics, brands and products. Automated sentiment mining via Natural Language Processing (NLP) is a must if we (or businesses) do not want to be overwhelmed by information overload.

Dr. Li's talk will present the design philosophy behind a sentiment mining system that he designed and led a team to develop. He will first discuss the value and scope of NLP in sentiment extraction and mining, the pros and cons of rule-based systems versus learning-based classification, and the different levels of sentiment mining that answer different information needs. He will then demonstrate a list of real-life Chinese social media hot topics as mined by the system to show the value and future of big data and NLP, in areas like automatic surveys and social media listening and monitoring for consumer insights.
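Chinese version of the title and abstract, in translation:

Sentiment Mining from Chinese Social Media in the Big Data Age
Dr. Wei Li

With the arrival of the big data age, social media (Weibo, for example) grow ever more influential. The spread of smartphones and other mobile devices lets ordinary people pass on what they see, think and feel any time, anywhere (via WeChat, for example). The social media big data of microblogs, WeChat, blogs and forums are like mountains of gold rich in intelligence, waiting to be mined. Facing big data, anyone who does not want to drown in the information explosion must use automatic means, above all natural language technology that can automatically extract and mine sentiment.

Dr. Li's talk is based on the customer sentiment extraction and mining system whose development he led. It has two parts. The first covers the scope of natural language technology in sentiment extraction, compares the pros and cons of statistical classification and rule-based systems, and lays out the levels of sentiment analysis. The second uses a series of hot social media topics as examples to show the value and prospects of big data mining.

Dear Prof, Li, ...... the title and abstract of your talk in Chinese or English. And a simple cv of you. How about 10:00~12:00am ?

About Dr. Li
A hands-on computational linguist with nearly 30 years of professional experience in Natural Language Processing (NLP), Dr. Li has a track record of making NLP work robustly. He has built three large-scale NLP systems, all transformed into real-life, globally distributed products. He is now Chief Scientist for a fast-growing Silicon Valley company that serves global Fortune 500 companies for consumer insights and social media monitoring.

【Related event: Academic talk in Taipei on Chinese grammatical analysis】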
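ABSTRACT The Brand Passion Index (BPI) is used to help us make an informed decision in our ongoing purchase of a new washer. Using our own product, we generated two BPIs, one comparing the major washer brands in the US market and the other comparing front loading with top loading. With the collective consumer insights in mind, we have narrowed our choices to a front-loading Maytag or LG. This is a live case of a big data win.

We recently decided to buy a new washer and dryer set: not the most expensive, but definitely the best. The boss of the household was fed up with cleaning the old washer (not that old, only two years in use) and declared that this time we would pick the brand carefully and not be fooled again. The old washer was an off-brand gift, not our own choice, so no brand selection was ever involved. Clothes did come out clean, but the problems were many. Besides being quite noisy, it had an intolerable flaw: the door gasket trapped grime, seemed to grow mold, and was hard to clean. Hence the thought: retire it and be done with it.

So I asked old friends. One strongly recommended Maytag, extolling its near-magical washing: quiet, water-saving, gentle on fabric, clothes coming out clean and bright, better than what the washerwomen of old beat out with their paddles. In the past, with so strong a recommendation from a trusted friend, I would simply have placed the order. Things are different now: in the information society it pays to listen and compare more in order to make an informed decision. A washer is neither big nor small and is a daily companion; basic functions are much alike across brands, but a wrong purchase is hard to return, and you end up putting up with it for n years, which is annoying.

As the saying goes, what you learn from books is shallow in the end; word of mouth is good, but the sample is too small, just one or two old friends. (A friend's recommendation can be weighted, one sentence counting for 10 or 100, but there are still millions of brand opinions out there; shouldn't one at least pool the intelligence?) What to do? Turn to big data.

Big data is full of word of mouth, scattered across social media. Talking is human nature, in every time and place, and homemakers lead the pack: whether it is love or complaint, keeping it in is unbearable. Before the information age, chatter was just chatter: words spoken were like water spilled, gone with the wind, evaporated and of no value. Not any more. With social media and smartphones, however trivial the chatter, whether on Weibo or Facebook, it is all archived and can be turned into valuable intelligence. The key is having the skill to mine it.

People cannot do it, but a machine with NLP (Natural Language Processing) can. Let's dig and see: a live demonstration of how the NLP I have studied (and practiced) all my life helps the boss with a decision still in progress, a living example of technology changing the world, and you and me.

Step one: the boss wants the overall social rating of the major washer brands and how they compare. No problem: the system backed by our proprietary NLP technology has this feature and can generate a multi-brand sentiment chart for any industry on demand. Feed in a few brands from the US market and the chart comes out.

Based on social media word of mouth mined at scale (data below), the chart above ranks washer brands in the US appliance market, showing three dimensions (attention, net sentiment and passion) side by side in a two-dimensional display. Clear at a glance, and rather easy on the eye too, isn't it?

What do we (you) see in the chart above?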
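If you say you see the power of science, you are a believer in scientism. If you say you see the power of technology, you are another Liwei. What the boss saw was neither science nor technology but a sudden insight: so that's how it is; keep going and dig further.

The boss said: no wonder our friend recommended Maytag. Look, Maytag sits at the far left of the sentiment chart, which shows that the customers who like it are real fans, much as Apple fans love their iPhones, so the brand must have something captivating. But Maytag's net sentiment is not the highest; LG is above it. LG's fans are less fervent than Maytag's, yet LG clearly leads in overall quality. That was a bit of a surprise.

The boss went on: even more surprising, I never expected Whirlpool's customer rating to be this low, almost down at GE's level. At the appliance center yesterday, the salesperson said Maytag is just Whirlpool, the same thing, both made by the Whirlpool company, yet the two stand in completely different places in customers' minds. So, as the first step of the decision, Whirlpool is out; we will definitely not buy it on the salesperson's recommendation. (I later checked the relationship between the two brands with my friend: Maytag is like the luxury Lexus made by Toyota, while Whirlpool is the mass-market Camry or the economy Corolla, not in the same league at all.)

The third insight: Kenmore, the brand most widely used by professional laundries in North America, is not rated highly either, only slightly above Samsung, so it is more or less out as well. The store clerk told us that Kenmore washers are actually made by LG (just as Maytag is made by Whirlpool), yet public sentiment clearly separates the two.

The preliminary decision is to choose between Maytag and LG. That calls for further evidence: feature details and how they are rated.

Well then, dig further. Our own tool costs nothing, eating my own dog food, so why not use it. That, of course, must wait for the next installment.

【Related posts】 Shopping strategy in the big data age: hunting for a washer (2); Shopping strategy in the big data age: hunting for a washer (3); The washer's view of East and West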
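Since the last elementary school massacre, gun control in the US has returned to the agenda and become a hot topic on social media.

A friend working on this subject asked me to help mine the voices of social media users and the statistics with our product. When the results came out, it turned out that opponents of gun control still outnumber supporters, which I found very disappointing.

Personally, I detest the flood of guns in America; life here has no sense of safety. Not long after the massacre, someone collected signatures online for a petition on the White House website calling for stronger gun control, and I gladly took part. After his re-election Obama made gun control and immigration reform priorities. He put Vice President Biden in charge of gun control, and the White House also hoped to use public opinion to get some control measures passed. So my inbox receives, from time to time, letters from the White House signed by the President and Vice President, urging us to speak up more loudly. Things seemed to be going well.

In reality, the road ahead is very, very long.

I used to think that the opposition came mainly from the National Rifle Association and gun makers and dealers. Now I find that quite a few ordinary Americans are skeptical of or opposed to gun control too. A widespread and specious view holds that guns don't kill people, people kill people: without guns, knives kill, stones kill, even fists kill. If that reasoning held, there would be no grounds at all for banning nuclear weapons. People are often not rational animals, and a highly efficient killing weapon in human hands is a time bomb.

America has many merits and is a fairly ideal country for immigrants, but the flood of guns is one of its few fatal flaws. One blot spoils a hundred beauties: on this point alone, I urge young people still dreaming the American dream to think twice before finally deciding to immigrate. In the cities of Japan and Singapore (and, as I remember it, in my homeland too), young women can be seen walking the streets late at night, even at midnight, without fear. In America that is unthinkable.

Enough; the more I say, the more upsetting it gets.

Let's face reality and look at the following data mined from English-language social media.

on gun control (date: 02/05/2013 18:04:58)

1. It has been talked about most in the last 2-3 months (the big debate triggered by the Connecticut tragedy, after which the President shed tears).
2. There was quite some discussion in July-August last year (presumably triggered by the Batman shooting).
3. It was not a hot topic at other times.

So let us first focus on the last 3 months.

Gun Control topic, 3-month summary:
1. mentions: 1,409,922
2. impressions: 938,597,694 (we call this social media reach, roughly the eyeballs on this topic)
3. comments: 1,006,548
4. net sentiment: -21% (more people dislike gun control than support it, a REAL surprise to me)
5. positive mentions: 40,876
6. negative mentions: 62,199

Word Clouds of Top Terms and Top Attributes on Gun Control

Main reasons for and against gun control, data sources, main authors, gender ratio, sample data

As you can see, the biggest reason given against gun control is that it infringes the rights of law-abiding citizens. (These are guaranteed by the Second Amendment. The constitution of the day reportedly feared government tyranny, so arms were to be kept among the people, who, when they could bear no more, could form militias and have the option of rising up in justified rebellion. It sounds almost like a Leninist design that absorbed the essence of Marxist-Leninist violent revolution. Today this reason is long out of step with the times. Anyone who believes America will see a tyranny that provokes violent revolution, and can be settled by one, is thinking like a Red Guard and has something wrong in the head. The most we get is an Occupy Wall Street movement, mainly peaceful petitioning.) Why play with guns for no reason? Even for hunting, today's society promotes animal protection, so there is nowhere to hunt anyway. The only presentable reasons are self-defense and deterrence. But is it not absurd for a society to rely on armed individuals for self-defense?

On the other reasons, the two sides mostly clash head-on. Supporters hold that gun control can effectively reduce vicious incidents (effective solution / work well / reduce crime and violence) and save lives (save life). Opponents insist that it is no solution at all (no solution / not solve anything / pointless / impossible / ineffective) and cannot reduce crime (not reduce crime / not stop gun violence / not lower gun death); some even say gun control would increase violent crime (increase violent crime), and of course some consider this flawed policy (flawed policy) illegal (illegal). How far these views are backed by data from experts who have studied the question seriously, I cannot tell. More often, where people sit decides what they think: they hold a view first and then read the data selectively.

The first sample sounds sarcastic and not really supportive of gun control per se. Not sure. Anyway, sarcasm remains difficult to decode (sometimes even humans have difficulty). The second is a popular voice against gun control.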
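【Brand Passion Index 3: international fast food brands in China market face challenges】

Chinese Social Media Mining: Brand Passion Index for the international fast food brands McDonald's, Pizza Hut, KFC and Yoshinoya in China. Fairly negative. The golden time when McDonald's and KFC first entered the China market is gone with the wind. In this country known for taste and delicacy, they face customers who are difficult to satisfy and severe competition from inexpensive Chinese food inside China.

From now on, the 【Social Media Mining】 column plans to publish, at least once a week, an automatic social media survey report built around the 【brand image chart】, picking widely discussed popular brands from different fields. The big data age has arrived, and social media affect our daily lives and the growth of businesses more and more, so deep mining of social media opinion and emotion with natural language technology is imperative. Otherwise businesses and customers alike will drown in the ocean of big data, like blind men feeling an elephant or a frog looking at the sky from the bottom of a well: businesses will find it hard to maintain their brand image, and consumers will be at a loss when choosing brands. This series also serves as a showcase for language technology.

Today's topic is the internationally known brands of the fast food industry. Let's see how their reputation and image fare in China.

The four fast food brands chosen are McDonald's, KFC, Pizza Hut and Yoshinoya. The chart above shows that KFC draws the most buzz, more than McDonald's. This contrasts sharply with the US, where KFC is no match for McDonald's, the aircraft carrier of fast food; KFC there is now a small potato that once nearly went bankrupt (things improved only after mergers, joint operation with Taco Bell, and its own reforms, which included adding quite tasty grilled chicken alongside the traditional, overly greasy fried chicken). Of the four, only Pizza Hut has a reasonably positive image, sitting on the midline of praise and criticism. The other three are all below the midline, meaning customers complain more than they praise. On the passion-intensity axis, McDonald's is right on the midline, meaning quite a few people detest it and curse it; the other two (KFC and Yoshinoya) also have a negative overall image, but the complaints are not intense. As for Pizza Hut, although its overall image is fine, it sits with Yoshinoya at the far left of the intensity axis, meaning neither its fans nor its critics feel strongly. Below is one of the sentiment word clouds, green for praise and red for criticism, with no great highs or lows:

Going further, what exactly do customers like and complain about? The top 15 reasons for praise and criticism of the first three brands are charted as follows:

Once upon a time, Western fast food chains, led by the international restaurant king McDonald's, marched into the China market one after another, and later Japan's Yoshinoya followed, hoping for a share. The country had just opened its doors, and people found things from the West and from Japan novel. Clean, hygienic, standardized and quick foreign fast food was hugely welcome, and for a while the restaurants were packed. I still remember, when KFC had just opened in Beijing, the joy of crowding in with the boss's whole family to feast on it. It felt like the most delicious chicken of my life (strangely, after coming to America I found KFC far less tasty than I remembered, and I keep suspecting the chicken here is not as good as back home). My mother-in-law ate happily and said: this chicken is no worse than Grandpa's (the boss's grandfather was a well-known Beijing chef who ran the kitchen at a ministry and often cooked for the minister). But China is, after all, the China of "A Bite of China": Chinese people are the pickiest and most particular about food. To take root and make money in food over the long run, competing with all kinds of inexpensive local meals and home-style dishes, is not easy. Foreign fast food, first, has no price advantage, and second, tastes too monotonous. The chart above also shows that ordinary people are more dissatisfied with these foreign restaurants than fond of them.

【Data source】 The data for this automatic poll come from one year of archived Chinese-language social media: 350 million documents in simplified Chinese. About 100 million forum posts come from Baidu (Tieba and others), over 20 million from Sohu, and 25 million from the Tianya forum.

【Liwei's motto: technology changes the world, data shapes life】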
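A chance system test revealed that Baidu is shadowed everywhere by the phrase "哪里有小姐" ("where to find a miss", a stock line in escort ads). The finding immediately caused an uproar among friends: some applauded (way to go, u r onto sth), some joked (Baidu's name does come from the verse "众里寻她千百度", searching for her in the crowd a thousand times), and some doubted (the results are not faked?). Conspiracy theorists emailed me, charging that this word cloud smacks of insulting Baidu.

I told an old friend: I have no conclusion. Any grumbling on my part is riding the occasion (a jab at the rumor that the "Prince Who Pacifies the West" (平西王) once squeezed Google out in the name of an anti-pornography drive, clearing the way for Baidu); it is not a serious "conclusion" and should not be relied on. But I do have data, and how to read the data is a matter of opinion. Finding the truth behind it will take a thorough investigation.

First the data: in the one year of social media surveyed, Baidu appears nearly 2.27 million times, and "哪里有小姐" co-occurs with it 500,000 times, making it the most strongly associated term (first among the top 100 co-occurring terms, share: 22%). This is the data behind the word cloud:

What is a word cloud? A word cloud displays the frequently occurring terms surfacing from a topic's text.

From one year to six months, three months, one month, one week, one day, the theme is always "小姐". Uncanny. Is it some kind of advertising on Baidu, so sticky that Baidu cannot shake it off? Is paid ranking to blame?

The six-month word cloud:

The three-month word cloud:

The one-month word cloud:

The one-week word cloud:

The one-day word cloud:

Now the results for "谷歌" (Google) over the same social media and the same one-year period. Google's total count is far lower than Baidu's, only 734,000, but still large enough to observe its associated terms.

Let us drill down: the truth about the Baidu "miss" is here. What kind of hand is pushing "小姐" and Baidu snapshots (百度快照) all over the place? (date: 12/14/2012 17:40:43)

Someone must have written a program that posts escort ads and their Baidu snapshots on all sorts of websites (pet websites included). Drilling down turns up many spam-like links; most are already dead when clicked, presumably deleted by site administrators. There are probably too many to delete.

Finally I searched Baidu directly for "哪里有小姐", and sure enough, it is the loudest advertising slogan in the land.

Previous post: Testing the well-known brand Baidu on social media, with a startling finding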
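The 22% share quoted above is consistent with dividing the co-occurrence count (about 500,000) by the topic's total mentions (about 2.27 million). A minimal sketch of that computation follows. Both the reading of "share" and everything in the code are assumptions for illustration: `cooccurrence_share`, the candidate-term list, and the toy posts are hypothetical, and plain substring matching stands in for whatever term extraction the real system uses.

```python
from collections import Counter

def cooccurrence_share(posts, topic, candidates):
    """Rank candidate terms by the share of topic posts they appear in.

    Substring matching is used, which also works for unsegmented
    Chinese text. Returns a list of (term, share) sorted by count.
    """
    topic_posts = [p for p in posts if topic in p]
    if not topic_posts:
        return []
    counts = Counter()
    for post in topic_posts:
        for term in candidates:
            if term in post:
                counts[term] += 1
    total = len(topic_posts)
    return [(term, n / total) for term, n in counts.most_common()]

# Toy posts, invented for illustration only.
posts = ["alpha beta", "alpha beta gamma", "alpha gamma",
         "alpha", "alpha beta", "delta beta"]
ranking = cooccurrence_share(posts, "alpha", ["beta", "gamma"])

# Sanity check against the rounded figures quoted in the post.
reported_share = round(100 * 500_000 / 2_270_000)
```

Running the same ranking over windows of different lengths (a year, a month, a day) would show whether the top term is stable over time, which is the observation the post makes.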
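Different social images and social media sentiments for Ma Yingjiu, the current President of Taiwan, and Chen Shuibian, the former President of Taiwan. The social media ratings differ, and the public images differ sharply; preliminary results of automatic social media analysis bring out the two men's distinct images and styles.

(1) Side-by-side charts of frequency analysis for high-frequency emotional words

(2) Side-by-side charts of frequency analysis for high-frequency words of praise and criticism

Related post: Study finds that Chinese speakers love irony: behind the praise hides a sneer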
After such a long delay, the first experiment with the Chinese system has finally begun (date: 09/06/2012 21:04:35)

Developing the core system was supposed to be the hardest and most time-consuming part, yet in real life the purely technical and operational pieces, such as engineering architecture, storage, and getting hold of content, often turn out to be the bottleneck. Strange, and yet not strange.

This experiment has only six months of data, from overseas Twitter and from sources such as Baidu Tieba and the Tianya forum, but the resulting analysis is quite interesting.

I did a test comparing Google and Baidu in a side-by-side view of likes, dislikes, net sentiment, sources, etc. The results make sense, even with such limited data. To summarize the different opinions about these two search giants in Chinese social media:

1. Google's net sentiment is very high, around 70, while Baidu's is only 35: Google's social rating is fully double Baidu's!

2. The most striking likes for Google are Cooperative, Innovation, Updated, Optimized and Robust. The likes for Baidu are optimized, updated, and new. The dislikes of Google are Monopoly, abandoning Android, and cannot open it (which is in fact not Google's problem but that of China's Great Firewall). The dislikes of Baidu are unstable, drop, and misleading. There are also a few obvious bugs, such as "very easy" being misclassified as a dislike.
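A quick skim of the last six quarters (six issues) of Government Information Quarterly (GIQ): 3 of the 12 articles in the latest issue discuss the use of social media in e-government services; about one fifth of the nearly 80 articles across the six issues discuss how institutional issues affect e-government; and nearly half of those 80 articles discuss e-government services.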