neohua's personal blog http://blog.sciencenet.cn/u/neohua


2014/1/17

3111 views | Posted 2014-1-17 16:55 | Personal category: Machine Learning | System category: Research Notes | Keywords: scholar

Problem 2: Convert list-format data into a matrix

head(city.state)

[[1]]

[1] " Iowa City" " IA"      


[[2]]

[1] " Milwaukee" " WI"      


[[3]]

[1] " Shelton" " WA"    


[[4]]

[1] " Columbia" " MO"      


[[5]]

[1] " Seattle" " WA"    


[[6]]

[1] " Brunswick County" " ND"    


location.matrix<-do.call(rbind,city.state)

head(location.matrix)

    [,1]                [,2]

[1,] " Iowa City"        " IA"

[2,] " Milwaukee"        " WI"

[3,] " Shelton"          " WA"

[4,] " Columbia"         " MO"

[5,] " Seattle"          " WA"

[6,] " Brunswick County" " ND"
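
As a minimal, self-contained sketch of the step above: `do.call(rbind, x)` passes every element of the list to `rbind` as a separate argument, so a list of length-2 character vectors becomes a two-column matrix. The toy list `toy.city.state` is made up for illustration, and the `trimws` call (base R 3.2.0 or later) is an optional extra that strips the leading spaces visible in the output above; it is not part of the original code.

```r
# Hypothetical toy input mimicking the structure of city.state
toy.city.state <- list(c(" Iowa City", " IA"),
                       c(" Milwaukee", " WI"),
                       c(" Shelton", " WA"))

# Stack the list elements row by row into a character matrix
toy.matrix <- do.call(rbind, toy.city.state)

# Optional (assumes R >= 3.2.0): strip leading/trailing whitespace;
# assigning into toy.matrix[] keeps the matrix dimensions
toy.matrix[] <- trimws(toy.matrix)

print(dim(toy.matrix))
print(toy.matrix[1, ])
```

This only works cleanly when every list element has the same length; elements of unequal length would be recycled by `rbind` with a warning.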


Problem 3: Merge the processed data back into the original data

ufo<-transform(ufo,USCity=location.matrix[,1],USState=tolower(location.matrix[,2]),stringsAsFactors=FALSE)


Problem 4: Filter out records that fall outside the specified set of states

## Define a us.states vector to match the data against

us.states<-c("ak","al","ar","az","ca","co","ct","de","fl","ga","hi","ia","id","il","in","ks","ky","la","ma","md","me","mi","mn","mo","ms","mt","nc","nd","ne","nh","nj","nm","nv","ny","oh","ok","or","pa","ri","sc","sd","tn","tx","ut","va","vt","wa","wi","wv","wy")

## match() returns the position of each value in us.states, or NA if there is no match

ufo$USState<-us.states[match(ufo$USState,us.states)]

## Set the city to NA for rows whose state did not match

ufo$USCity[is.na(ufo$USState)]<-NA

## Build a new data frame that keeps only rows with a matched state

ufo.us<-subset(ufo,!is.na(USState))
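
The match-and-filter steps above can be checked on a tiny example. This is a sketch on made-up data: the data frame `toy` and the two-element vector `valid` (a stand-in for the full `us.states` vector) are hypothetical.

```r
# Hypothetical toy data: one row carries a non-US code ("on")
toy <- data.frame(USCity = c("seattle", "toronto", "columbia"),
                  USState = c("wa", "on", "mo"),
                  stringsAsFactors = FALSE)
valid <- c("mo", "wa")   # stand-in for the full us.states vector

# match() gives the position in `valid`, or NA when there is no match;
# indexing `valid` with those positions yields the code itself or NA
toy$USState <- valid[match(toy$USState, valid)]

# Blank out the city wherever the state failed to match
toy$USCity[is.na(toy$USState)] <- NA

# Keep only the rows with a recognised state
toy.us <- subset(toy, !is.na(USState))
print(toy.us)
```

The "toronto" row is dropped, while the other two rows keep their original values.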


## Get an overall picture of the date data

summary(ufo.us$DateOccurred)


## Draw a histogram to inspect sighting frequency over time (requires ggplot2)

library(ggplot2)

quick.hist<-ggplot(ufo.us,aes(x=DateOccurred))+geom_histogram()

## Build a new data frame keeping only sightings from 1990 onward

ufo.us<-subset(ufo.us,ufo.us$DateOccurred>=as.Date("1990-01-01"))

nrow(ufo.us)

## Extract the year and month

ufo.us$YearMonth<-strftime(ufo.us$DateOccurred,format="%Y-%m")


Problem 5: How to count data by group

## plyr is a data aggregation library

library(plyr)

## Count UFO sightings for each state in each year-month

sightings.counts<-ddply(ufo.us,.(USState,YearMonth),nrow)
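
For readers without plyr installed, the same kind of per-group count can be sketched in base R with `aggregate`. This is a swapped-in alternative, not the method used in this post; the data frame `toy` and the helper column `n` are made up for illustration.

```r
# Hypothetical toy data with the same two grouping columns
toy <- data.frame(USState = c("wa", "wa", "wa", "mo"),
                  YearMonth = c("1990-01", "1990-01", "1990-02", "1990-01"),
                  stringsAsFactors = FALSE)

# Add a column of ones and sum it within each (state, month) group
toy$n <- 1L
toy.counts <- aggregate(n ~ USState + YearMonth, data = toy, FUN = sum)
print(toy.counts)
```

As with the `ddply` call above, state and month combinations that have no rows do not appear in the result at all, which is why the zero-filling step is still needed afterwards.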


Problem 6: Handling missing values

## Months with no sightings must be filled in with 0

## Create a monthly date sequence

date.range<-seq.Date(from=as.Date(min(ufo.us$DateOccurred)),to=as.Date(max(ufo.us$DateOccurred)),by="month")

head(date.range)

[1] "1990-01-01" "1990-02-01" "1990-03-01" "1990-04-01" "1990-05-01" "1990-06-01"

## Drop the day so the format matches YearMonth

date.strings<-strftime(date.range,"%Y-%m")

head(date.strings)

[1] "1990-01" "1990-02" "1990-03" "1990-04" "1990-05" "1990-06"

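## lapply returns a list of the same length as X, where each element is the result of applying FUN to the corresponding element of X. Note that the result is a list!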

states.dates<-lapply(us.states,function(s)cbind(s,date.strings))

## Convert the list into a data frame

states.dates<-data.frame(do.call(rbind,states.dates),stringsAsFactors=FALSE)

## merge() joins two data frames; all=TRUE keeps unmatched rows from both sides

all.sightings<-merge(states.dates,sightings.counts,by.x=c("s","date.strings"),by.y=c("USState","YearMonth"),all=TRUE)

## Rename the columns

names(all.sightings)<-c("State","YearMonth","Sightings")

## Convert the dates to a proper date type so they are easy to work with numerically; avoid keeping them as strings!

## Set NA counts to 0

all.sightings$Sightings[is.na(all.sightings$Sightings)]<-0

## Store the dates as Date objects

all.sightings$YearMonth<-as.Date(rep(date.range,length(us.states)))

## Convert the state strings to upper case and store them as a factor

all.sightings$State<-as.factor(toupper(all.sightings$State))
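
The zero-filling idea above can be seen on a small scale in the following sketch. It assumes made-up inputs (`toy.states`, `toy.months`, `counts`) and builds the full grid with `expand.grid` instead of the `lapply`/`cbind`/`rbind` combination used in this post; the join and NA replacement are the same technique.

```r
toy.states <- c("mo", "wa")
toy.months <- c("1990-01", "1990-02")

# Every (state, month) combination, one row each
grid <- expand.grid(State = toy.states, YearMonth = toy.months,
                    stringsAsFactors = FALSE)

# Hypothetical observed counts; ("mo", "1990-02") is deliberately missing
counts <- data.frame(State = c("mo", "wa", "wa"),
                     YearMonth = c("1990-01", "1990-01", "1990-02"),
                     Sightings = c(1, 2, 1),
                     stringsAsFactors = FALSE)

# Left join keeps every grid row; unmatched rows get NA, which becomes 0
filled <- merge(grid, counts, by = c("State", "YearMonth"), all.x = TRUE)
filled$Sightings[is.na(filled$Sightings)] <- 0
print(filled)
```

By default `merge` sorts its result by the key columns, so the rows come out ordered by state and then by month. The `rep(date.range,length(us.states))` assignment above depends on exactly that ordering.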


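Summary: the data received an initial cleaning, with each location split into separate state and city columns. The date range was filled in completely, and NA counts were replaced with 0.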

The data was then given an initial grouping: sightings were counted separately for each state and each month.




https://m.sciencenet.cn/blog-785542-760045.html
