Problem description:

I have a system uploading data from Twitter's Streaming API to a MySQL database in two steps:

1) Data comes in from R to two separate files a day, an AM file and a PM file, via code adapted from streamR.

2) Twice a day, the data are parsed and organized in R, and then uploaded to a MySQL database using RMySQL.

There is a 2-3 day lag before the data show up in MySQL, and this lag seems to keep increasing. RMySQL returns "DONE" after uploading, but the data are not queryable until a few days later. My collaborators pride themselves on real-time analysis of tweet data, so this is a nuisance.

Parsing/uploading code:

library(streamR)
library(chron)
library(RMySQL)

tweets <- parseTweets("parse/after12.json")

# Clean data so that MySQL doesn't choke on quotes and control characters
cleanText <- function(text_vector){
  text_vector <- sapply(text_vector, function(x) gsub("\'", "\"", x))
  text_vector <- sapply(text_vector, function(x) gsub("[\b\t\n]", " ", x))
  return(text_vector)
}

tweets[, 1:length(tweets)] <- cleanText(tweets[, 1:length(tweets)])

# Change the date data to a format MySQL can read
tweets$created_at <- as.POSIXlt(tweets$created_at, format = "%a %b %d %X %z %Y", tz = "UTC")

# Establish connection to database
mgr   <- dbDriver("MySQL")
mycon <- dbConnect(mgr, user = "user", host = "name", dbname = "dbname", password = "pwd")

# Build the year_month suffix of the table to load into, e.g. "2014_05"
currentTime <- Sys.time()
currentDate <- unlist(lapply(strsplit(as.character(currentTime), " "), function(x) x[1]))
currentDate <- unlist(lapply(strsplit(as.character(currentDate), "-"), function(x) x[1:3]))
tableDate   <- paste(currentDate[1], currentDate[2], sep = "_")

dbWriteTable(mycon, paste("data_", tableDate, sep = ""), tweets, row.names = FALSE, append = TRUE)

# upload complete
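To see whether the rows are actually in the table the moment dbWriteTable returns (as opposed to only becoming visible days later), a quick row count right after the write would help. This is a sketch using the same `mycon` and `tableDate` as above; the count could then be compared with `nrow(tweets)`:

    # Sanity check immediately after the upload: how many rows does the
    # server report in the target table right now?
    tableName <- paste("data_", tableDate, sep = "")
    rowCount  <- dbGetQuery(mycon, paste("SELECT COUNT(*) AS n FROM", tableName))
    print(rowCount$n)

    dbDisconnect(mycon)

If the count already reflects the new rows here but a later query from another client does not, that would point at something between the server and the querying client rather than at the upload step itself.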

Also note that there is no indication that a process is running on the machine that hosts the MySQL database while this lag is happening.
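For completeness, here is roughly how I checked for server-side activity from R (the table name `data_2014_05` is just an example of one of the monthly tables):

    # List everything the MySQL server is currently doing; a long-running
    # thread here would indicate work still in progress.
    processes <- dbGetQuery(mycon, "SHOW FULL PROCESSLIST")
    print(processes)

    # Table status can also show whether the row count and update time
    # match what was just uploaded.
    status <- dbGetQuery(mycon, "SHOW TABLE STATUS LIKE 'data_2014_05'")
    print(status[, c("Name", "Engine", "Rows", "Update_time")])

Neither shows anything obviously pending when the lag occurs.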

Any suggestions would be appreciated!
