Problem description:

So I have a CSV file of about 380 MB. I created an empty data structure like this one:

{ ID1: { day1: [[flow, hour1], [flow, hour2] ... [flow, hour23]], day2: [...] ... day30: [...] }, ID2: ... }

I extract the data from the CSV and fill this structure with the loop below, which takes about 3 minutes. There are about 2000 IDs, each with 30 days, each day with 24 hours. But when I try to dump the filled structure to a JSON file, it takes hours and the output file's size exceeded 3 GB before I quit the script. Since JSON is supposed to be more compact, is this supposed to happen? I tried at a smaller scale (1000 entries) and it worked fine. Is there a good way to deal with this? Thank you.
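For concreteness, one entry of the intended structure would look roughly like this (the station ID, flow values and hours below are made up for illustration):

data = {
    402265: {                          # station ID
        1: [[235.0, 0], [198.0, 1]],   # day 1: [TotalFlow, hour] pairs
        2: [[210.0, 0], [180.0, 1]],   # day 2, and so on up to day 30
    },
}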

NOTE: 'stations' is a list of station IDs that row['ID'] is matched against.

import csv
import json, pprint, datetime, time

meta_f = open(metadata_path, 'rb')
meta_read = csv.DictReader(meta_f, delimiter='\t')
hour_f = open(hourly_path, 'r')
hour_read = csv.DictReader(hour_f, delimiter=',')

stations = []
no_coords = []
for i, row in enumerate(meta_read):
    if not row['Longitude'] or not row['Latitude']:
        no_coords.append(row['ID'])
    elif in_box(row, bound):
        stations.append(row['ID'])

data = {}
number_of_days = 30
days = {}
for i in range(1, number_of_days + 1):
    days[i] = []
for station in stations:
    data[int(station)] = days

with open('E:/pythonxy/Projects/UP/json_data.txt', 'wb') as f:
    json.dump({}, f)
f.close()

with open('E:/pythonxy/Projects/UP/json_data.txt', 'rb') as f:
    d = json.load(f)

#i=0
t0 = time.time()
for row in hour_read:
    #if i>1000:
    #    break
    if row['Station'] in stations:
        #print row['Station']
        t = datetime.datetime.strptime(row['TimeStamp'], '%m/%d/%Y %H:%M:%S')
        data[int(row['Station'])][int(t.day)] += [[row['TotalFlow'], t.hour]]
        #i+=1
    #print i
d.update(data)
print time.time() - t0

t0 = time.time()
with open('E:/pythonxy/Projects/UP/json_data.txt', 'wb') as f:
    json.dump(d, f)
f.close()
print time.time() - t0
print 'DONE'

Answer:
for station in stations:
    data[int(station)]=days

Every entry you create in data with this loop refers to the same dict as its value. That means every time you add something to any data[something] dict, you add it to all of them. The result when you dump it to a file is both wrong and huge. To avoid this, deep-copy the days dict for each station (a plain days.copy() would not be enough, because the per-day lists inside it would still be shared):

from copy import deepcopy
for station in stations:
    data[int(station)] = deepcopy(days)
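A quick way to see the aliasing, plus an equivalent fix that builds a fresh dict of fresh lists per station with a dict comprehension instead of deepcopy (the station IDs 101 and 102 below are just placeholders):

from copy import deepcopy

days = {1: [], 2: []}

# Aliasing: both stations point at the very same dict (and the same lists).
shared = {101: days, 102: days}
shared[101][1].append(['42.0', 0])
print(shared[102][1])    # [['42.0', 0]] -- the append shows up under every station

# Fix 1: deep copy, as above; the inner lists are copied too.
fixed = {station: deepcopy(days) for station in (101, 102)}

# Fix 2: build each station's day dict (and its lists) from scratch.
number_of_days = 30
fixed2 = {station: {day: [] for day in range(1, number_of_days + 1)}
          for station in (101, 102)}

fixed[101][1].append(['42.0', 0])
print(fixed[102][1])     # [] -- the stations no longer share state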
Answer:

Not really an answer per se, but JSON is actually much less compact than CSV. Take this example.

CSV:

X,Y,Z
1,2,3
4,5,6

JSON:

[{"X":1,"Y":2,"Z":3},{"X":4,"Y":5,"Z":6}]

That's 17 bytes for CSV and 41 for JSON, because every key name is repeated in every record!
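If you do stay with JSON, one small thing that helps (independent of the shared-dict bug above) is dumping with compact separators, since json.dump adds a space after every comma and colon by default. A minimal sketch:

import json

record = {'X': 1, 'Y': 2, 'Z': 3}
print(len(json.dumps(record)))                         # 24 with the default separators
print(len(json.dumps(record, separators=(',', ':'))))  # 20 with the whitespace stripped

It does not change the fact that every key is repeated per record, but it shaves a constant fraction off the output size.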
