Problem description:

When I try to load multiple files from cloud storage, larger jobs almost always fail. Loading an individual file works, but loading batches is much more convenient.

Snippet:

Recent Jobs

Load 11:24am

gs://albertbigquery.appspot.com/uep/201409/01/wpc_5012_20140901_0002.log.gz to albertbigquery:uep.201409

Load 11:23am

gs://albertbigquery.appspot.com/uep/201409/01/wpc_5012_20140901_0001.log.gz to albertbigquery:uep.201409

Load 11:22am

gs://albertbigquery.appspot.com/uep/201409/01/* to albertbigquery:uep.201409

Errors:

File: 40 / Line:1 / Field:1, Bad character (ASCII 0) encountered: field starts with: <�>

File: 40 / Line:2 / Field:1, Bad character (ASCII 0) encountered: field starts with: <5C���>}�>

File: 40 / Line:3 / Field:1, Bad character (ASCII 0) encountered: field starts with: <����W�o�>

File: 40 / Line:4, Too few columns: expected 7 column(s) but got 2 column(s). For additional help:

File: 40 / Line:5, Too few columns: expected 7 column(s) but got 1 column(s). For additional help:

File: 40 / Line:6, Too few columns: expected 7 column(s) but got 1 column(s). For additional help:

File: 40 / Line:7, Too few columns: expected 7 column(s) but got 1 column(s). For additional help:

File: 40 / Line:8 / Field:1, Bad character (ASCII 0) encountered: field starts with: <��hy�>

The worst part of this problem is that I don't know which file is "File: 40"; the order seems random. Otherwise I could remove that file and load the rest, or try to find the error in the file.

I also strongly doubt that there is an actual error in the files. For example, in the case above, when I removed all files except _0001 and _0002 (which loaded fine as single files), I still get this output:

Recent Jobs

Load 11:44am

gs://albertbigquery.appspot.com/uep/201409/01/* to albertbigquery:uep.201409

Errors:

File: 1 / Line:1 / Field:1, Bad character (ASCII 0) encountered: field starts with: <�>

File: 1 / Line:2 / Field:3, Bad character (ASCII 0) encountered: field starts with:

File: 1 / Line:3, Too few columns: expected 7 column(s) but got 1 column(s). For additional help:

File: 1 / Line:4 / Field:3, Bad character (ASCII 0) encountered: field starts with:

Sometimes, though, the files load just fine; otherwise I'd expect multiple-file loading to be completely broken.

Info:

Average file size is around 20 MB; a directory usually holds about 70 files, totaling somewhere between 1 and 2 GB.

Answer:

It looks like you're hitting a BigQuery bug.

When BigQuery gets a load job request with a wildcard pattern (e.g. gs://foo/bar*), we first expand the pattern into the list of matching files. Then we read the first one to determine the compression type.

One oddity with GCS is that there isn't a real concept of a directory. That is, gs://foo/bar/baz.csv is really bucket 'foo', object 'bar/baz.csv'. It looks like you have empty files as placeholders for your directories (as in gs://albertbigquery.appspot.com/uep/201409/01/).
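You can see the placeholder by listing the objects under that prefix. Here is a rough sketch using the google-cloud-storage Python client (the bucket and prefix are taken from your job; spotting the placeholder by its trailing slash and zero size is just an illustration):

    from google.cloud import storage

    client = storage.Client()

    # List everything gs://albertbigquery.appspot.com/uep/201409/01/* would match.
    blobs = client.list_blobs("albertbigquery.appspot.com", prefix="uep/201409/01/")

    for blob in blobs:
        # The zero-byte object whose name ends in "/" is the directory placeholder.
        marker = "  <-- directory placeholder" if blob.name.endswith("/") and blob.size == 0 else ""
        print(f"{blob.size:>12}  gs://albertbigquery.appspot.com/{blob.name}{marker}")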

This empty file doesn't play nicely with BigQuery's probe for the compression type, since when we expand the file pattern, the directory dummy file is the first thing that gets returned. We then open the dummy file, and it doesn't appear to be a gzip file, so we assume the compression type of the entire load is uncompressed.
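The probe itself is internal to BigQuery, but the effect is easy to illustrate: the placeholder sorts first under the prefix, and its first bytes are not the gzip magic number. A sketch of that idea (the magic-byte check below is only an illustration, not our actual code):

    from google.cloud import storage

    GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream

    client = storage.Client()
    blobs = list(client.list_blobs("albertbigquery.appspot.com", prefix="uep/201409/01/"))

    # The directory placeholder sorts first, so a "probe the first file" check sees no gzip header.
    first = blobs[0]
    head = first.download_as_bytes(start=0, end=1) if first.size else b""
    if head == GZIP_MAGIC:
        print(f"{first.name}: looks gzipped")
    else:
        print(f"{first.name}: no gzip header, so the whole load is treated as uncompressed")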

We've filed a bug and have a fix under testing; hopefully the fix will be out next week. In the meantime, your options are to expand the pattern yourself, to use a longer pattern that won't match the directory placeholder (as in gs://albertbigquery.appspot.com/uep/201409/01/wpc*), or to delete the dummy directory file.
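For example, here is a rough sketch of the first and last options using the Python clients. The destination table is taken from your job; the CSV source format and the rest of the load settings are assumptions you'd replace with whatever you already use:

    from google.cloud import bigquery, storage

    gcs = storage.Client()
    bq = bigquery.Client()

    bucket_name = "albertbigquery.appspot.com"
    prefix = "uep/201409/01/"

    # Option 1: expand the pattern yourself and pass explicit URIs, skipping the placeholder.
    uris = [
        f"gs://{bucket_name}/{blob.name}"
        for blob in gcs.list_blobs(bucket_name, prefix=prefix)
        if blob.size > 0 and not blob.name.endswith("/")
    ]

    # Mirror your existing load settings here; only the source format is shown.
    job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.CSV)
    job = bq.load_table_from_uri(uris, "albertbigquery.uep.201409", job_config=job_config)
    job.result()  # waits for the load and raises on error

    # Option 3: delete the dummy directory object so the original wildcard works again.
    # gcs.bucket(bucket_name).blob(prefix).delete()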
