问题描述:

I'm importing a table from mysql to hive using Sqoop. Some columns are latin1 encoded. Is there any way to do either:

  1. Set the encoding for those columns as latin1 in Hive. OR
  2. Convert the columns to utf-8 while importing with sqoop?

网友答案:

In Hive --default-character-set is used to set the character set for whole database not specific to few columns. I was not able to find Sqoop parameter which will convert tables columns to utf-8 in fly rather the columns are expected to set type fixed.

$ sqoop import --connect jdbc:mysql://server.foo.com/db --table bar \
--direct -- --default-character-set=latin1

I believe you would need to convert Latin1 columns to utf-8 first in your MySql and then you can import from Sqoop. You can use the following script to convert the all the columns into utf-8, which I found here.

mysql --database=dbname -B -N -e "SHOW TABLES" | \
awk '{print "ALTER TABLE", $1, "CONVERT TO CHARACTER SET utf8 COLLATE \
utf8_general_ci;"}' | mysql --database=dbname &
网友答案:

Turned out the problem was unrelated. The column works fine regardless of encoding...but the table's schema had changed in mysql. I assumed that since I'm passing in the overwrite flag, sqoop would remake the table every time in Hive. Not so! The schema changes in mysql didn't get transferred to Hive, so the data in the md5 column was actually data from a different column.

The "fix" we settled on was, before every sqoop import check for schema changes, and if there was a change, drop the table and re-import. This forces a schema update in Hive.

Edit: my original sqoop command was something like:

sqoop import --connect jdbc:mysql://HOST:PORT/DB --username USERNAME --password PASSWORD --table uploads --hive-table uploads --hive-import --hive-overwrite --split-by id --num-mappers 8 --hive-drop-import-delims --null-string '\\N' --null-non-string '\\N'

But now I manually issue a drop table uploads to hive first if the schema changes.

相关阅读:
Top