Given JSON data:
{
  "Person": {
    "FirstName": "Piyush",
    "LastName": "Routray",
    "date": [1990, 10, 28]
  },
  "Friends": [{
    "Name": "Vaibhav",
    "State": "UP"
  }, {
    "Name": "Salah",
    "State": "Kerala"
  }, {
    "Name": "Bilal",
    "State": "Odisha"
  }]
} {
  "Person": {
    "FirstName": "Ananya",
    "LastName": "Mohanty",
    "date": [1990, 9, 4]
  },
  "Friends": [{
    "Name": "Piyush",
    "State": "Odisha"
  }, {
    "Name": "Sarmi",
    "State": "Odisha"
  }, {
    "Name": "Shah Jahan",
    "State": "Odisha"
  }]
}
Start by validating each JSON document at http://jsonlint.com/.
Then, before importing the data into a Hive table, remove the line breaks inside each document. Hadoop creates input splits and records based on line breaks, so a JSON document that spans multiple lines gets cut across records. This hurts readability, but each document must sit on a single line for the import to work.
Here is a sample script I used for this:
import os
import re

directory = input('Enter the directory path (e.g. ./2016): ')
os.makedirs('NewFolder', exist_ok=True)

for root, dirs, files in os.walk(directory):
    for name in files:
        path = os.path.join(root, name)
        with open(path) as f:
            data = f.read()
        # Collapse each JSON document onto a single line, then put one
        # document per line. Note: stripping every space (replace(" ", ""))
        # would also corrupt values such as "Shah Jahan", so only newlines
        # and tabs are removed; a regex handles whitespace between documents.
        data = data.replace('\n', '').replace('\t', '')
        data = re.sub(r'}\s*{', '}\n{', data)
        with open(os.path.join('NewFolder', name), 'w') as f:
            f.write(data)
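As a quick sanity check, the one-document-per-line idea can be exercised on an in-memory string (a hypothetical two-record sample; the regex variant shown here also handles leftover whitespace between concatenated documents):

```python
import json
import re

# Two concatenated JSON documents, as in the raw data files.
raw = '{\n  "a": 1\n} {\n  "a": 2\n}'

# Flatten, then split back into one document per line.
flat = re.sub(r'}\s*{', '}\n{', raw.replace('\n', '').replace('\t', ''))

# Each resulting line should now parse as standalone JSON.
for line in flat.split('\n'):
    json.loads(line)

print(flat.count('\n') + 1)  # number of records: 2
```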
Upload the cleaned files to HDFS. In our case, assume they land in '/tmp/proutray/jsonfolder/'.
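The upload might look like this (a hypothetical transcript, assuming the cleaned files are in ./NewFolder and the hdfs user has write access to /tmp):

```
[hdfs@ip-10-0-3-67 ~]$ hdfs dfs -mkdir -p /tmp/proutray/jsonfolder
[hdfs@ip-10-0-3-67 ~]$ hdfs dfs -put NewFolder/* /tmp/proutray/jsonfolder/
```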
Now, download and save the JSON SerDe jar matching your Hive version from here.
For ease, I assume we are in the folder where the jar was downloaded. Here is the output from the terminal:
[hdfs@ip-10-0-3-67 ~]$ ls
json-serde-1.1.9.9-Hive1.2-jar-with-dependencies.jar
[hdfs@ip-10-0-3-67 ~]$ hive
hive> add jar json-serde-1.1.9.9-Hive1.2-jar-with-dependencies.jar;
Added [json-serde-1.1.9.9-Hive1.2-jar-with-dependencies.jar] to class path
Added resources: [json-serde-1.1.9.9-Hive1.2-jar-with-dependencies.jar]
hive>
CREATE EXTERNAL TABLE IF NOT EXISTS jsonTable (
  Person STRUCT<FirstName: STRING, LastName: STRING, date: ARRAY<INT>>,
  Friends ARRAY<STRUCT<Name: STRING, State: STRING>>
) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/tmp/proutray/jsonfolder/';
OK
Time taken: 0.193 seconds
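With the table in place, the nested fields can be queried directly. A sample query (field names taken from the JSON above; the exact rows returned depend on your data) might look like:

```
hive> SELECT Person.FirstName, Friends[0].Name FROM jsonTable;
```

Struct fields are addressed with dot notation (Person.FirstName) and array elements with zero-based indexing (Friends[0]), so no flattening is needed to reach the nested values.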