My Big Data Journey with Cloudera

  1. Structured Data Ingestion using Apache Sqoop
  2. Unstructured / Semi-Structured Data Ingestion using Apache Kafka / Flume.
Fig. 1
Fig. 2
[cloudera@quickstart ~]$ mysql -uroot -pcloudera
Fig. 3
mysql > show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| cm |
| firehose |
| hue |
| metastore |
| mysql |
| nav |
| navms |
| oozie |
| retail_db |
| rman |
| sentry |
+--------------------+
12 rows in set (0.00 sec)
mysql > use retail_db;mysql > show tables;
Fig. 4
[cloudera@quickstart ~]$ sqoop list-databases \
> --connect jdbc:mysql://quickstart:3306/ \
> --username root \
> --password cloudera
Fig. 5. Note: You should use -P instead of --password in command-line.
[cloudera@quickstart ~]$ sqoop list-tables \
> --connect jdbc:mysql://quickstart:3306/retail_db \
> --username root \
> --password cloudera
Fig. 6 Note: Same List as Fig. 4. You should use -P instead of — password in command-line.
[cloudera@quickstart ~]$ sqoop import \
> --connect jdbc:mysql://quickstart:3306/retail_db \
> --username root \
> --password cloudera \
> --table customers
Fig. 7
Fig. 8 The transfer speed will largely depend on the number of nodes on which the job is running. Here I am using a virtual machine which is for demonstration purposes only.
Fig. 9
[cloudera@quickstart ~]$ hadoop fs -ls
Fig. 10
Fig. 11
[cloudera@quickstart ~]$ sqoop import \
> --connect jdbc:mysql://quickstart:3306/retail_db \
> --username root \
> --password cloudera
> --table customers
> --target-dir cust1
....[cloudera@quickstart ~]$ hadoop fs -ls
Fig. 12
[cloudera@quickstart ~]$ sqoop import \
> --connect jdbc:mysql://quickstart:3306/retail_db \
> --username root \
> --password cloudera \
> --table customers \
> --target-dir customerdata \
> --m 5 \
> --where "customer_id>12430"
Fig. 13
[cloudera@quickstart ~]$ hadoop fs -cat /user/cloudera/customerdata/part-*
Fig. 14. Note: As shown in Fig. 13. Splits = 5.
SYNTAX: sqoop import \
--query 'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE $CONDITIONS' --split-by a.id --target-dir /user/foo/joinresults
EXAMPLE:[cloudera@quickstart ~]$ sqoop import \
> --query 'select a.*, b.* from products as a, categories as b where a.product_category_id=b.category_id AND $CONDITIONS limit 5' \
> -m1 \
> --target-dir customprodcats
> --connect jdbc:mysql://quickstart:3306/retail_db \
> --username root
> --password cloudera
20/10/09 14:58:57 INFO mapreduce.ImportJobBase: Transferred 761 bytes in 70.8253 seconds (10.7447 bytes/sec)
20/10/09 14:58:57 INFO mapreduce.ImportJobBase: Retrieved 5 records.
[cloudera@quickstart ~]$ sqoop import-all-tables \
-m 1 \
--connect jdbc:mysql://quickstart:3306/retail_db \
--username=retail_dba \
--password=cloudera \
--compression-codec=snappy \
--as-parquetfile \
--warehouse-dir=/user/hive/warehouse \
--hive-import
[cloudera@quickstart ~]$ hadoop fs -ls /user/hive/warehouse/
[cloudera@quickstart ~]$ hadoop fs -ls /user/hive/warehouse/categories/
mysql> create database empdb;Query OK, 1 row affected (0.01 sec)mysql> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| cm |
| empdb |
| firehose |
| hue |
| metastore |
| mysql |
| nav |
| navms |
| oozie |
| retail_db |
| rman |
| sentry |
+--------------------+
13 rows in set (0.00 sec)
mysql> CREATE TABLE employees (
-> emp_id int not null,
-> emp_name varchar(240)
-> );
Query OK, 0 rows affected (0.12 sec)mysql> show tables;
+-----------------+
| Tables_in_empdb |
+-----------------+
| employees |
+-----------------+
1 row in set (0.01 sec)
mysql> insert into employees values(1, 'Tutorial Purposes');
Query OK, 1 row affected (0.04 sec)
mysql> insert into employees values(2, 'Tutorial 2 Purposes');
Query OK, 1 row affected (0.03 sec)
[cloudera@quickstart ~]$ sqoop import \
> --connect jdbc:mysql://quickstart:3306/empdb \
> --username root \
> --password cloudera \
> --table employees \
> -m1
[cloudera@quickstart ~]$ hadoop fs -ls
Found 4 items
drwxr-xr-x - cloudera cloudera 0 2020-10-08 16:04 cust1
drwxr-xr-x - cloudera cloudera 0 2020-10-08 16:14 customerdata
drwxr-xr-x - cloudera cloudera 0 2020-10-08 15:35 customers
drwxr-xr-x - cloudera cloudera 0 2020-10-09 04:18 employees
[cloudera@quickstart ~]$ hadoop fs -ls employees
Found 2 items
-rw-r--r-- 1 cloudera cloudera 0 2020-10-09 04:18 employees/_SUCCESS
-rw-r--r-- 1 cloudera cloudera 41 2020-10-09 04:18 employees/part-m-00000
mysql> insert into employees values(3, 'Tutorial3 Purposes'), (4, 'Tuts 4'), (5, 'Tuts 5');
Query OK, 3 rows affected (0.04 sec)
Records: 3 Duplicates: 0 Warnings: 0
[cloudera@quickstart ~]$ sqoop import \
> --connect jdbc:mysql://quickstart:3306/empdb \
> --username root \
> --password cloudera \
> --table employees \
> --icremental append \
> --check-column emp_id \
> --last-value 2 \
> -m1
FIg. 15
[cloudera@quickstart ~]$ hadoop fs -ls employees
Found 3 items
-rw-r--r-- 1 cloudera cloudera 0 2020-10-09 04:18 employees/_SUCCESS
-rw-r--r-- 1 cloudera cloudera 41 2020-10-09 04:18 employees/part-m-00000
-rw-r--r-- 1 cloudera cloudera 39 2020-10-09 04:39 employees/part-m-00001
[cloudera@quickstart ~]$ hadoop fs -cat employees/part*
1,Tutorial Purposes
2,Tutorial2 Purposes
3,Tutorial3 Purposes
4,Tuts 4
5,Tuts 5
[cloudera@quickstart ~]$
[cloudera@quickstart ~]$ sqoop import-all-tables \
> --connect jdbc:mysql://quickstart:3306/retail_db \
> --username root \
> --password cloudera \
> --warehouse-dir all_retail
[cloudera@quickstart ~]$ sqoop job --create himyjob -- import  --connect jdbc:mysql://quickstart:3306/empdb --username root --password cloudera --table employees --incremental append --check-column emp_id --last-value 0 -m1
20/10/09 15:29:40 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.12.0
20/10/09 15:29:44 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
[cloudera@quickstart ~]$ sqoop job --list
20/10/09 15:31:28 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.12.0
Available jobs:
himyjob
[cloudera@quickstart ~]$ sqoop job --exec himyjob
20/10/09 15:38:30 INFO tool.ImportTool: Saving incremental import state to the metastore
20/10/09 15:38:31 INFO tool.ImportTool: Updated data for job: himyjob
[cloudera@quickstart ~]$ hadoop fs -ls employees
Found 1 items
-rw-r--r-- 1 cloudera cloudera 80 2020-10-09 15:38 employees/part-m-00000
[cloudera@quickstart ~]$ hadoop fs -cat employees/part*
3,Tutorial3 Purposes
4,Tuts 4
5,Tuts 5
1,Tutorial Purposes
2,Tutorial2 Purposes
mysql> insert into employees values(6, ‘Tuts 6’), (7, ‘Tuts 7’), (8, ‘Tuts 8’);[cloudera@quickstart ~]$ sqoop job --exec himyjob
...
...
20/10/09 15:45:47 INFO mapreduce.ImportJobBase: Transferred 27 bytes in 69.3158 seconds (0.3895 bytes/sec)
20/10/09 15:45:47 INFO mapreduce.ImportJobBase: Retrieved 3 records.
20/10/09 15:45:47 INFO util.AppendUtils: Appending to directory employees
20/10/09 15:45:47 INFO util.AppendUtils: Using found partition 1
20/10/09 15:45:47 INFO tool.ImportTool: Saving incremental import state to the metastore
20/10/09 15:45:48 INFO tool.ImportTool: Updated data for job: himyjob
[cloudera@quickstart ~]$ hadoop fs -ls employees
Found 2 items
-rw-r--r-- 1 cloudera cloudera 80 2020-10-09 15:38 employees/part-m-00000
-rw-r--r-- 1 cloudera cloudera 27 2020-10-09 15:45 employees/part-m-00001
[cloudera@quickstart ~]$ hadoop fs -cat employees/part*
3,Tutorial3 Purposes
4,Tuts 4
5,Tuts 5
1,Tutorial Purposes
2,Tutorial2 Purposes
6,Tuts 6
7,Tuts 7
8,Tuts 8
mysql> truncate table employees;
Query OK, 0 rows affected (0.05 sec)
mysql> select * from employees;
Empty set (0.00 sec)
[cloudera@quickstart ~]$ sqoop export \
> --connect jdbc:mysql://quickstart:3306/empdb \
> --username root \
> --password cloudera \
> --table employees \
> --export-dir employees
20/10/09 06:42:42 INFO mapreduce.ExportJobBase: Transferred 764 bytes in 100.6668 seconds (7.5894 bytes/sec)
20/10/09 06:42:42 INFO mapreduce.ExportJobBase: Exported 5 records.
mysql> select * from employees;
+--------+--------------------+
| emp_id | emp_name |
+--------+--------------------+
| 3 | Tutorial3 Purposes |
| ..| ........ |
+--------+--------------------+
8 rows in set (0.00 sec)

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store