How to deal with the requirement to order a DataFrame using more than one column simultaneously? Also, consider some values … More
Author: Piyush
How to retain the first row of each ‘group’ in a PySpark DataFrame?
For a given DataFrame with multiple occurrences of a particular column value, one may desire to retain only one (or … More
Installing ‘s3cmd’ on CentOS 7+
Amazon S3 is one of the oldest ‘object storage’ offerings from AWS, and when working through a terminal, … More
How to split contents of a column into *only two* by a delimiter?
While working with PySpark, I came across a requirement where data in a column had to be split using delimiters … More
Change Java for Cloudera cluster using Ambari
If you are using a Hortonworks (presently Cloudera) cluster, the default value for Java is Oracle JDK 1.8. If you have … More
How to encrypt files before sharing online?
As a Data Engineer, one faces the need to share files securely over the internet. An easy way of doing … More
Optimized Import of Text Data for Analytics
Text data analysis is a staple use case for the Data Analytics world! There are multiple firms which enable capture … More
How to copy a file, recreating the directory structure, using Python?
To copy “Folder1/Folder2/file1” to “Folder3”
source structure:
Folder1/FolderA (not to be copied)
Folder1/fileX (not to be copied)
Folder1/Folder2/file1
desired destination structure:
Folder3/Folder1/Folder2/file1
P.S. … More
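One possible way to do this is sketched below, using only the Python standard library. It is a minimal sketch and not necessarily the approach taken in the full post: the function name `copy_with_structure` is hypothetical, and it assumes the source is given as a path relative to the current working directory.

```python
import shutil
from pathlib import Path


def copy_with_structure(src_file, dest_root):
    """Copy src_file under dest_root, recreating its relative directory path.

    Hypothetical helper for illustration; assumes src_file is a relative path.
    """
    src = Path(src_file)
    if src.is_absolute():
        # An absolute path would replace dest_root when the two are joined.
        raise ValueError("expected a relative source path")

    # Folder3 / Folder1/Folder2/file1 -> Folder3/Folder1/Folder2/file1
    target = Path(dest_root) / src

    # Create Folder3/Folder1/Folder2 if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)

    # copy2 copies the file contents and tries to preserve metadata.
    shutil.copy2(src, target)
    return target
```

Called as `copy_with_structure("Folder1/Folder2/file1", "Folder3")` from the directory that contains Folder1, this copies only file1 and leaves Folder1/FolderA and Folder1/fileX untouched.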