Pig Latin: parsing CSV files with quoted commas

In the not too distant past, I was working on a big data engagement using Apache Pig. I took CSV parsing for granted and expected it to just work. However, if you have quoted strings that contain commas, it won't behave as you'd expect. Take this one-line input file:


1,"This is a sample sentence, same sentence, just happens to include a few commas" 

When you use:

load 'input/oneLiner.txt' using PigStorage(',') 

It splits on every comma, whether or not the comma sits inside a quoted string, so you end up with 4 fields: the leading 1, followed by the sentence broken into three pieces:

This is a sample sentence
same sentence
just happens to include a few commas

The solution to this is to use a custom loader, such as org.apache.pig.piggybank.storage.CSVExcelStorage().
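
As a rough sketch of what that looks like in a script: the jar path and the field names (id, sentence) below are placeholders of my own, not anything from the original job, so adjust them to your setup. With the loader's default settings the sample line should come back as two fields rather than four.

```pig
-- Assumption: piggybank.jar has already been built; this path is hypothetical.
REGISTER /path/to/piggybank.jar;

-- CSVExcelStorage keeps commas that sit inside double quotes as part of the field.
-- The field names and types here are illustrative only.
rows = LOAD 'input/oneLiner.txt'
       USING org.apache.pig.piggybank.storage.CSVExcelStorage()
       AS (id:int, sentence:chararray);

DUMP rows;
```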

To get started with this, I had to clone the piggybank repository (a collection of user-defined functions; why it didn't make it into the base release I'm not entirely sure) and build it from source. Unfortunately I didn't keep any notes on the build, but it's relatively straightforward; see the Apache Pig wiki page on piggybank.