Terminal - join command

From NoskeWiki

About

This page is a child of: Terminal commands


The "join" command in unix / bash takes two sorted text files as input, typically sorted with the "sort" command, and writes one output line for each pair of lines that share a join field. Here's an example of how it works, joining two comma-separated files.


Join Example

Our input files:

$ cat person_gender.csv
Person,Gender
Jared,m
Lina,f
Andrew,m
Zorro,m

$ cat person_pet.csv
Person,Pet
Andrew,dog
Lina,bird
Jared,rabbit
Jared,fish

Now make sorted copies (on 1st column):

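$ sort -t , -k 1,1 person_gender.csv > person_gender_SORTED.csv
$ sort -t , -k 1,1 person_pet.csv > person_pet_SORTED.csv
  • The -t , says to use commas as the delimiter.
  • The -k 1,1 says to sort on the first field only. A plain -k 1 would use everything from the first field to the end of the line as the key.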
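If you want the header row to stay on top, one option is to sort only the data rows. This is a minimal sketch, assuming a POSIX shell with head, tail and mktemp available. The scratch directory and the LC_ALL=C setting are my own choices for the illustration, not part of the example above.

```shell
# Recreate one of the input files in a scratch directory (assumed setup).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'Person,Gender' 'Jared,m' 'Lina,f' 'Andrew,m' 'Zorro,m' > person_gender.csv

# Print the header line untouched, then sort only the remaining rows on field 1.
# LC_ALL=C forces plain byte-order comparison so the result does not depend on locale.
{
  head -n 1 person_gender.csv
  tail -n +2 person_gender.csv | LC_ALL=C sort -t , -k 1,1
} > person_gender_SORTED.csv

cat person_gender_SORTED.csv
```

The header stays as the first line and the names below it come out in sorted order.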

Now join them on common column:

$ join -t, -1 1 -2 1 person_gender_SORTED.csv person_pet_SORTED.csv > merged_sorted.csv
$ cat merged_sorted.csv
Andrew,m,dog
Jared,m,fish
Jared,m,rabbit
Lina,f,bird
Person,Gender,Pet
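  • Notice the sorting step moves your header row (here it ends up last). One way around that is to sort only the data rows and pass --header to join, which prints the first line of each file without trying to pair it.
  • Notice Zorro got dropped because he doesn't have a pet. You could add -a 1 to also print unpaired lines from the first file.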
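Both notes can be handled in one command. This is a sketch, assuming GNU join (the --header option appears in the help text on this page but may be missing from other implementations). The file names, the scratch directory and the placeholder word "none" are made up for the illustration.

```shell
# Inputs with the header first and the data rows already sorted on column 1.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'Person,Gender' 'Andrew,m' 'Jared,m' 'Lina,f' 'Zorro,m' > gender_sorted.csv
printf '%s\n' 'Person,Pet' 'Andrew,dog' 'Jared,fish' 'Jared,rabbit' 'Lina,bird' > pet_sorted.csv

# --header       keep the first line of each file as a header instead of pairing it
# -a 1           also print lines from file 1 that have no match in file 2
# -o 0,1.2,2.2   output the join field, then field 2 of file 1, then field 2 of file 2
# -e none        fill any missing output field with the word "none"
LC_ALL=C join --header -t , -a 1 -e none -o 0,1.2,2.2 \
  gender_sorted.csv pet_sorted.csv > merged.csv

cat merged.csv
```

The header should come out first, and Zorro should appear as the last row with "none" in the Pet column. The -e value only takes effect here because -o gives join a fixed output layout to fill in.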


Man Page

$ join --help
Usage: join [OPTION]... FILE1 FILE2
For each pair of input lines with identical join fields, write a line to
standard output.  The default join field is the first, delimited
by whitespace.  When FILE1 or FILE2 (not both) is -, read standard input.

  -a FILENUM        also print unpairable lines from file FILENUM, where
                      FILENUM is 1 or 2, corresponding to FILE1 or FILE2
  -e EMPTY          replace missing input fields with EMPTY
  -i, --ignore-case  ignore differences in case when comparing fields
  -j FIELD          equivalent to '-1 FIELD -2 FIELD'
  -o FORMAT         obey FORMAT while constructing output line
  -t CHAR           use CHAR as input and output field separator
  -v FILENUM        like -a FILENUM, but suppress joined output lines
  -1 FIELD          join on this FIELD of file 1
  -2 FIELD          join on this FIELD of file 2
  --check-order     check that the input is correctly sorted, even
                      if all input lines are pairable
  --nocheck-order   do not check that the input is correctly sorted
  --header          treat the first line in each file as field headers,
                      print them without trying to pair them
      --help     display this help and exit
      --version  output version information and exit

Unless -t CHAR is given, leading blanks separate fields and are ignored,
else fields are separated by CHAR.  Any FIELD is a field number counted
from 1.  FORMAT is one or more comma or blank separated specifications,
each being 'FILENUM.FIELD' or '0'.  Default FORMAT outputs the join field,
the remaining fields from FILE1, the remaining fields from FILE2, all
separated by CHAR.  If FORMAT is the keyword 'auto', then the first
line of each file determines the number of fields output for each line.

Important: FILE1 and FILE2 must be sorted on the join fields.
E.g., use "sort -k 1b,1" if 'join' has no options,
or use "join -t ''" if 'sort' has no options.
Note, comparisons honor the rules specified by 'LC_COLLATE'.
If the input is not sorted and some lines cannot be joined, a
warning message will be given.
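The sorting requirement and the "-" argument from the help text above can be combined so that one of the sorted copies never has to be written to disk. This is a rough sketch, assuming header-less input files. The file names and the scratch directory are hypothetical, and exporting LC_ALL=C is simply one way to make sort and join agree on the collation order.

```shell
# Unsorted, header-less inputs in a scratch directory (assumed setup).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'Jared,m' 'Lina,f' 'Andrew,m' 'Zorro,m' > gender.csv
printf '%s\n' 'Andrew,dog' 'Lina,bird' 'Jared,rabbit' 'Jared,fish' > pet.csv

# Use the same collation rules for both sort and join.
export LC_ALL=C

# Sort the second file to disk, then pipe the sorted first file into join.
# The "-" tells join to read FILE1 from standard input.
sort -t , -k 1,1 pet.csv > pet_sorted.csv
sort -t , -k 1,1 gender.csv | join -t , - pet_sorted.csv > merged.csv

cat merged.csv
```

Only one of the two inputs can be "-", so the other one still needs to be a real file.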

Links